A cure for cancer? Fighting the climate crisis? Ask the data scientists

08. 09. 2025

The days of the “IT crowd” banished to the basement are long gone. Take bioinformaticians, for example – they’ve moved up to the top floors and become an integral part of research teams. They’re well versed in AI tools and know their way around data. “Thanks to them, we’re able to better predict and tackle threats like climate change or global epidemics such as COVID-19,” says Jiří Vondrášek from the Institute of Organic Chemistry and Biochemistry of the CAS, who heads the Czech branch of the ELIXIR infrastructure for bioinformatics.

The Nobel Prizes in Chemistry and Physics 2024 go to… artificial intelligence. That’s how last year’s decision of the Royal Swedish Academy of Sciences could be summed up in a nutshell. In both scientific fields, the prestigious award went to researchers who pushed the boundaries of AI and its applications in science. The Nobel Prize in Physics was awarded to John Hopfield and Geoffrey Hinton for developing machine learning based on artificial neural networks. Thanks to their work, we now take for granted things like translation apps on our phones – or getting answers to all manner of questions from ChatGPT.

But the story of the Nobel Prize in Chemistry 2024 is perhaps even more intriguing. It went to biochemist and bioinformatician David Baker for his contributions to designing new proteins, and to computer scientists Demis Hassabis and John Jumper from the tech company DeepMind for the computational tool and AI model AlphaFold. The Nobel Committee’s decision made it crystal clear: experts in IT (information technology) not only belong in science, but have enormous potential to drive it forward by leaps and bounds.

AlphaFold is an AI model that can predict, with astonishing accuracy, how a protein will fold – based solely on the sequence of its amino acids.

A data revolution in biology
Talk to any bioinformatician about their field, and sooner or later AlphaFold will come up. That’s because it really is a revolutionary tool, one that in just a few short years has managed to speed up and refine the determination of existing protein structures and has made it possible to design entirely new proteins.

A protein is a molecule that consists of a chain of amino acids. It’s one of the basic building blocks of all living things. Each protein has its own specific function, which depends on how the amino acid chain folds in three-dimensional space. For decades, scientists struggled to figure out how nature produces these structures. In the technical jargon, the challenge became known as the “protein folding problem.”

Gradually, biologists uncovered one protein structure after another – at first mainly thanks to methods like X-ray diffraction (determining the arrangement of atoms using X-rays) and nuclear magnetic resonance (NMR). Crucially, the results of these observations and experiments were recorded in databases (such as PDB or UniProt) that are freely accessible.

Protein
Proteins are the essential building blocks of all living organisms. They consist of amino acids and perform key functions such as building muscle, supporting immunity, and regulating hormones.

Decades of work have yielded information on more than 200 million protein sequences and 200,000 protein structures. That’s an enormous amount of data – far too much for the human brain to handle, but perfect fuel for artificial intelligence tools.

Feeding AI – the right way
“The breakthrough in predicting protein structures was possible only thanks to these databases. Without the public data that scientists painstakingly collected and then voluntarily shared with the global research community, tools like AlphaFold would never have come into existence,” explains Jiří Vondrášek from the Institute of Organic Chemistry and Biochemistry of the CAS.

Robust datasets serve as training material for AI, which uses them to fine-tune its algorithms. But in order to get good output from AI tools, you have to feed them quality input. “These days, you often hear about the concept of ‘garbage in, garbage out.’ What that means is that no model or system will ever be smarter than its inputs. In other words – even the best AI will spew nonsense if you feed it bad data,” Vondrášek adds.

Jiří Vondrášek from the Institute of Organic Chemistry and Biochemistry of the CAS. (CC)

So far, we’ve been talking about protein databases that contain millions of sequences and hundreds of thousands of molecular structures. But plenty of other datasets exist, too – for instance, DNA sequences collected from various environments (oceans, soils, the air, and so on). With the help of AI tools, these datasets can reveal entirely new organisms, processes, reactions, and compounds that had previously escaped human notice.

There are also databases of plant DNA sequences and of fungi. The Czech project GlobalFungi, for example, contains records of fungi from more than 80,000 locations around the world. Thanks to it, scientists estimate that Earth is home to some six million species of fungi. The number of databases keeps growing. Yet already at the turn of the millennium, it became clear that rules would be needed to keep them in order and to set global standards for their control, management, maintenance, and use.

ELIXIR as a guarantee of quality
About fifteen years ago, people started talking about the so-called data lifecycle – covering everything from data planning and data collection through processing, analysis, storage, sharing, and reuse. The aim was to make sure that funding spent on research wouldn’t go to waste – for instance, by avoiding duplication of data that someone else had already created.

BIOINFORMATICS – A FIELD FOR THE 21ST CENTURY

As recently as 2010, not a single Czech university offered a program in bioinformatics. The field has only really taken off here in the past decade, with interest among young people skyrocketing. Since 2017, it has been possible to study bioinformatics at Charles University and the University of Chemistry and Technology in Prague. In September 2024, Masaryk University in Brno launched a new program in bioinformatics, and the subject is now also taught at Palacký University in Olomouc, the Czech Technical University in Prague, and Brno University of Technology.

One of the founders of bioinformatics, Philip Bourne, estimated in 2016 that projects funded by the U.S. National Institutes of Health (NIH) had already generated 650 petabytes of data – yet only about 12 percent of it was available in NIH’s public archives. (For comparison: one petabyte could hold roughly 250 million standard smartphone photos.) The vast majority of the data could thus be labeled “dark” or lost. In response, the NIH spent 1.2 billion dollars in the second decade of the 21st century to support data archives and their management.
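The figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the 650 petabytes and 12 percent from the text, assumes the decimal definition of a petabyte (10^15 bytes), and assumes a photo size of 4 MB, which the article does not state. The function names are made up for this example.

```python
# Back-of-the-envelope check of the data volumes quoted in the text.
# Assumptions (not from the article): 1 PB = 10**15 bytes, 4 MB per photo.

PETABYTE_BYTES = 10**15


def public_share_pb(total_pb: int, public_percent: int) -> float:
    """Petabytes available in public archives, given a percentage."""
    return total_pb * public_percent / 100


def photos_per_petabyte(photo_size_bytes: int) -> int:
    """How many photos of the given size fit into one petabyte."""
    return PETABYTE_BYTES // photo_size_bytes


total_pb = 650                                # estimate quoted for NIH-funded projects
public_pb = public_share_pb(total_pb, 12)     # share held in public archives
dark_pb = total_pb - public_pb                # the remaining "dark" data

assumed_photo_bytes = 4_000_000               # hypothetical 4 MB smartphone photo
photos = photos_per_petabyte(assumed_photo_bytes)

print(f"Public: {public_pb:.0f} PB, dark: {dark_pb:.0f} PB")
print(f"Photos per petabyte at 4 MB each: {photos:,}")
```

The photo count scales inversely with the assumed photo size, so a smaller or larger typical photo changes the comparison proportionally.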

The bioinformatics community in Europe was thinking along the same lines. In 2013, it joined forces to form the ELIXIR infrastructure. One of its major contributions to global data management was the formulation of principles known by the acronym FAIR: findable, accessible, interoperable, reusable. In other words, the goal is to make scientific data easy to locate, available, capable of interconnection, and reusable in different contexts and formats.

“ELIXIR guarantees that all the data gathered in its databases carry a stamp of quality and can be trusted,” says Vondrášek, who heads the Czech node of the European infrastructure, ELIXIR CZ.

THE DATA PIPELINE

“We’re heading into a time when data will grow faster than you can imagine. Are you ready for that? If not, I invite you to join the initiative we’re preparing – and we hope as many European countries as possible will get on board,” Jiří Vondrášek paraphrases, some 15 years later, the words of Janet Thornton, the British structural biologist and director emeritus of the European Bioinformatics Institute in Cambridge. The initiative she was talking about was ELIXIR, the European infrastructure for sharing and managing biological data. The Czech Republic became one of its founding members in 2013. “Back then, bioinformatics as a field was only just starting here. Besides me, there was Jan Pačes from the Institute of Molecular Genetics of the CAS and a few other people from universities. At the beginning, we were quite an eclectic bunch,” the data scientist recalls with a smile. Each member country builds and maintains its own national data network – its own “node.” The Czech one is called ELIXIR CZ, which is celebrating its tenth anniversary in 2025.

Tackling pandemics – or cancer
Artificial intelligence can be a good servant but a bad master. After all, all the men mentioned at the start of this article – the Nobel laureates behind machine learning and AI tools – have publicly and openly warned of its risks and stressed the need for ethical boundaries in its use. Vondrášek takes the same view: data scientists are well aware of the dangers of misuse in the world of big data and pay close attention to security. At the same time, he believes that the creation of AlphaFold was only the beginning, and that AI has much more good to offer humanity. “I believe it will help us better predict and solve challenges such as climate threats or global epidemics like COVID,” the bioinformatician says.

ELIXIR can also be described figuratively as a kind of data pipeline, streaming information from countless directions and sources. The infrastructure helps researchers navigate data, provides training, offers cloud services, and facilitates collaborative data endeavors across Europe. Its potential is enormous. With new AI tools, it should become possible to use large datasets to develop, for example, drugs for rare diseases as well as genetic or cancer-related conditions.

“We don’t have to know in advance exactly what the end result will be. What we do know for sure is that we have quality feedstock for the fast-evolving tools of artificial intelligence,” Vondrášek concludes. In the near future, we may well see more Nobel Prizes awarded in the field of big data. One thing is certain: behind the groundbreaking discoveries will stand teams whose core members include experts in digital technologies and information resources – or, facetiously put, the IT crowd.

Prof. RNDr. Jiří Vondrášek, CSc.
Institute of Organic Chemistry and Biochemistry of the CAS

Jiří Vondrášek has always enjoyed searching for connections in the flood of information. That’s why he studied at the Faculty of Mathematics and Physics of Charles University in Prague, and from the outset tied his career to computational modeling and data analysis in molecular biology. He is one of the pioneers of bioinformatics in the Czech Republic, a field he sees as having great potential. He leads the bioinformatics research group at the Institute of Organic Chemistry and Biochemistry of the CAS and is director of ELIXIR CZ, the Czech national infrastructure for biological data. He is actively involved in shaping the national strategy for scientific data management and, in cooperation with European partners, in establishing data standards for biological research.

The article was first published in the 2/2025 Czech issue of the quarterly A / Magazine of the CAS.

Written by: Leona Matušková, External Relations Division, CAO of the CAS
Translated by: Tereza Novická, External Relations Division, CAO of the CAS
Photo: Jana Plavec, External Relations Division, CAO of the CAS; Shutterstock

The text and photos marked CC (and the bio profile photo) are released for use under a Creative Commons license.
