How Data is Reshaping Science – Part 2: The Rise of Scientific AI Models
We went upstream in Part 1. The lab used to come first. Now it often comes last. Researchers begin with simulations, with data, with models that run thousands of trials before anything is built or touched. That shift was already reshaping the way we test ideas. But now the ideas themselves are coming from a different layer of the process.
We are witnessing the rise of scientific AI models. They are not just instruments for simulating or sorting lab data, but models trained on the science itself — its language and structure and memory. They learn from how science is done, then try to do parts of it themselves. In this part, we explore that shift: what it means to build machines that don’t just support discovery, but try to internalize the process of thinking like a scientist.
Beyond Output: Teaching AI to Think Like a Scientist
The shift toward science-trained models is already underway. Some systems are now being exposed to the full stack of scientific output, not just final results but also the reasoning and structure that bind the parts of a scientific paper together. What is being modeled is not just knowledge. It’s how we get to that knowledge.
Meta’s Galactica attempted to do this at scale. It was trained on tens of millions of scientific papers and textbooks and built to generate, cite, and summarize research. For a moment, it seemed like a real breakthrough, until it didn’t. Within three days of going live, researchers were flagging hallucinated citations, confident nonsense, and fake papers polished enough to fool casual readers. The model did not fail technically; it failed the moment it was asked to be trusted. Meta promptly pulled it. Still, it was a turning point. It showed that scale alone isn’t enough — scientific AI needs grounding, transparency, and proof.
Other developers are taking different approaches to building a reliable scientific AI model. ProteinLM is teaching itself about proteins through trial and error, discovering biochemical patterns it was never explicitly trained on. MatSciBERT and its deep-learning cousin CrabNet apply scientific language modeling to composition data in order to predict which properties look promising — and which tend to fall apart.
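To make the composition-to-property idea concrete, here is a minimal sketch. It is not MatSciBERT’s or CrabNet’s actual pipeline (those use transformer encoders); it simply turns chemical formulas into element-fraction vectors and fits a standard regressor. The formulas, the band-gap values, and the candidate composition are illustrative stand-ins.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small element vocabulary; a real model would cover the periodic table.
ELEMENTS = ["O", "Si", "Ti", "Fe", "Cu", "Zn", "Ga", "As"]

def featurize(formula: str) -> np.ndarray:
    """Turn a formula like 'Ga1As1' into a vector of element fractions."""
    counts = {el: 0.0 for el in ELEMENTS}
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        if el in counts:
            counts[el] += float(num) if num else 1.0
    total = sum(counts.values()) or 1.0
    return np.array([counts[el] / total for el in ELEMENTS])

# Toy training set: formulas paired with illustrative band-gap values (eV).
formulas = ["Si1O2", "Ti1O2", "Ga1As1", "Zn1O1", "Cu2O1", "Fe2O3"]
band_gaps = [8.9, 3.2, 1.4, 3.4, 2.1, 2.0]

X = np.stack([featurize(f) for f in formulas])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, band_gaps)

# Screen a hypothetical candidate before anyone synthesizes it.
candidate = "Zn1Ti1O3"
prediction = model.predict(featurize(candidate).reshape(1, -1))[0]
print(candidate, "predicted band gap (eV):", round(prediction, 2))
```

Real composition-based models swap the hand-built features for learned embeddings, but the workflow is the same: score candidates before anything reaches the lab.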
Some models are designed to digest varied inputs in order to produce more grounded results. MoLFormer links chemical graphs with text, mapping structure to description. Finch and PaperQA read entire research papers, following chains of logic and flagging weak assumptions that a keyword search would miss.
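The first step these paper-readers take can be sketched simply. The example below is a stand-in, not Finch’s or PaperQA’s actual pipeline (those use LLM embeddings with an answering model on top): it ranks a few made-up passages from an imaginary paper against a question using TF-IDF similarity, which is the basic move of pulling the relevant reasoning out of a document before critiquing it.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "paper": a handful of passages standing in for sections of a manuscript.
passages = [
    "We assume the catalyst loading is constant across all trials.",
    "Band gaps were computed with DFT and compared to experiment.",
    "The control group was sampled from a different batch of material.",
    "Results show a 12% improvement over the baseline method.",
]

question = "What assumptions could weaken the comparison to the baseline?"

# Embed passages and question in the same TF-IDF space, then rank by similarity.
vectorizer = TfidfVectorizer().fit(passages + [question])
passage_vecs = vectorizer.transform(passages)
question_vec = vectorizer.transform([question])

scores = cosine_similarity(question_vec, passage_vecs)[0]
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.2f}  {passages[idx]}")
```

In a full pipeline, the top-ranked passages would be handed to a language model to reason over; here the ranking itself is the point.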
This new generation of models isn’t defined just by size. It’s shaped by what the models are exposed to. They’re trained on the full scaffolding of research — methodology, citations, figures, captions, commentary. They start to notice the subtleties: how confidence is conveyed, how evidence is built up, how doubt is expressed without derailing the argument. What they’re learning isn’t just content, but the tone and rhythm of scientific thinking. They’re beginning to operate at the level where thought is formed, not just summarized — and that shift may matter more than any specific result they produce.
When Data Becomes the Method
The rise of scientific AI isn’t just about smarter algorithms; it’s also about what the models are being built on. These models aren’t learning from textbooks or curated collections of data. They’re picking up the messy residue of science itself — the raw, often fragmented traces that researchers leave behind in papers. Data holds the key, and this is data with memory.
A figure in the results section is never just a chart. It shows what the scientist thought was worth sharing. Maybe it stood out. Maybe it supported the point. The discussion section tells you something else. That’s where the thinking continues. The uncertainty, the mixed results, the possible explanations. It’s where the scientist reflects on what worked, what didn’t, and what it might all mean. That’s the kind of material models need to learn from. If they only see the polished answers, they miss the real process. They might learn to repeat conclusions, but not how to think through them. To build models that can reason, we have to show them how reasoning looks.
What goes into these models isn’t neutral. Some areas of science are overrepresented, others barely appear. Some researchers get cited constantly, while others go unnoticed. If we’re not careful, the model ends up learning a distorted version of what science is. It picks up our patterns, our blind spots, and carries them forward. The future it helps us imagine is shaped by the past we chose to feed it.
We actually mentioned this trend recently when we looked at MIT’s CRESt platform, which scoured literature, images and lab results in order to help guide experimental planning. The system wasn’t just trained on outcomes, but on the surrounding context — how results were framed, revised and interpreted.
Collaborative AI in Research
We’ve now seen what scientific models are capable of. Now we come to something less settled: how scientists are actually responding to them. Are these models being treated as research partners or simply highly skilled assistants? Are scientists using them to blaze into new territory, or following their lead without fully grasping what they’re doing? The answers aren’t always clear. But in an increasing number of labs, models are guiding decisions — not always comfortably, and not always with confidence, but with enough trust to shift the direction of research.
At the National Institute for Materials Science (NIMS) in Japan, AI is doing more than just crunching numbers; it’s helping steer researchers toward promising new areas of study. Their materials informatics team has integrated machine learning into the early stages of material discovery, using models to identify compound candidates before anything is synthesized. Researchers don’t follow the model blindly. But AI is increasingly shaping the conversation.
This is not unique to NIMS. Scientists across fields are grappling with how to collaborate with models that suggest rather than just analyze ideas. In the life sciences, tools like Finch, developed by the Allen Institute for AI, are trained to read full papers, track reasoning chains, and flag soft spots or logical gaps. One team described the interaction as less like querying a search engine and more like debating with a hyper-literate lab partner.
It’s interactive, sometimes messy, and deeply iterative. Scientists prompt, adjust, and occasionally argue with their models. And when the model offers something unexpected — a strange suggestion, an outlier candidate, a line of logic no one had considered — the lab has to pause and ask: do we follow this?
That’s the shift. Discovery is no longer a clean pipeline from idea to test. It’s a dialogue between human intuition and machine-generated possibility — with all the uncertainty that brings.
Then the harder question arises: what do we believe when the model gets ahead of us?
The Epistemic Shift
The more profound story of scientific AI is not just that the models are improving. It’s that data is teaching them how to produce knowledge in ways that challenge long‑standing scientific values. Models trained on enormous datasets learn patterns no human could catalog, let alone examine. When they produce discoveries, they do it through statistical structure, not through theory or direct observation.
This inverts an old hierarchy. The traditional flow of information once supported knowledge. You observed, you reasoned, you confirmed. The strength lay in the connection between explanation and result. The relationship now moves the other way. The model finds something in the data, and we verify afterward whether it holds.
AlphaTensor captured this reversal. It searched the combinatorial space of matrix multiplication algorithms and revealed constructions that felt unfamiliar at first glance. It wasn’t until mathematicians studied them later that they confirmed the model had found genuinely faster approaches for certain cases. The explanation followed the pattern.
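For a feel of the search space involved, the classic Strassen construction below multiplies two 2x2 matrices with seven scalar multiplications instead of the naive eight. It is not one of AlphaTensor’s own discoveries, but it is exactly the kind of algorithmic object the model was hunting for in larger cases.

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen, 1969)
    instead of the naive 8."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
assert np.array_equal(strassen_2x2(A, B), A @ B)  # matches the naive product
```

Nobody would guess those seven products by staring at the definition of matrix multiplication; they fall out of searching a combinatorial space, which is why the explanation tends to arrive after the discovery.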
Protein design shows similar behavior. A model examines massive biochemical datasets, suggests a new structure, and the lab finds that the protein folds and functions. No one fully understands why the design works. The reason lives inside the data that trained the model.
This is a new empiricism. It relies on data‑driven performance rather than transparent reasoning. The trust does not come from comprehension but from repeated results that appear reliable despite the opacity. It creates discomfort. It also expands the realm of what science can attempt.
Data is no longer just something to analyze. It has become a discovery engine, generating insights that surpass the reach of individual minds. Science has always held wonder and proof in careful balance. Now new data pushes that balance into unfamiliar territory. The opportunity remains real, as long as curiosity grows alongside caution.
In Part 3 next week, we go under the hood — into the data infrastructure powering today’s scientific models. Supercomputers, open datasets, automated labs. Subscribe and follow so you don’t miss it.


