RAG Doesn't Work for Science (Until You Fix Three Things)
What we learned building an AI research assistant grounded in a knowledge graph instead of documents
Retrieval-augmented generation has a straightforward premise: instead of asking an LLM to answer from memory, retrieve relevant documents first and put them in context. It works well for customer support, internal knowledge bases, and documentation search.
It does not work well for science. At least, not out of the box.
We built Nexus, an AI research assistant for microbiome research. It answers questions like “What gut bacteria are associated with Type 2 Diabetes?” by querying a structured knowledge graph (not searching documents), and it analyzes uploaded datasets by generating and executing code (not by describing what analysis could be done).
Here’s what we had to change to make RAG useful for researchers.
Problem 1: Scientists don’t want “similar documents” — they want precise answers with sources
Standard RAG retrieves text chunks that are semantically similar to the query. For a question like “What is the role of Akkermansia muciniphila in metabolic syndrome?”, a document-based system returns paragraphs from papers that mention both terms. The LLM then synthesizes an answer.
This fails for scientific use in three ways:
Incompleteness: The retrieved papers are a sample, not a census. The researcher has no way to know what was missed.
No structure: “Several studies have shown associations...” is not useful. Researchers want: which studies, what direction, what sample size, what population.
Hallucination risk: The LLM may interpolate between papers, creating claims that no individual paper supports.
Our fix: Instead of retrieving documents, we query a structured knowledge graph. The question “What bacteria are associated with metabolic syndrome?” becomes a graph traversal: find all Taxon nodes connected to the Disease node “Metabolic Syndrome” via association edges. The result is a table: taxon, direction (increased/decreased), source study, confidence.
The LLM’s job changes from “synthesize an answer from fragments” to “explain structured results in natural language.” This is a much easier task with much lower hallucination risk.
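To make the shape of that lookup concrete, here is a minimal sketch in plain Python. It is not the Nexus implementation: the in-memory edge list, the Association class, the function names, and every row of data (taxa, study IDs, confidence values) are placeholders invented for illustration. A real system would run the equivalent query against a graph database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One hypothetical association edge between a Taxon and a Disease node."""
    taxon: str
    disease: str
    direction: str      # "increased" or "decreased"
    study: str          # placeholder source identifier
    confidence: float   # placeholder score in [0, 1]


# Toy edge list standing in for the graph. All rows are made up.
EDGES = [
    Association("Taxon A", "Metabolic Syndrome", "decreased", "study-001", 0.9),
    Association("Taxon B", "Metabolic Syndrome", "increased", "study-002", 0.6),
    Association("Taxon A", "Type 2 Diabetes", "decreased", "study-003", 0.7),
]


def taxa_associated_with(disease, edges):
    """Return one table row per association edge attached to the disease."""
    rows = [
        {
            "taxon": e.taxon,
            "direction": e.direction,
            "study": e.study,
            "confidence": e.confidence,
        }
        for e in edges
        if e.disease.lower() == disease.lower()
    ]
    # Highest-confidence rows first.
    return sorted(rows, key=lambda r: r["confidence"], reverse=True)


def to_prompt_table(rows):
    """Render rows as a plain-text table the LLM is asked to explain."""
    header = "taxon | direction | study | confidence"
    lines = [
        f"{r['taxon']} | {r['direction']} | {r['study']} | {r['confidence']:.2f}"
        for r in rows
    ]
    return "\n".join([header, *lines])


rows = taxa_associated_with("Metabolic Syndrome", EDGES)
table = to_prompt_table(rows)
print(table)
```

The point of the sketch is the division of labor: the query returns every matching edge, so completeness comes from the graph, and the model only sees rows it is asked to describe.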
We still use document retrieval for questions that require narrative synthesis (“What is the current understanding of gut-brain axis signaling?”), but for factual queries about associations, the graph is the source of truth.
Problem 2: Researchers need computation, not just retrieval
A researcher uploads a BIOM file (a standard microbiome abundance table) and asks: “What’s the alpha diversity of my treatment vs. control groups?”
This isn’t a retrieval problem. There’s nothing to look up. The system needs to:
Parse the BIOM file
Identify sample groups from metadata
Calculate Shannon diversity for each sample
Run a statistical test comparing groups
Generate a visualization
Explain the results
Standard RAG can’t do any of this. We added a code execution layer: the LLM generates Python code (using scikit-bio, pandas, matplotlib), we execute it in a sandboxed container, and results come back as structured data + figures.
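As a rough illustration of steps 3 and 4 in the list above, here is what a small piece of generated analysis code might look like. Several things here are assumptions rather than a description of Nexus: the sketch uses only the Python standard library instead of scikit-bio and pandas, it computes Shannon diversity with the natural log, it uses a permutation test as one possible choice of statistical test, and the count vectors and sample names are made up.

```python
import math
import random
from statistics import mean


def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up abundance counts (one list of taxon counts per sample).
counts = {
    "T1": [25, 25, 25, 25],
    "T2": [30, 30, 20, 20],
    "T3": [40, 20, 20, 20],
    "C1": [85, 5, 5, 5],
    "C2": [70, 10, 10, 10],
    "C3": [97, 1, 1, 1],
}
# Made-up sample metadata mapping each sample to its group.
group_of = {
    "T1": "treatment", "T2": "treatment", "T3": "treatment",
    "C1": "control", "C2": "control", "C3": "control",
}

by_group = {"treatment": [], "control": []}
for sample, sample_counts in counts.items():
    by_group[group_of[sample]].append(shannon(sample_counts))

p_value = permutation_p_value(by_group["treatment"], by_group["control"])
print({g: round(mean(v), 3) for g, v in by_group.items()}, "p =", round(p_value, 3))
```

With only three samples per group, a permutation test cannot produce a very small p-value, which is itself the kind of caveat the explanation step should surface to the user.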
This is the riskiest part of the system. LLM-generated code can produce results that are wrong but plausible — a statistical test with swapped groups, a diversity metric with the wrong formula, a plot with mislabeled axes. Our mitigations:
Every result includes the code that produced it, visible to the user
We validate outputs against expected ranges (e.g., Shannon diversity should be between 0 and ~5 for typical microbiome data)
For standard analyses (diversity, ordination, differential abundance), we use tested templates rather than generating code from scratch
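The range check in the second mitigation could look something like the sketch below. The EXPECTED_RANGES table and the check_outputs function are hypothetical names. The Shannon bound of 5 comes from the rule of thumb above; the p-value bound is simply the definition of a probability. Because the Shannon bound describes typical data, a value outside it is treated as a warning to show the user, not as proof of a bug.

```python
import math

# Hypothetical plausibility bounds keyed by metric name.
EXPECTED_RANGES = {
    "shannon": (0.0, 5.0),   # rule of thumb for typical microbiome data
    "p_value": (0.0, 1.0),   # probabilities must lie in [0, 1]
}


def check_outputs(results):
    """Return warnings for any known metric whose value falls outside its range."""
    warnings = []
    for name, value in results.items():
        if name not in EXPECTED_RANGES:
            continue  # no rule for this metric, so nothing to check
        low, high = EXPECTED_RANGES[name]
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            is_number = isinstance(v, (int, float)) and not math.isnan(v)
            if not is_number or not (low <= v <= high):
                warnings.append(
                    f"{name}={v!r} is outside the expected range [{low}, {high}]"
                )
    return warnings


# Example: one implausible Shannon value among otherwise reasonable outputs.
warnings = check_outputs({"shannon": [1.2, 7.9, 3.4], "p_value": 0.03})
print(warnings)
```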
Problem 3: Multi-step reasoning requires planning, not just retrieval
“What evidence supports Lactobacillus rhamnosus as a probiotic for atopic dermatitis, and what’s the proposed mechanism?”
Answering this well takes several steps:
Query the knowledge graph for L. rhamnosus ↔ atopic dermatitis associations
Find metabolites produced by L. rhamnosus
Check which metabolites are involved in immune-related pathways
Search literature for clinical trial results
Synthesize the mechanistic pathway: taxon → metabolite → pathway → disease
Standard RAG does one retrieval step. We implemented a multi-step “deep research” mode:
The LLM receives the question and plans a sequence of queries
Each query executes against the knowledge graph or literature search
Intermediate results inform the next query
The final answer synthesizes across all steps, with per-claim citations
This is slower (30-60 seconds vs. about 5 seconds for a simple query), but for complex questions the answers are more complete, and each claim traces back to a source.
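The plan-execute-synthesize loop described above can be sketched as follows. Everything in it is a stand-in: stub_planner returns a fixed plan where the real system would call an LLM, stub_tool returns placeholder strings where the real system would query the graph or search the literature, and the class and function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # "graph" or "literature"
    query: str


@dataclass
class Finding:
    step: Step
    result: str
    source: str  # placeholder citation for this step's result


def stub_planner(question, findings):
    """Stand-in for the LLM planner: return the next Step, or None when done."""
    plan = [
        Step("graph", "associations between the taxon and the disease"),
        Step("graph", "metabolites produced by the taxon"),
        Step("graph", "immune-related pathways involving those metabolites"),
        Step("literature", "clinical trial results"),
    ]
    return plan[len(findings)] if len(findings) < len(plan) else None


def stub_tool(step, findings):
    """Stand-in for a real query; it sees earlier findings, as the loop requires."""
    seen = len(findings)
    return Finding(
        step=step,
        result=f"<result of '{step.query}' using {seen} earlier finding(s)>",
        source=f"{step.tool}-source-{seen + 1}",
    )


TOOLS = {"graph": stub_tool, "literature": stub_tool}


def deep_research(question, planner, tools, max_steps=8):
    """Ask the planner for one step at a time and feed results back in."""
    findings = []
    for _ in range(max_steps):  # hard cap so a bad plan cannot loop forever
        step = planner(question, findings)
        if step is None:
            break
        findings.append(tools[step.tool](step, findings))
    return findings


def synthesize(findings):
    """Join findings into an answer in which every line carries its source."""
    return "\n".join(
        f"{i}. {f.result} [{f.source}]" for i, f in enumerate(findings, 1)
    )


findings = deep_research("example question", stub_planner, TOOLS)
answer = synthesize(findings)
print(answer)
```

Passing the accumulated findings to both the planner and the tools is what lets intermediate results shape the next query. The step cap is one simple way to bound latency.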
What still doesn’t work well:
Contradictory evidence: When studies disagree (Study A: increased, Study B: decreased), the system presents both but struggles to explain why they differ (different populations, methods, sample sizes). We’re working on incorporating study metadata to help here.
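Detecting that studies disagree is the easy half, and a sketch makes it clear where the hard half begins. The function name and the metadata fields below (population, n) are hypothetical, and the rows are made up. The code only groups rows by taxon and disease and flags pairs reported in both directions. Explaining the disagreement would require reasoning over the metadata, which this sketch does not attempt.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Group rows by (taxon, disease) and keep groups with more than one direction."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["taxon"], row["disease"])].append(row)
    return {
        key: group
        for key, group in grouped.items()
        if len({r["direction"] for r in group}) > 1
    }


# Made-up association rows with hypothetical study metadata attached.
rows = [
    {"taxon": "Taxon A", "disease": "Disease X", "direction": "increased",
     "study": "study-001", "population": "adults", "n": 40},
    {"taxon": "Taxon A", "disease": "Disease X", "direction": "decreased",
     "study": "study-002", "population": "children", "n": 250},
    {"taxon": "Taxon B", "disease": "Disease X", "direction": "increased",
     "study": "study-003", "population": "adults", "n": 90},
]

conflicts = find_conflicts(rows)
for (taxon, disease), group in conflicts.items():
    print(taxon, disease, [(r["study"], r["direction"], r["n"]) for r in group])
```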
Recency: The knowledge graph is updated periodically, not continuously. A paper published last week won’t be in the graph. We supplement with real-time PubMed search, but integration is imperfect.
Confidence calibration: When the system says “strong evidence supports...” vs. “some evidence suggests...”, how calibrated is that language? We don’t have a good answer yet.
Evaluation: There’s no standard benchmark for “did the AI give a good scientific answer?” We’re building our own evaluation set with domain experts, but it’s slow and expensive.
The broader lesson:
RAG is a pattern, not a solution. For scientific applications, the retrieval source (graph vs. documents), the generation task (explain results vs. synthesize text), and the reasoning structure (single-step vs. multi-step) all need to be purpose-built. The LLM is the least important component — the knowledge architecture underneath it is what determines whether the system is useful or dangerous.
Try Nexus at researchprod.graphomics.com. We’d genuinely love feedback from researchers on what works and what doesn’t.
Next time: Why “no-code bioinformatics” offends some researchers — and what we learned about designing tools for scientists who can code but don’t want to.
