What We Learned Integrating 12 Biological Databases Into One Knowledge Graph
Taxonomy is a mess, provenance is everything, and entity reconciliation will haunt your dreams
MicroMap, our microbiome knowledge graph, connects data from NCBI Taxonomy, Disbiome, BugSigDB, gutMDisorder, HMDB, KEGG, ChEMBL, Reactome, PubMed, PubChem, and CARD. Building it taught us more about the state of biological data than any paper or course ever could.
This post is about the hard problems — not the ones you’d expect (scale, performance, storage) but the ones that actually consumed most of our engineering time.
The taxonomy problem
In 2020, the genus Lactobacillus was split into 25 genera. Lactobacillus rhamnosus became Lacticaseibacillus rhamnosus. Overnight, every database that referenced the old name became partially outdated.
Some databases updated. Some didn’t. Some partially updated. The result: the same organism appears under different names in different sources. If you’re integrating data across sources and you don’t reconcile these, you’ll either miss associations or double-count them.
Our approach:
Canonical identifiers: Everything normalizes to NCBI Taxonomy IDs. These are the closest thing to a universal identifier in microbiology.
Synonym mapping: We maintain a mapping table of old names → new names, strain-level identifiers → species-level IDs, and abbreviations → full names.
Fuzzy matching: For the long tail, we use approximate string matching with manual review for high-impact taxa (the top ~500 most-studied organisms).
Confidence scores: Every entity mapping gets a confidence score. Exact NCBI ID match = 1.0. Fuzzy name match = 0.7. Genus-level-only match = 0.5. Users can filter by confidence.
We estimate that ~95% of our associations have high-confidence entity mappings. The remaining ~5% are flagged but included, because excluding them would lose real signal.
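To make the cascade above concrete, here is a minimal Python sketch of how the four steps could fit together. It is an illustration, not MicroMap's actual code: the lookup tables, the placeholder IDs (not real NCBI Taxonomy IDs), the `reconcile` function, and the 0.85 similarity cutoff are all assumptions. The post also does not say how a curated synonym hit is scored, so the sketch treats it like an exact ID match. Fuzzy matching uses `difflib` from the standard library as a stand-in for whatever approximate matcher you prefer.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Optional

# Hypothetical lookup tables. The numeric IDs are placeholders,
# not real NCBI Taxonomy IDs.
CANONICAL = {
    "lacticaseibacillus rhamnosus": 1001,
    "bacteroides fragilis": 1002,
}
GENUS_IDS = {"lacticaseibacillus": 2001, "bacteroides": 2002}
SYNONYMS = {
    "lactobacillus rhamnosus": "lacticaseibacillus rhamnosus",  # pre-2020 name
    "b. fragilis": "bacteroides fragilis",                      # abbreviation
}


@dataclass(frozen=True)
class Mapping:
    taxon_id: int
    confidence: float
    method: str


def _resolve(key: str) -> Optional[int]:
    """Look up a normalized name, following the synonym table first."""
    return CANONICAL.get(SYNONYMS.get(key, key))


def reconcile(name: str, source_taxon_id: Optional[int] = None) -> Optional[Mapping]:
    # 1. The source already supplies a taxonomy ID: exact match.
    if source_taxon_id is not None:
        return Mapping(source_taxon_id, 1.0, "exact_id")

    key = " ".join(name.lower().split())  # lowercase, collapse whitespace

    # 2. Curated name or synonym. Assumption: scored like an exact ID match.
    taxon_id = _resolve(key)
    if taxon_id is not None:
        return Mapping(taxon_id, 1.0, "synonym_table")

    # 3. Fuzzy match against every known name (canonical and synonym).
    known = list(CANONICAL) + list(SYNONYMS)
    close = get_close_matches(key, known, n=1, cutoff=0.85)
    if close:
        return Mapping(_resolve(close[0]), 0.7, "fuzzy_name")

    # 4. Fall back to the genus if only the first token is recognized.
    genus = key.split(" ")[0] if key else ""
    if genus in GENUS_IDS:
        return Mapping(GENUS_IDS[genus], 0.5, "genus_only")

    return None  # unmapped: leave for manual review
```

Keeping the `method` next to the score means a reviewer can pull up every fuzzy or genus-only mapping later without re-running the matcher.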
The directionality problem
“Bacteroides fragilis is associated with colorectal cancer.”
Is it increased in CRC patients? Decreased? Present in some studies but not others? Causal or correlational?
This single sentence is almost useless without direction, effect size, and study context. But many databases store associations without this nuance.
How different sources encode direction:
Disbiome: “elevated” or “reduced” — clean, but binary
BugSigDB: Signed effect sizes from differential abundance analysis — rich, but heterogeneous
gutMDisorder: Free-text descriptions — requires NLP to extract direction
We normalized to three categories: increased, decreased, and associated (when direction is genuinely unknown). The original description from each source is preserved so researchers can make their own judgment.
This matters more than it sounds. A researcher studying probiotics for IBD needs to know which taxa are decreased in disease — those are the candidates for supplementation. If your database just says “associated,” it’s not actionable.
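A rough Python sketch of that three-way normalization is below. Everything in it is assumed for illustration: the function names, the keyword lists, and the sign convention (a positive effect size means "increased in cases"). None of it reflects the real schemas of Disbiome, BugSigDB, or gutMDisorder. The keyword matcher is a deliberately crude stand-in for the NLP step that free text actually needs.

```python
import re
from dataclasses import dataclass

# Illustrative keyword lists; a real pipeline would need proper NLP.
_UP = re.compile(r"\b(elevated|increased|enriched|higher)\b", re.IGNORECASE)
_DOWN = re.compile(r"\b(reduced|decreased|depleted|lower)\b", re.IGNORECASE)


@dataclass(frozen=True)
class Association:
    source: str
    direction: str  # "increased", "decreased", or "associated"
    original: str   # the source's own wording, preserved verbatim


def from_label(source: str, label: str) -> Association:
    """Binary labels such as 'elevated' / 'reduced'."""
    mapping = {"elevated": "increased", "reduced": "decreased"}
    direction = mapping.get(label.strip().lower(), "associated")
    return Association(source, direction, label)


def from_effect_size(source: str, effect: float) -> Association:
    """Signed effect sizes. Assumption: positive means increased in cases."""
    if effect > 0:
        direction = "increased"
    elif effect < 0:
        direction = "decreased"
    else:
        direction = "associated"
    return Association(source, direction, str(effect))


def from_free_text(source: str, text: str) -> Association:
    """Free text. Ambiguous or direction-free sentences stay 'associated'."""
    up = bool(_UP.search(text))
    down = bool(_DOWN.search(text))
    if up and not down:
        direction = "increased"
    elif down and not up:
        direction = "decreased"
    else:
        direction = "associated"
    return Association(source, direction, text)
```

A sentence that mentions both directions falls back to "associated" instead of guessing, and the `original` field keeps the source wording available for anyone who wants to judge for themselves.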
The provenance problem
When you traverse a knowledge graph — Taxon → produces Metabolite → involved in Pathway → disrupted in Disease — each edge comes from a different source. The full path might be novel (no paper connects all four entities), but each individual connection is independently supported.
This is simultaneously the most powerful and most dangerous feature of a knowledge graph. Powerful because it surfaces non-obvious connections. Dangerous because a four-hop path with one weak link looks the same as one with four strong links.
Our approach:
Every edge stores: source database, original paper (DOI), confidence score, and extraction method
Path queries return per-edge provenance, not just the path
We never synthesize “overall confidence” for a path — that’s a judgment call for the researcher
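As a sketch of what "per-edge provenance" can look like in code, the Python below stores the four provenance fields on every edge and has the path query return the edges themselves, with no aggregated score. The `Edge` and `Graph` classes, the `find_paths` method, the entity names, source names, and DOIs are all hypothetical placeholders, not MicroMap's schema or real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    source_db: str
    doi: str                # placeholder values below, not real DOIs
    confidence: float
    extraction_method: str


class Graph:
    def __init__(self) -> None:
        self._out: Dict[str, List[Edge]] = {}

    def add(self, edge: Edge) -> None:
        self._out.setdefault(edge.subject, []).append(edge)

    def find_paths(self, start: str, end: str, max_hops: int = 4) -> List[List[Edge]]:
        """Return every simple path as a list of edges, provenance included.

        No path-level confidence is computed on purpose.
        """
        results: List[List[Edge]] = []

        def walk(node: str, path: List[Edge], seen: Set[str]) -> None:
            if node == end and path:
                results.append(list(path))
                return
            if len(path) == max_hops:
                return
            for edge in self._out.get(node, []):
                if edge.obj in seen:  # avoid cycles
                    continue
                path.append(edge)
                seen.add(edge.obj)
                walk(edge.obj, path, seen)
                path.pop()
                seen.remove(edge.obj)

        walk(start, [], {start})
        return results


g = Graph()
g.add(Edge("taxon:A", "produces", "metabolite:B", "source_1", "10.0000/example-1", 1.0, "curated"))
g.add(Edge("metabolite:B", "involved_in", "pathway:C", "source_2", "10.0000/example-2", 0.5, "text_mining"))
g.add(Edge("pathway:C", "disrupted_in", "disease:D", "source_3", "10.0000/example-3", 0.7, "curated"))

for path in g.find_paths("taxon:A", "disease:D"):
    for e in path:
        print(f"{e.subject} -[{e.predicate}]-> {e.obj}  ({e.source_db}, {e.doi}, conf={e.confidence})")
```

Printing each hop with its own source and score keeps the weak link visible: the 0.5 edge in the middle is not averaged away.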
The update problem
Biological databases are living things. NCBI Taxonomy updates regularly. New papers appear daily. Disbiome and BugSigDB add new associations. HMDB releases new versions annually.
Our knowledge graph can’t be a static snapshot — it needs to stay current. But re-integrating data on every update risks breaking existing associations, introducing duplicates, or overwriting curated corrections.
We use a versioned ingestion pipeline:
Each data source has an ETL script that runs on its update schedule
New data is loaded into a staging graph and diffed against production
Additions are auto-merged; deletions and changes are flagged for review
Every production update is tagged with a version number
This is still our biggest operational headache. It works, but it’s not elegant.
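The staging-and-diff step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our pipeline: edges are modeled as a dict keyed by (subject, predicate, object), both dicts are assumed to hold only one source's slice of the graph, and the names `diff_graphs` and `apply_update` are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

EdgeKey = Tuple[str, str, str]   # (subject, predicate, object)
Graph = Dict[EdgeKey, dict]      # edge key -> attributes


@dataclass
class Diff:
    added: Graph = field(default_factory=dict)
    removed: Graph = field(default_factory=dict)
    changed: Dict[EdgeKey, Tuple[dict, dict]] = field(default_factory=dict)


def diff_graphs(production: Graph, staging: Graph) -> Diff:
    """Compare one source's staging edges with its production edges."""
    diff = Diff()
    for key, attrs in staging.items():
        if key not in production:
            diff.added[key] = attrs
        elif production[key] != attrs:
            diff.changed[key] = (production[key], attrs)
    for key, attrs in production.items():
        if key not in staging:
            diff.removed[key] = attrs
    return diff


def apply_update(production: Graph, staging: Graph, version: int):
    """Auto-merge additions; queue deletions and changes for review."""
    diff = diff_graphs(production, staging)
    merged = dict(production)
    merged.update(diff.added)  # additions only
    review_queue: List[Tuple[str, EdgeKey]] = (
        [("removed", key) for key in diff.removed]
        + [("changed", key) for key in diff.changed]
    )
    # Bump the version only when production actually changed.
    new_version = version + 1 if diff.added else version
    return merged, review_queue, new_version
```

Because removed and changed edges are left untouched in `merged`, a curated correction in production survives until a human approves the change.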
What we’d tell someone starting a similar project:
Start with identifiers, not names. If you can’t map an entity to a canonical ID, don’t include it. The name-matching rabbit hole is infinite.
Store everything, normalize later. Keep the raw data from each source and build your normalized view on top. You will re-normalize.
Provenance is not optional. In science, “where did this come from?” is the first question anyone asks. Build it into your schema from day one, not as an afterthought.
Expect conflicts. Study A says increased, Study B says decreased. Both can be correct (different populations, different methods). Your database needs to hold both and let the user decide.
Budget 3x the time you think you need. Data integration is 20% building pipelines and 80% handling edge cases.
The MicroMap API is documented at kgdev.graphomics.com/docs. Ask us for an API key at graphomics.com. If you’re working on a similar integration project, we’d love to compare notes.
Next time: How we built an AI research assistant that queries structured data instead of documents — and why standard RAG doesn’t work for science.
