Eval Reports

We publish our evaluator baselines so customers can audit the quality of the Priolix Evidence API. Numbers below are auto-loaded from the autoresearch pipeline; re-running python3 jobs/autoresearch/run_all_evals.py refreshes this page.

88.8

Overall Composite

Mean of non-skipped evaluator composites from the latest aggregate run.
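As a rough illustration of that definition, the sketch below averages evaluator composites and ignores skipped ones. The function name, the use of `None` to mark a skipped evaluator, and the example values are all assumptions for illustration; the actual aggregation lives in the autoresearch pipeline and may differ.

```python
from statistics import mean
from typing import Optional


def overall_composite(composites: dict[str, Optional[float]]) -> float:
    """Mean of evaluator composites, ignoring skipped (None) evaluators."""
    scored = [value for value in composites.values() if value is not None]
    if not scored:
        raise ValueError("no evaluator produced a composite")
    return round(mean(scored), 1)


# Illustrative values only, not taken from the report.
example = {"evaluator_a": 90.0, "evaluator_b": 80.0, "evaluator_c": None}
print(overall_composite(example))  # 85.0
```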

FAQ Generation

98.8

Quality of LLM-generated supplement FAQs (grounding, accuracy, safety, hallucination).

Composite: 98.8%
Grounding: 98.8%
Accuracy: 100.0%
Safety: 96.2%
Hallucination-free: 100.0%

Chat Answer Quality

91.2

End-to-end chat answer quality across grounded/safety/misconception edge cases.

Composite: 91.2%
Grounding: 85.1%
Accuracy: 93.9%
Safety: 96.3%
Hallucination-free: 89.5%

Evidence Retrieval

87.0

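Hybrid (BM25 + dense) retrieval quality. The figures below match the 24-query v1 baseline bucket of the 148-query v2 eval set (5 production-gap categories + 24 v1 baseline); the v2 composite across all 148 queries is listed under "How we got here".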

Composite: 87.0%
Recall@15: 83.6%
Recall@50: 100.0%
MRR: 94.6%
nDCG@15: 86.3%

Production: hybrid retrieval (BM25 + BGE-M3 via Reciprocal Rank Fusion) over 14,344 sources. Cross-encoder reranker (bge-reranker-v2-m3) enabled for queries >=10 words via RERANKER_ROUTER=auto. HyDE + query decomposition shipped behind feature flags, default OFF. Eval set v2 (148 queries) is now default; v1 (24 queries, 87.0 composite) preserved as a tagged sub-bucket for trend continuity.
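The word-count routing described above can be sketched as follows. This is a minimal illustration, not the production router: the function name is hypothetical, whitespace splitting is an assumed way of counting words, and treating every `RERANKER_ROUTER` value other than `auto` as "off" is an assumption. `RERANKER_MIN_WORDS` is the threshold variable named in the changelog further down.

```python
import os
from typing import Mapping, Optional


def should_rerank(query: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether to run the cross-encoder reranker for a query.

    Assumption: any RERANKER_ROUTER value other than "auto" disables routing.
    """
    env = os.environ if env is None else env
    if env.get("RERANKER_ROUTER", "off") != "auto":
        return False
    min_words = int(env.get("RERANKER_MIN_WORDS", "10"))
    # Assumption: words are counted by splitting on whitespace.
    return len(query.split()) >= min_words


settings = {"RERANKER_ROUTER": "auto", "RERANKER_MIN_WORDS": "10"}
print(should_rerank("vitamin d dose", settings))  # False
print(should_rerank("does magnesium glycinate interact with blood pressure medication at night", settings))  # True
```

Passing the settings as a mapping keeps the sketch testable without touching the process environment.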

Biomarker → Evidence Mapping

97.6

Quality of biomarker → condition / supplement / lifestyle mappings used by the lab API.

Composite: 97.6%
Condition Relevance: 93.8%
Supplement Evidence: 99.4%
Lifestyle Fit: 99.7%

Post-Phase 4. Tier-D backfill (173 PubMed sources for 7 slugs) lifted supplement_evidence to 99.4.

Blood Test Extraction

89.2

Accuracy of GPT-4o vision extraction + cross-reference against curated lab reports.

Composite: 89.2%
Extraction: 93.7%
Flagging: 90.6%
Cross-ref: 83.3%

Post-Phase 4. Flag accuracy held at 90.6; cross-reference quality improved to 83.3 (Tier-D promotions benefit supplement evidence quality).

Retrieval — under the hood

Production retrieval combines BM25 (keyword) and dense embeddings (BAAI/bge-m3) via Reciprocal Rank Fusion over a 14,344-source corpus. Eval set: 24 labelled queries, each capped at 15 relevant PMIDs.
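For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the technique: each document scores the sum of 1 / (k + rank) across the rankings it appears in. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as are the function name and the toy document IDs; the production fusion parameters are not stated on this page.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; the best fused score comes first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # toy keyword leg
dense_ranking = ["b", "d", "a"]  # toy dense leg
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['b', 'a', 'd', 'c']
```

Documents found by both legs ("a" and "b") rise above those found by only one, which is the property the hybrid setup relies on.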

Component contribution (ablation)

What each retrieval leg buys us, holding everything else constant:

Mode                 Composite  Recall@15  Recall@20  Recall@50  MRR
Hybrid (production)  87.0%      83.6%      91.9%      100.0%     94.6%
Dense only           79.6%      74.2%      86.1%      98.6%      91.2%
BM25 only            72.5%      65.6%      77.2%      95.3%      89.2%

Both legs pull their weight — hybrid beats dense-only by 7.4 composite points and BM25-only by 14.5 points. BM25 catches exact technical strings (PMIDs, drug names, doses); dense catches synonyms and paraphrasing. Their failure modes are complementary.

Embedding bake-off

Holding BM25 + ranking constant, swapping the dense embedder:

Model                             Dim   Composite  Recall@15
BAAI/bge-m3                       1024  87.0%      83.6%
BAAI/bge-large-en-v1.5            1024  81.2%      76.1%
NeuML/pubmedbert-base-embeddings  768   80.9%      73.3%

BGE-M3 wins by ~6 composite points over both alternatives. Notably, the biomedical specialist (PubMedBERT) trails both general-purpose models, narrowly in the case of BGE-large-en (80.9 vs 81.2): on this corpus, a larger modern model (~568 M params) beats domain specialization (~110 M params).

Eval set v2 — per-category breakdown

The legacy 24-query gold set saturated at composite 87.0 (recall@50 = 100%), so we expanded it to 148 queries across five production-gap categories using local Qwen 3.6 35B for both query generation and relevance labelling. The pipeline is fully automated and reproducible — no human review.

Labelling setup: Qwen 3.6 35B (a3b) via Ollama; 3-leg pool union (hybrid/bm25_only/dense_only top-50); 6 hard rule filters; no human review.

Category       Queries  Composite  Recall@15  MRR
v1 baseline    24       87.0%      83.6%      94.6%
paraphrase     10       82.7%      77.0%      100.0%
typo           6        74.3%      70.0%      86.7%
long natural   60       66.6%      56.5%      91.0%
dose response  20       61.3%      56.0%      81.6%
drug interact  28       54.6%      50.2%      95.5%
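As a sanity check on the table, the query-weighted mean of the per-category composites reproduces the 68.3 overall composite listed for eval-v2-idfix in the changelog. Whether the pipeline computes the overall figure exactly this way is an assumption; the sketch only shows that the published numbers are consistent with a per-query average.

```python
# (category, queries, composite) copied from the per-category table.
rows = [
    ("v1 baseline", 24, 87.0),
    ("paraphrase", 10, 82.7),
    ("typo", 6, 74.3),
    ("long natural", 60, 66.6),
    ("dose response", 20, 61.3),
    ("drug interact", 28, 54.6),
]

total_queries = sum(queries for _, queries, _ in rows)
weighted_composite = sum(queries * score for _, queries, score in rows) / total_queries

print(total_queries)                 # 148
print(round(weighted_composite, 1))  # 68.3
```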

v2 still exposes production gaps relative to the saturated v1 set, but the gaps are smaller than initially reported because of an ID-matching bug (see id_match_fix). After the fix, drug_interact scores 54.6 (vs 87.0 on v1, ~32 pts of headroom) and remains the hardest category. long_natural and dose_response also have ~20-25 pts of headroom. Set EVAL_SET_VERSION=v1 to fall back to the legacy 24-query set. The hybrid vs BM25-only spread is 15.7 pts overall, with a minimum of 7.2 pts on typo, so improvements will be measurable.
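The ID-matching bug can be illustrated with a small sketch. Per the changelog, curated interaction rows have an empty pmid, so matching on pmid could never credit them, while matching on source_id does. The record shapes, the helper name, and the example IDs below are hypothetical; only the field names `pmid` and `source_id` come from this page.

```python
from typing import Optional


def first_hit_rank(retrieved: list[dict], gold: list[dict], id_field: str) -> Optional[int]:
    """Return the 1-based rank of the first retrieved row that matches gold, or None."""
    gold_ids = {row[id_field] for row in gold if row[id_field]}
    for rank, row in enumerate(retrieved, start=1):
        if row[id_field] and row[id_field] in gold_ids:
            return rank
    return None


# Hypothetical records: a curated interaction row (empty pmid) and a PubMed row.
retrieved = [
    {"source_id": "interaction-001", "pmid": ""},
    {"source_id": "pubmed-123", "pmid": "123"},
]
gold = [{"source_id": "interaction-001", "pmid": ""}]

print(first_hit_rank(retrieved, gold, "pmid"))       # None: the rank-1 hit is not credited
print(first_hit_rank(retrieved, gold, "source_id"))  # 1
```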

How we got here

Version                 Date        Composite  Note
pre-day1                2026-04-26  47.7%      label-bound metric (gold set had >15 PMIDs/query, capping recall@15 mathematically)
post-day1               2026-04-27  87.0%      metric fix + multi-query BM25 fan-out + rebalanced ranking weights
post-day3               2026-04-27  87.0%      HyDE + decomposition shipped behind flags, kept OFF (HyDE hurts top-15 by 3 pts on this eval)
post-day4               2026-04-27  87.0%      embedding bake-off confirmed BGE-M3 wins by 6 pts vs PubMedBERT and BGE-large-en
eval-v2                 2026-04-27  60.9%      eval set expanded 24->148 queries via local Qwen 3.6 (long_natural, drug_interact, dose_response, paraphrase, typo). v1 composite preserved at 87.0; v2 drops to 60.9 because new categories expose real production gaps; this is the desired headroom for future work.
eval-v2-idfix           2026-04-27  68.3%      fixed eval ID-matching: curated `Interaction: X x Y` rows have empty pmid; metrics now match by source_id. drug_interact 22.6 -> 54.6 (+32 pts), overall 60.9 -> 68.3 (+7.4 pts). Retrieval was already finding these records at rank 1; we just weren't crediting it.
reranker-router-prod    2026-04-27  70.0%      RERANKER_ROUTER=auto + RERANKER_MIN_WORDS=10 flipped on in production. Cross-encoder (bge-reranker-v2-m3) runs only on queries >=10 words. v2 composite 68.3 -> 70.0 (+1.7), dose_response +4.6, drug_interact +1.4, long_natural +1.6, paraphrase +1.4, v1_baseline 0 (canary preserved). Prod latency: short ~0.4s, long ~0.5s (warm model).
smarter-router-signals  2026-04-27  70.9%      added typo detection (edit-distance signal) and drug-interaction pattern detection to reranker router. typo +9.0 (74.3 -> 83.3), drug_interact +3.1 (56.0 -> 59.1), v1_baseline still 87.0 (canary preserved). Overall 70.0 -> 70.9 (+0.9). No regressions in any category.
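The pre-day1 entry is worth a worked example: when a query has more relevant documents than k, recall@k has a ceiling below 100% even for a perfect ranking. The sketch below assumes the standard definition of recall@k (hits in the top k divided by the number of relevant documents); the function names and document IDs are illustrative, and the pipeline's exact metric code is not shown on this page.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def recall_ceiling(k: int, num_relevant: int) -> float:
    """Best recall@k any ranking can reach with num_relevant gold documents."""
    if num_relevant <= 0:
        raise ValueError("num_relevant must be positive")
    return min(k, num_relevant) / num_relevant


# A perfect top-15 against a gold set of 40 relevant documents still scores 0.375.
gold_40 = {f"doc{i}" for i in range(40)}
perfect_top_15 = [f"doc{i}" for i in range(15)]
print(recall_at_k(perfect_top_15, gold_40, k=15))  # 0.375
print(recall_ceiling(15, 40))                      # 0.375
print(recall_ceiling(15, 15))                      # 1.0
```

Capping each query at 15 relevant PMIDs, as the v1 eval set does, removes that ceiling for recall@15.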

How to read these scores

  • Green (≥ 85): production-grade. Safe to surface to end users.
  • Amber (70–84): usable, but watch for regressions and prefer human review on critical paths.
  • Red (< 70): known weakness. We publish this honestly so you can decide whether to gate the feature behind an opt-in.
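If you want to apply the same bands to scores pulled from this page, a minimal sketch follows. The function name is hypothetical, and treating everything from 70 up to (but not including) 85 as amber is an assumption, since the list above does not say how scores between 84 and 85 are classified.

```python
def score_band(score: float) -> str:
    """Map a 0-100 composite score to the colour band described above."""
    if score >= 85:
        return "green"
    if score >= 70:  # assumption: amber covers 70 up to, but not including, 85
        return "amber"
    return "red"


print(score_band(88.8))  # green
print(score_band(70.9))  # amber
print(score_band(54.6))  # red
```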

Methodology and source quality scoring are documented at /methodology. The autoresearch loop and prompt mutation framework live in jobs/autoresearch/.