Eval Reports
We publish our evaluator baselines so customers can audit the quality of the Priolix Evidence API.
Numbers below are auto-loaded from the autoresearch pipeline; re-running `python3 jobs/autoresearch/run_all_evals.py` refreshes this page.
Overall Composite
Mean of non-skipped evaluator composites from the latest aggregate run. Last refreshed: .
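As a rough illustration of that definition, the sketch below averages the per-evaluator composites listed on this page. The function name, the dict keys, and the convention of marking a skipped evaluator with `None` are assumptions made for the example, not the pipeline's actual interface.

```python
from statistics import fmean
from typing import Mapping, Optional


def overall_composite(composites: Mapping[str, Optional[float]]) -> float:
    """Unweighted mean of evaluator composites, ignoring skipped ones (None)."""
    scored = [value for value in composites.values() if value is not None]
    if not scored:
        raise ValueError("no non-skipped evaluators to average")
    return fmean(scored)


# Composites as listed on this page; the keys are illustrative labels only.
latest = {
    "faq_generation": 98.8,
    "chat_answer_quality": 91.2,
    "evidence_retrieval": 87.0,
    "biomarker_mapping": 97.6,
    "blood_test_extraction": 89.2,
}

print(round(overall_composite(latest), 2))
```

A skipped evaluator (value `None`) simply drops out of the average instead of counting as zero.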
FAQ Generation
Composite: 98.8. Quality of LLM-generated supplement FAQs (grounding, accuracy, safety, hallucination).
| Composite | 98.8% |
|---|---|
| Grounding | 98.8% |
| Accuracy | 100.0% |
| Safety | 96.2% |
| Hallucination-free | 100.0% |
Chat Answer Quality
Composite: 91.2. End-to-end chat answer quality across grounded/safety/misconception edge cases.
| Composite | 91.2% |
|---|---|
| Grounding | 85.1% |
| Accuracy | 93.9% |
| Safety | 96.3% |
| Hallucination-free | 89.5% |
Evidence Retrieval
Composite: 87.0. Hybrid (BM25 + dense) retrieval quality on the 148-query v2 eval set (5 production-gap categories plus the 24-query v1 baseline).
| Composite | 87.0% |
|---|---|
| Recall@15 | 83.6% |
| Recall@50 | 100.0% |
| MRR | 94.6% |
| nDCG@15 | 86.3% |
Production: hybrid retrieval (BM25 + BGE-M3 via Reciprocal Rank Fusion) over 14,344 sources. Cross-encoder reranker (bge-reranker-v2-m3) enabled for queries >=10 words via RERANKER_ROUTER=auto. HyDE + query decomposition shipped behind feature flags, default OFF. Eval set v2 (148 queries) is now default; v1 (24 queries, 87.0 composite) preserved as a tagged sub-bucket for trend continuity.
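The routing rule described above (rerank only when the query has at least 10 words) can be sketched as follows. `RERANKER_ROUTER` and `RERANKER_MIN_WORDS` are the settings named on this page; the function names, whitespace-based word counting, the fallback defaults, and the behaviour for any value other than `auto` are assumptions for illustration.

```python
import os
from typing import Mapping


def should_rerank(query: str, router: str = "auto", min_words: int = 10) -> bool:
    """Return True when the cross-encoder reranker should run for this query."""
    if router != "auto":
        # Assumption: any value other than "auto" turns the reranker off.
        return False
    # Assumption: words are counted by splitting on whitespace.
    return len(query.split()) >= min_words


def settings_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Read the router settings; the fallback defaults here are assumptions."""
    return {
        "router": env.get("RERANKER_ROUTER", "auto"),
        "min_words": int(env.get("RERANKER_MIN_WORDS", "10")),
    }


short_query = "magnesium for sleep"
long_query = (
    "does taking magnesium glycinate before bed improve sleep quality in older adults"
)

for q in (short_query, long_query):
    print(len(q.split()), should_rerank(q))
```

Keeping the gate a cheap, pure function of the query means short queries never touch the reranker model at all.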
Biomarker → Evidence Mapping
Composite: 97.6. Quality of biomarker → condition / supplement / lifestyle mappings used by the lab API.
| Composite | 97.6% |
|---|---|
| Condition Relevance | 93.8% |
| Supplement Evidence | 99.4% |
| Lifestyle Fit | 99.7% |
Post-Phase 4. Tier-D backfill (173 PubMed sources for 7 slugs) lifted supplement_evidence to 99.4.
Blood Test Extraction
Composite: 89.2. Accuracy of GPT-4o vision extraction + cross-reference against curated lab reports.
| Composite | 89.2% |
|---|---|
| Extraction | 93.7% |
| Flagging | 90.6% |
| Cross-ref | 83.3% |
Post-Phase 4. Flag accuracy held at 90.6, crossref quality improved to 83.3 (Tier-D promotions benefit supplement evidence quality).
Retrieval — under the hood
Production retrieval combines BM25 (keyword) and dense embeddings (BAAI/bge-m3) via Reciprocal Rank Fusion, over a 14,344-source corpus.
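For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the standard formulation: each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The constant `k = 60` is the commonly used default from the RRF literature and is an assumption here, as are the function name and the toy source IDs; the production ranking weights are not shown on this page.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings standing in for the BM25 leg and the dense leg.
bm25_top = ["src_a", "src_b", "src_c"]
dense_top = ["src_b", "src_d", "src_a"]

fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)  # ['src_b', 'src_a', 'src_d', 'src_c']
```

Because RRF uses only ranks, BM25 scores and embedding similarities never need to be put on a common scale.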
Eval set: 24 labelled queries, capped at 15 relevant PMIDs each.
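The cap matters because recall@15 cannot reach 100% when a query has more than 15 relevant documents. The sketch below uses the textbook definitions of recall@k, reciprocal rank (averaged over queries to give MRR), and binary-relevance nDCG@k; the function names and toy IDs are assumptions, and the pipeline's exact metric code may differ.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k (an assumption; graded labels would differ)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# A perfect top-15 still scores only 0.5 recall@15 against 30 gold labels.
perfect_top15 = [f"pmid_{i}" for i in range(15)]
gold_30 = {f"pmid_{i}" for i in range(30)}
gold_15 = {f"pmid_{i}" for i in range(15)}

print(recall_at_k(perfect_top15, gold_30, 15))  # 0.5
print(recall_at_k(perfect_top15, gold_15, 15))  # 1.0
```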
Component contribution (ablation)
What each retrieval leg buys us, holding everything else constant:
| Mode | Composite | Recall@15 | Recall@20 | Recall@50 | MRR |
|---|---|---|---|---|---|
| Hybrid (production) | 87.0% | 83.6% | 91.9% | 100.0% | 94.6% |
| Dense Only | 79.6% | 74.2% | 86.1% | 98.6% | 91.2% |
| BM25 Only | 72.5% | 65.6% | 77.2% | 95.3% | 89.2% |
Both legs pull their weight — hybrid beats dense-only by 7.4 composite points and BM25-only by 14.5 points. BM25 catches exact technical strings (PMIDs, drug names, doses); dense catches synonyms and paraphrasing. Their failure modes are complementary.
Embedding bake-off
Holding BM25 + ranking constant, swapping the dense embedder:
| Model | Dim | Composite | Recall@15 |
|---|---|---|---|
| BAAI/bge-m3 | 1024 | 87.0% | 83.6% |
| BAAI/bge-large-en-v1.5 | 1024 | 81.2% | 76.1% |
| NeuML/pubmedbert-base-embeddings | 768 | 80.9% | 73.3% |
BGE-M3 wins by ~6 composite points over both alternatives. Notably, the biomedical specialist (PubMedBERT) loses to two general-purpose models: modern scale (roughly 0.3-0.6 B params) beats domain specialization (~110 M params) on this corpus.
Eval set v2 — per-category breakdown
The legacy 24-query gold set saturated at composite 87.0 (recall@50 = 100%), so we expanded it to 148 queries across five production-gap categories using local Qwen 3.6 35B for both query generation and relevance labelling. The pipeline is fully automated and reproducible — no human review.
Labelling setup: Qwen 3.6 35B (a3b) via Ollama; 3-leg pool union (hybrid / bm25_only / dense_only top-50); 6 hard rule filters; no human review.
| Category | Queries | Composite | Recall@15 | MRR |
|---|---|---|---|---|
| v1 baseline | 24 | 87.0% | 83.6% | 94.6% |
| paraphrase | 10 | 82.7% | 77.0% | 100.0% |
| typo | 6 | 74.3% | 70.0% | 86.7% |
| long natural | 60 | 66.6% | 56.5% | 91.0% |
| dose response | 20 | 61.3% | 56.0% | 81.6% |
| drug interact | 28 | 54.6% | 50.2% | 95.5% |
v2 still exposes production gaps vs the saturated v1 set, but the gaps are smaller than initially reported: an ID-matching bug in the eval had inflated them (see id_match_fix). After the fix, drug_interact is 54.6 (vs v1 = 87.0, ~32 pts headroom) and remains the hardest category; long_natural and dose_response also have ~20-25 pts of headroom. Set EVAL_SET_VERSION=v1 to fall back to the legacy 24-query set.
Hybrid vs BM25-only spread: 15.7 pts overall, min 7.2 pts on typo, so improvements will be measurable.
How we got here
| Version | Date | Composite | Note |
|---|---|---|---|
| pre-day1 | 2026-04-26 | 47.7% | label-bound metric (gold set had >15 PMIDs/query, capping recall@15 mathematically) |
| post-day1 | 2026-04-27 | 87.0% | metric fix + multi-query BM25 fan-out + rebalanced ranking weights |
| post-day3 | 2026-04-27 | 87.0% | HyDE + decomposition shipped behind flags, kept OFF (HyDE hurts top-15 by 3 pts on this eval) |
| post-day4 | 2026-04-27 | 87.0% | embedding bake-off confirmed BGE-M3 wins by 6 pts vs PubMedBERT and BGE-large-en |
| eval-v2 | 2026-04-27 | 60.9% | eval set expanded 24->148 queries via local Qwen 3.6 (long_natural, drug_interact, dose_response, paraphrase, typo). v1 composite preserved at 87.0; v2 drops to 60.9 because new categories expose real production gaps -- this is the desired headroom for future work. |
| eval-v2-idfix | 2026-04-27 | 68.3% | fixed eval ID-matching: curated `Interaction: X x Y` rows have empty pmid; metrics now match by source_id. drug_interact 22.6 -> 54.6 (+32 pts), overall 60.9 -> 68.3 (+7.4 pts). Retrieval was already finding these records at rank 1; we just weren't crediting it. |
| reranker-router-prod | 2026-04-27 | 70.0% | RERANKER_ROUTER=auto + RERANKER_MIN_WORDS=10 flipped on in production. Cross-encoder (bge-reranker-v2-m3) runs only on queries >=10 words. v2 composite 68.3 -> 70.0 (+1.7), dose_response +4.6, drug_interact +1.4, long_natural +1.6, paraphrase +1.4, v1_baseline 0 (canary preserved). Prod latency: short ~0.4s, long ~0.5s (warm model). |
| smarter-router-signals | 2026-04-27 | 70.9% | Added typo detection (edit-distance signal) and drug-interaction pattern detection to reranker router. Typo +9.0 (74.3->83.3), drug_interact +3.1 (56.0->59.1), v1_baseline still 87.0 (canary preserved). Overall 70.0->70.9 (+0.9). No regressions in any category. |
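The eval-v2-idfix row is easier to follow with a small example. The sketch below contrasts matching by `pmid` with matching by `source_id` for a curated interaction row whose `pmid` is empty. The record shape, the ID values, and the function names are hypothetical; only the two field names come from the table above.

```python
from typing import Dict, List

Record = Dict[str, str]


def hits_by_pmid(retrieved: List[Record], gold: List[Record]) -> int:
    """Old behaviour: rows with an empty pmid can never be credited."""
    gold_pmids = {g["pmid"] for g in gold if g["pmid"]}
    return sum(1 for r in retrieved if r["pmid"] and r["pmid"] in gold_pmids)


def hits_by_source_id(retrieved: List[Record], gold: List[Record]) -> int:
    """Fixed behaviour: every row has a source_id, so curated rows count."""
    gold_ids = {g["source_id"] for g in gold}
    return sum(1 for r in retrieved if r["source_id"] in gold_ids)


# Hypothetical gold labels: one curated interaction row (no PMID), one PubMed row.
gold = [
    {"source_id": "interaction-001", "pmid": ""},
    {"source_id": "pubmed-123", "pmid": "123"},
]
# Hypothetical retrieval result with the curated row at rank 1.
retrieved = [
    {"source_id": "interaction-001", "pmid": ""},
    {"source_id": "pubmed-999", "pmid": "999"},
]

print(hits_by_pmid(retrieved, gold))       # 0
print(hits_by_source_id(retrieved, gold))  # 1
```

The retrieval output is identical in both calls; only the scoring key changes, which is why the fix raised the metric without any change to production.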
How to read these scores
- Green (≥ 85): production-grade. Safe to surface to end users.
- Amber (70–84): usable, but watch for regressions and prefer human review on critical paths.
- Red (< 70): known weakness. We publish this honestly so you can decide whether to gate the feature behind an opt-in.
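If you consume these scores programmatically, the bands above map to a simple threshold check. This is a minimal sketch: the function name is hypothetical, and treating amber as "70 up to but not including 85" (so 84.9 is amber) is an assumption about how fractional scores fall between the listed ranges.

```python
def score_band(score: float) -> str:
    """Map a 0-100 composite to the colour band used on this page."""
    if score >= 85:
        return "green"
    if score >= 70:
        # Assumption: fractional scores between 84 and 85 count as amber.
        return "amber"
    return "red"


for value in (87.0, 70.9, 54.6):
    print(value, score_band(value))
```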
Methodology and source quality scoring are documented at /methodology. The autoresearch loop and prompt mutation framework live in `jobs/autoresearch/`.