Eval Reports

We publish our evaluator baselines so customers can audit the quality of the Priolix Evidence API. Numbers below are auto-loaded from the autoresearch pipeline; re-running `python3 jobs/autoresearch/run_all_evals.py` refreshes this page.

88.8

Overall Composite

Mean of non-skipped evaluator composites from the latest aggregate run. Last refreshed: 2026-04-20T16:31:47+00:00.

FAQ Generation

98.8

Quality of LLM-generated supplement FAQs (grounding, accuracy, safety, hallucination).

Composite	98.8%
Grounding	98.8%
Accuracy	100.0%
Safety	96.2%
Hallucination-free	100.0%

Last updated: 2026-04-18T21:37:58+00:00

Chat Answer Quality

91.2

End-to-end chat answer quality across grounded/safety/misconception edge cases.

Composite	91.2%
Grounding	85.1%
Accuracy	93.9%
Safety	96.3%
Hallucination-free	89.5%

Last updated: 2026-04-20T12:29:26+00:00

Evidence Retrieval

87.0

Hybrid (BM25 + dense) retrieval quality on the 148-query v2 eval set (5 production-gap categories + 24 v1 baseline).

Composite	87.0%
Recall@15	83.6%
Recall@50	100.0%
MRR	94.6%
nDCG@15	86.3%

Production: hybrid retrieval (BM25 + BGE-M3 via Reciprocal Rank Fusion) over 14,344 sources. Cross-encoder reranker (bge-reranker-v2-m3) enabled for queries >=10 words via RERANKER_ROUTER=auto. HyDE + query decomposition shipped behind feature flags, default OFF. Eval set v2 (148 queries) is now default; v1 (24 queries, 87.0 composite) preserved as a tagged sub-bucket for trend continuity.

Last updated: 2026-04-27T21:19:44+00:00

Biomarker → Evidence Mapping

97.6

Quality of biomarker → condition / supplement / lifestyle mappings used by the lab API.

Composite	97.6%
Condition Relevance	93.8%
Supplement Evidence	99.4%
Lifestyle Fit	99.7%

Post-Phase 4. Tier-D backfill (173 PubMed sources for 7 slugs) lifted supplement_evidence to 99.4.

Last updated: 2026-04-19T23:33:01+00:00

Blood Test Extraction

89.2

Accuracy of GPT-4o vision extraction + cross-reference against curated lab reports.

Composite	89.2%
Extraction	93.7%
Flagging	90.6%
Cross-ref	83.3%

Post-Phase 4. Flag accuracy held at 90.6; cross-reference quality improved to 83.3 (Tier-D promotions benefit supplement evidence quality).

Last updated: 2026-04-19T23:33:01+00:00

Retrieval — under the hood

Production retrieval fuses BM25 (keyword) rankings with dense-embedding rankings from BAAI/bge-m3 via Reciprocal Rank Fusion, over a 14,344-source corpus. Eval set for the ablation and bake-off below: the 24 labelled v1 queries, each capped at 15 relevant PMIDs.
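The 15-PMID cap matters because recall@15 cannot reach 100% for a query with more than 15 relevant documents, however good the ranking is. A minimal sketch of that ceiling, assuming the standard definition of recall@k; the function names and toy IDs are illustrative and not taken from the pipeline:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top-k of a ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def max_recall_at_k(num_relevant, k):
    """Best recall@k any ranking can achieve with num_relevant gold labels."""
    return min(k, num_relevant) / num_relevant


# A perfect ranking over 30 relevant documents (hypothetical IDs):
perfect_ranking = list(range(30))
capped = recall_at_k(perfect_ranking, range(30), k=15)
ceiling = max_recall_at_k(30, k=15)
```

With 30 gold labels, even a perfect ranking scores 0.5 on recall@15, which is why an uncapped gold set makes the metric label-bound.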

Component contribution (ablation)

What each retrieval leg buys us, holding everything else constant:

Mode	Composite	Recall@15	Recall@20	Recall@50	MRR
Hybrid (production)	87.0%	83.6%	91.9%	100.0%	94.6%
Dense Only	79.6%	74.2%	86.1%	98.6%	91.2%
BM25 Only	72.5%	65.6%	77.2%	95.3%	89.2%

Both legs pull their weight — hybrid beats dense-only by 7.4 composite points and BM25-only by 14.5 points. BM25 catches exact technical strings (PMIDs, drug names, doses); dense catches synonyms and paraphrasing. Their failure modes are complementary.
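The fusion step that combines the two legs can be sketched as follows. This is plain, unweighted Reciprocal Rank Fusion; the constant `k=60` is the conventional default from the RRF literature, and both it and the toy document IDs are assumptions here. The production ranker may use different constants or per-leg weights.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # hypothetical keyword-leg output
dense_ranking = ["b", "d", "a"]  # hypothetical dense-leg output
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both legs ("a" and "b") rise above documents found by only one, which is how complementary failure modes turn into a better combined ranking.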

Embedding bake-off

Holding BM25 + ranking constant, swapping the dense embedder:

Model	Dim	Composite	Recall@15
`BAAI/bge-m3`	1024	87.0%	83.6%
`BAAI/bge-large-en-v1.5`	1024	81.2%	76.1%
`NeuML/pubmedbert-base-embeddings`	768	80.9%	73.3%

BGE-M3 wins by ~6 composite points over both alternatives. Notably, the biomedical specialist (PubMedBERT) loses to both general-purpose models: on this corpus, a larger modern encoder (BGE-M3, ~568 M params) beats domain specialization (~110 M params).

Eval set v2 — per-category breakdown

The legacy 24-query gold set saturated at composite 87.0 (recall@50 = 100%), so we expanded it to 148 queries across five production-gap categories using local Qwen 3.6 35B for both query generation and relevance labelling. The pipeline is fully automated and reproducible — no human review.

Qwen 3.6 35B (a3b) via Ollama; 3-leg pool union (hybrid/bm25_only/dense_only top-50); 6 hard rule filters; no human review

Category	Queries	Composite	Recall@15	MRR
v1 baseline	24	87.0%	83.6%	94.6%
paraphrase	10	82.7%	77.0%	100.0%
typo	6	74.3%	70.0%	86.7%
long natural	60	66.6%	56.5%	91.0%
dose response	20	61.3%	56.0%	81.6%
drug interact	28	54.6%	50.2%	95.5%

v2 still exposes production gaps vs the saturated v1 set, but the gaps are smaller than initially reported because an ID-matching bug had been under-crediting retrieval (see id_match_fix). After the fix, drug_interact is 54.6 (vs v1 = 87.0, ~32 pts headroom) and remains the hardest category. Both long_natural and dose_response also have ~20-26 pts of headroom. Set EVAL_SET_VERSION=v1 to fall back to the legacy 24-query set. Hybrid vs BM25-only spread: 15.7 pts overall, min 7.2 pts on typo, so improvements will be measurable.
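As a sanity check on the per-category table, the overall v2 composite can be approximated as a query-count-weighted mean of the category composites. The weighting scheme is an assumption (the aggregator's actual formula is not documented here), but it reproduces the 68.3 reported for the eval-v2-idfix run:

```python
# (category, queries, composite) copied from the per-category table above.
rows = [
    ("v1 baseline", 24, 87.0),
    ("paraphrase", 10, 82.7),
    ("typo", 6, 74.3),
    ("long natural", 60, 66.6),
    ("dose response", 20, 61.3),
    ("drug interact", 28, 54.6),
]

total_queries = sum(n for _, n, _ in rows)
weighted_composite = sum(n * score for _, n, score in rows) / total_queries

# Headroom of each category relative to the v1 baseline composite.
v1_composite = rows[0][2]
headroom = {name: round(v1_composite - score, 1) for name, _, score in rows}
```

Because long natural contributes 60 of the 148 queries, it dominates the overall number, so gains there move the headline composite more than equal gains on typo.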

How we got here

Version	Date	Composite	Note
pre-day1	2026-04-26	47.7%	label-bound metric (gold set had >15 PMIDs/query, capping recall@15 mathematically)
post-day1	2026-04-27	87.0%	metric fix + multi-query BM25 fan-out + rebalanced ranking weights
post-day3	2026-04-27	87.0%	HyDE + decomposition shipped behind flags, kept OFF (HyDE hurts top-15 by 3 pts on this eval)
post-day4	2026-04-27	87.0%	embedding bake-off confirmed BGE-M3 wins by 6 pts vs PubMedBERT and BGE-large-en
eval-v2	2026-04-27	60.9%	eval set expanded 24->148 queries via local Qwen 3.6 (long_natural, drug_interact, dose_response, paraphrase, typo). v1 composite preserved at 87.0; v2 drops to 60.9 because new categories expose real production gaps -- this is the desired headroom for future work.
eval-v2-idfix	2026-04-27	68.3%	fixed eval ID-matching: curated `Interaction: X x Y` rows have empty pmid; metrics now match by source_id. drug_interact 22.6 -> 54.6 (+32 pts), overall 60.9 -> 68.3 (+7.4 pts). Retrieval was already finding these records at rank 1; we just weren't crediting it.
reranker-router-prod	2026-04-27	70.0%	RERANKER_ROUTER=auto + RERANKER_MIN_WORDS=10 flipped on in production. Cross-encoder (bge-reranker-v2-m3) runs only on queries >=10 words. v2 composite 68.3 -> 70.0 (+1.7), dose_response +4.6, drug_interact +1.4, long_natural +1.6, paraphrase +1.4, v1_baseline 0 (canary preserved). Prod latency: short ~0.4s, long ~0.5s (warm model).
smarter-router-signals	2026-04-27	70.9%	Added typo detection (edit-distance signal) and drug-interaction pattern detection to reranker router. Typo +9.0 (74.3->83.3), drug_interact +3.1 (56.0->59.1), v1_baseline still 87.0 (canary preserved). Overall 70.0->70.9 (+0.9). No regressions in any category.
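The length-based part of the reranker router described in the last two rows can be sketched like this. Only the variable names `RERANKER_ROUTER` and `RERANKER_MIN_WORDS` and the values `auto` and `10` come from the notes above; the function name, the whitespace word count, the `off` default, and the treatment of any non-`auto` mode as disabled are assumptions. The typo and drug-interaction signals are deliberately left out.

```python
import os


def should_rerank(query, env=None):
    """Decide whether to run the cross-encoder for this query (sketch only)."""
    env = os.environ if env is None else env
    mode = env.get("RERANKER_ROUTER", "off")             # assumed default
    min_words = int(env.get("RERANKER_MIN_WORDS", "10"))
    if mode != "auto":
        return False  # assumption: only "auto" enables routing
    # Assumption: word count is a simple whitespace split.
    return len(query.split()) >= min_words


config = {"RERANKER_ROUTER": "auto", "RERANKER_MIN_WORDS": "10"}
short_query = "magnesium sleep dose"
long_query = "does taking magnesium glycinate before bed improve deep sleep quality"
decisions = (should_rerank(short_query, config), should_rerank(long_query, config))
```

Routing on query length keeps short keyword lookups on the cheaper hybrid path and applies the cross-encoder only to the longer queries.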

How to read these scores

Green (≥ 85): production-grade. Safe to surface to end users.
Amber (70 to < 85): usable, but watch for regressions and prefer human review on critical paths.
Red (< 70): known weakness. We publish this honestly so you can decide whether to gate the feature behind an opt-in.
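For consumers who want to gate features on these bands programmatically, the mapping can be sketched as below. The function name is hypothetical, and the sketch assumes that scores between 84 and 85 fall into amber:

```python
def score_band(score):
    """Map a 0-100 composite score to its colour band."""
    if score >= 85:
        return "green"
    if score >= 70:
        return "amber"
    return "red"


# Composites quoted on this page, mapped to their bands.
bands = {score: score_band(score) for score in (98.8, 87.0, 70.9, 68.3)}
```

Under this mapping, the v1 retrieval composite (87.0) is green, the latest v2 composite (70.9) is amber, and the eval-v2-idfix composite (68.3) is red.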

Methodology and source quality scoring are documented at /methodology. The autoresearch loop and prompt mutation framework live in jobs/autoresearch/.