Smoke Test Results

Internal regression test suite (amem-smoketest): 31 QA pairs across 8 categories, evaluated with gemini-3.5-flash-low as the write-side LLM.

The smoke test is a regression test, not a benchmark. The dataset is self-authored, so scores are not directly comparable across implementations. Its purpose is to verify that retrieval quality does not degrade between versions.


Overall results (v0.3.0)

| Metric | Value |
| --- | --- |
| Average Score | 4.56 / 5.0 |
| Hit@1 | 64.0% |
| Hit@3 | 76.0% |
| MRR | 0.693 |
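
For readers unfamiliar with the ranking metrics above, here is a minimal sketch of the standard definitions of Hit@k and MRR. It assumes each question yields the 1-based rank of the first correct note (or `null` when nothing relevant is retrieved). The function names and the input shape are hypothetical, and the smoke test itself may compute these differently.

```javascript
// Sketch: Hit@k and MRR from the rank of the first relevant result per question.
// `ranks` is a hypothetical input: 1-based rank of the first correct note,
// or null when no relevant note was retrieved.
function hitAtK(ranks, k) {
  // Fraction of questions whose first correct result appears within the top k.
  const hits = ranks.filter((r) => r !== null && r <= k).length;
  return hits / ranks.length;
}

function meanReciprocalRank(ranks) {
  // Average of 1/rank; a miss contributes 0.
  const total = ranks.reduce((sum, r) => sum + (r === null ? 0 : 1 / r), 0);
  return total / ranks.length;
}

// Illustrative ranks only, not data from the smoke test.
const exampleRanks = [1, 1, 2, null, 3];
console.log(hitAtK(exampleRanks, 1)); // 0.4
console.log(hitAtK(exampleRanks, 3)); // 0.8
console.log(meanReciprocalRank(exampleRanks));
```

Hit@k only records whether a correct result appears in the top k, while MRR also rewards placing it higher in the list.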

Results by category

| Category | Avg Score | Notes |
| --- | --- | --- |
| fact | 5.00 / 5.0 | |
| temporal | 5.00 / 5.0 | |
| bfs | 5.00 / 5.0 | |
| multihop | 4.20 / 5.0 | |
| semantic | 3.60 / 5.0 | Active improvement area |
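
As a quick consistency check, the unweighted mean of the five category averages listed above reproduces the reported overall score of 4.56. The sketch below assumes equal weighting per category, which is an assumption and not something the suite documents. The helper name `unweightedMean` is hypothetical.

```javascript
// Sketch: unweighted mean of the per-category averages from the table above.
// Assumption: every category contributes equally; the real suite may weight
// categories by their number of questions.
const categoryAverages = { fact: 5.0, temporal: 5.0, bfs: 5.0, multihop: 4.2, semantic: 3.6 };

function unweightedMean(scores) {
  const values = Object.values(scores);
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

console.log(unweightedMean(categoryAverages).toFixed(2)); // 4.56
```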

BFS ablation

The 2-hop BFS graph expansion is tested in isolation using the bfs and multihop categories (10 questions):

| | BFS OFF | BFS ON | Delta |
| --- | --- | --- | --- |
| Average Score | 3.00 | 5.00 | +2.00 |
| bfs category | 2.00 | 5.00 | +3.00 |
| multihop category | 4.00 | 5.00 | +1.00 |

In this ablation, enabling BFS raises the average score from 3.00 to 5.00. BFS provides the largest single improvement of any feature in the retrieval pipeline.
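
To illustrate what a 2-hop BFS expansion does, here is a minimal sketch that starts from a set of seed notes and follows link edges up to two hops away. The adjacency `Map`, the function name `expandByBfs`, and the note ids are hypothetical. The actual retrieval pipeline may rank, filter, or cap the expanded set in ways not shown here.

```javascript
// Sketch: expand seed notes by following link edges up to maxHops away.
// `links` is a hypothetical adjacency map: note id -> array of linked note ids.
function expandByBfs(links, seedIds, maxHops = 2) {
  const visited = new Set(seedIds);
  let frontier = [...seedIds];
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next = [];
    for (const id of frontier) {
      for (const neighbor of links.get(id) ?? []) {
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next; // the next hop starts from the newly discovered notes
  }
  return [...visited];
}

// Illustrative graph only: a chain of three links starting at "company".
const exampleLinks = new Map([
  ["company", ["registrar"]],
  ["registrar", ["contactEmail"]],
  ["contactEmail", ["unrelated"]],
]);
console.log(expandByBfs(exampleLinks, ["company"])); // [ 'company', 'registrar', 'contactEmail' ]
```

With the hop limit at 2, the note three links away (`unrelated`) is never reached, which keeps the expanded set small.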

Category descriptions

| Category | What it tests |
| --- | --- |
| fact | Direct factual recall (e.g. account IDs, registration numbers) |
| temporal | Time-ordered facts where older versions should be superseded |
| bfs | Multi-note graph traversal; the answer requires following link edges |
| multihop | Two independent facts that must be joined to answer (e.g. company → registrar → contact email) |
| semantic | Paraphrased queries that don't share keywords with stored content |

Running the smoke test

```bash
cd amem-smoketest
node run_smoketest.mjs
```

By default, the smoke test uses gemini-3.5-flash-low for write-side LLM operations and gemini-pro-agent as the judge. To override the model, set AMEM_LLM_MODEL:

```bash
AMEM_LLM_MODEL=claude-sonnet-4-6 node run_smoketest.mjs
```
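
As a rough sketch of how such an override could be resolved inside a script, the helper below reads `AMEM_LLM_MODEL` and falls back to the default write-side model named above. `resolveModel` is a hypothetical name, and it is an assumption that the variable controls the write-side model. The actual run_smoketest.mjs may handle this differently.

```javascript
// Sketch: resolve the model name from an environment object, with a fallback.
// resolveModel is a hypothetical helper; the assumed default is the write-side
// model named in the text above.
function resolveModel(env, fallback = "gemini-3.5-flash-low") {
  const value = env.AMEM_LLM_MODEL;
  // Treat an unset or blank variable as "no override".
  return value && value.trim() !== "" ? value.trim() : fallback;
}

console.log(resolveModel({ AMEM_LLM_MODEL: "claude-sonnet-4-6" })); // claude-sonnet-4-6
console.log(resolveModel({})); // gemini-3.5-flash-low
```

In a real script, `process.env` would be passed as the `env` argument.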

Released under the MIT License.