NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips
(e.g., all ↔ any, at least ↔ at most, inserting/removing not, and ↔ or). It generates micro-contexts,
poses minimal pairs of questions, queries an LLM, and grades the answers for pairwise consistency.
- Targets a common failure mode: models read the words but miss the operator change.
- Minimal setup: small synthetic contexts; exact-match grading; transparent artifacts.
- Reproducible: seedable generation, strict prompts, and self-contained evaluation.
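As an illustration of the minimal-pair idea, here is a hypothetical sketch (not the project's actual generator) of a seeded all ↔ any pair over a tiny numeric context; the function name and question template are made up for this example:

```python
import random

# Hypothetical sketch of minimal-pair generation: the two questions differ
# only in a single quantifier ("all" vs. "any"), so a model that reads the
# words but misses the operator flip will get at least one item wrong.
def make_pair(n_items, threshold, seed=0):
    rng = random.Random(seed)  # seedable, so generation is reproducible
    values = [rng.randint(0, 9) for _ in range(n_items)]
    q_all = f"Given the numbers {values}, are all of them >= {threshold}? Answer yes or no."
    q_any = q_all.replace("all", "any", 1)  # the single-token flip
    a_all = "yes" if all(v >= threshold for v in values) else "no"
    a_any = "yes" if any(v >= threshold for v in values) else "no"
    return (q_all, a_all), (q_any, a_any)
```

Grading both items of a pair jointly (as the pair joint accuracy metric does) is what makes the flip informative: answering one item correctly by luck is not enough.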
The leaderboard below updates automatically when you generate reports. It sorts by pair joint accuracy (both items in a pair must be correct), then by item accuracy.
| timestamp | client | model | pairs | seed | item_accuracy | pair_joint_accuracy | report | run_dir |
|---|---|---|---|---|---|---|---|---|
| 20250910_091655 | openrouter | google/gemini-2.5-pro | 100 | 42 | 0.990 | 0.980 | md · html · chart · dir | results/openrouter-google-gemini-2.5-pro-pairs100-20250910_085034 |
| 20250910_084619 | openrouter | google/gemini-2.5-flash | 100 | 42 | 0.870 | 0.770 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-pairs100-20250910_084409 |
| 20250910_083514 | openrouter | openai/gpt-4o-mini | 100 | 42 | 0.775 | 0.640 | md · html · chart · dir | results/openrouter-openai-gpt-4o-mini-pairs100-20250910_083236 |
| 20250910_084052 | openrouter | google/gemini-2.5-flash-lite | 100 | 42 | 0.760 | 0.620 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-lite-pairs100-20250910_083901 |
| 20250910_082714 | echo | echo | 100 | 42 | 0.360 | 0.160 | md · html · chart · dir | results/echo-unknown-pairs100-20250910_082714 |
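The sort order can be sketched with pandas (already a project dependency); the column names follow the table headers above, and the rows below are made-up illustrations, not real results:

```python
import pandas as pd

# Leaderboard ordering: pair_joint_accuracy first, item_accuracy as the
# tie-breaker, both descending. Rows "a" and "c" tie on pair_joint_accuracy,
# so item_accuracy decides their relative order.
rows = [
    {"model": "a", "pair_joint_accuracy": 0.77, "item_accuracy": 0.87},
    {"model": "b", "pair_joint_accuracy": 0.98, "item_accuracy": 0.99},
    {"model": "c", "pair_joint_accuracy": 0.77, "item_accuracy": 0.80},
]
board = pd.DataFrame(rows).sort_values(
    ["pair_joint_accuracy", "item_accuracy"], ascending=False
)
print(board["model"].tolist())  # ['b', 'a', 'c']
```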
- Python 3.10+

uv (recommended):

```bash
# from repo root
$ uv venv
$ uv pip install -e .
```

pip (alternative):
```bash
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e .
```

Create a `.env` if you plan to use OpenRouter:
```bash
$ cp .env.example .env
# Then set:
# OPENROUTER_API_KEY=...
#
# Optionally set:
# OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions
# MODEL_NAME=openai/gpt-4o-mini
```

Offline demo (no API calls):
```bash
uv run nqmp all --pairs 100 --client echo
```

OpenRouter run:
```bash
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-4o-mini
```

When `--out` is omitted, runs go to:
```text
results/{client}-{model}-pairs{N}-{YYYYMMDD_HHMMSS}/
```
Artifacts include:

- `dataset.jsonl`
- `predictions.jsonl`
- `run.log` (JSON lines, one per LLM call)
- `run_info.json`
- `metrics_{basename}.json`
- `operator_accuracy_{basename}.png`
- `report_{basename}.md`
- `report_{basename}.html`
- `correct_predictions_{basename}.jsonl`
- `incorrect_predictions_{basename}.jsonl`
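As a sketch of how the two headline metrics could be recomputed from a predictions file, the following reads one record per line and aggregates per pair. The record schema (`pair_id` and `correct` fields) is an assumption for illustration, not a documented contract of `predictions.jsonl`:

```python
import json
from collections import defaultdict

def score(path):
    """Recompute item accuracy and pair joint accuracy from a JSONL file.

    Assumes each line is a JSON object with a `pair_id` and a boolean
    `correct` field (hypothetical schema).
    """
    pairs = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            pairs[rec["pair_id"]].append(bool(rec["correct"]))
    items = [ok for members in pairs.values() for ok in members]
    item_acc = sum(items) / len(items)
    # A pair counts only if *both* of its items are correct.
    pair_joint = sum(all(m) for m in pairs.values()) / len(pairs)
    return item_acc, pair_joint
```

This also shows why pair joint accuracy is always at most item accuracy: every incorrect item invalidates its whole pair.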
```bash
# Generate only
uv run nqmp generate --pairs 50 --seed 7

# Run over a dataset
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client echo
# or
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client openrouter --model <provider/model>

# Report or re-report
uv run nqmp report --in results/<run-dir>
```

If a run times out or fails midway (e.g., on a transient 408/409/425/429/5xx error), you can resume without losing progress. Give the run a stable `--out` directory so you can resume it later:
```bash
# First attempt
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100

# If it fails or is interrupted, resume safely (skips already-completed items)
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100 --resume
```

The OpenRouter client retries on transient statuses 408/409/425/429/5xx with exponential backoff (base 0.8 s, cap 8 s, up to 4 retries). If an item still fails, the harness logs an `llm_error` and continues to the next item.
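The retry policy described here can be sketched as follows; the exact delay schedule (a plain base × 2^attempt with no jitter) and the callable-based structure are assumptions for illustration, not the client's actual code:

```python
import time

RETRYABLE = {408, 409, 425, 429}  # transient statuses; 5xx handled below

def is_retryable(status):
    return status in RETRYABLE or 500 <= status < 600

def backoff_delay(attempt, base=0.8, cap=8.0):
    # Exponential backoff with the documented parameters:
    # base * 2**attempt seconds, capped at 8 s.
    return min(cap, base * (2 ** attempt))

def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Call `send()` (returning a (status, body) tuple) with up to 4 retries."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return status, body  # still failing: caller logs the error and moves on
```

With these parameters the delays are 0.8 s, 1.6 s, 3.2 s, and 6.4 s between the five total attempts, all under the 8 s cap.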
- Dependencies: `requests`, `python-dotenv`, `pandas`, `matplotlib`, `tabulate`, `ruff`
- Lint: `ruff check` and `ruff format --check`
- Tests: `pytest -q`
- `src/nqmp_bench/generator.py`: dataset generators (boolean and id-list operators)
- `src/nqmp_bench/harness.py`: run loop, logging, and resume
- `src/nqmp_bench/client.py`: OpenRouter client + echo stub
- `src/nqmp_bench/grader.py`: normalization and grading logic
- `src/nqmp_bench/report.py`: metrics, plots, and reports
- `src/nqmp_bench/cli.py`: CLI entry points
MIT