NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips
(e.g., all ↔ any, at least ↔ at most, inserting/removing not, and ↔ or). It generates micro-contexts,
poses minimal pairs of questions, queries an LLM, and grades the answers for pairwise consistency.
- Targets a common failure mode: models read the words but miss the operator change.
- Minimal setup: small synthetic contexts; exact-match grading; transparent artifacts.
- Reproducible: seedable generation, strict prompts, and self-contained evaluation.
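As an illustration of the minimal-pair idea, here is a hypothetical sketch (not the project's actual generator) of a seeded all ↔ any pair over a tiny numeric context; the function name and question template are made up for this example:

```python
import random

# Hypothetical sketch of minimal-pair generation: the two questions differ
# only in a single quantifier ("all" vs. "any"), so a model that reads the
# words but misses the operator flip will get at least one item wrong.
def make_pair(n_items, threshold, seed=0):
    rng = random.Random(seed)  # seedable, so generation is reproducible
    values = [rng.randint(0, 9) for _ in range(n_items)]
    q_all = f"Given the numbers {values}, are all of them >= {threshold}? Answer yes or no."
    q_any = q_all.replace("all", "any", 1)  # the single-token flip
    a_all = "yes" if all(v >= threshold for v in values) else "no"
    a_any = "yes" if any(v >= threshold for v in values) else "no"
    return (q_all, a_all), (q_any, a_any)
```

Grading both items of a pair jointly (as the pair joint accuracy metric does) is what makes the flip informative: answering one item correctly by luck is not enough.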
The leaderboard below updates automatically when you generate reports. It sorts by pair joint accuracy (both items in a pair must be correct), then by item accuracy.
| timestamp | client | model | pairs | seed | item_accuracy | pair_joint_accuracy | report | run_dir |
|---|---|---|---|---|---|---|---|---|
| 20250910_091655 | openrouter | google/gemini-2.5-pro | 100 | 42 | 0.990 | 0.980 | md · html · chart · dir | results/openrouter-google-gemini-2.5-pro-pairs100-20250910_085034 |
| 20250910_084619 | openrouter | google/gemini-2.5-flash | 100 | 42 | 0.870 | 0.770 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-pairs100-20250910_084409 |
| 20250910_083514 | openrouter | openai/gpt-4o-mini | 100 | 42 | 0.775 | 0.640 | md · html · chart · dir | results/openrouter-openai-gpt-4o-mini-pairs100-20250910_083236 |
| 20250910_084052 | openrouter | google/gemini-2.5-flash-lite | 100 | 42 | 0.760 | 0.620 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-lite-pairs100-20250910_083901 |
| 20250910_082714 | echo | echo | 100 | 42 | 0.360 | 0.160 | md · html · chart · dir | results/echo-unknown-pairs100-20250910_082714 |
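The sort order can be sketched with pandas (already a project dependency); the column names follow the table headers above, and the rows below are made-up illustrations, not real results:

```python
import pandas as pd

# Leaderboard ordering: pair_joint_accuracy first, item_accuracy as the
# tie-breaker, both descending. Rows "a" and "c" tie on pair_joint_accuracy,
# so item_accuracy decides their relative order.
rows = [
    {"model": "a", "pair_joint_accuracy": 0.77, "item_accuracy": 0.87},
    {"model": "b", "pair_joint_accuracy": 0.98, "item_accuracy": 0.99},
    {"model": "c", "pair_joint_accuracy": 0.77, "item_accuracy": 0.80},
]
board = pd.DataFrame(rows).sort_values(
    ["pair_joint_accuracy", "item_accuracy"], ascending=False
)
print(board["model"].tolist())  # ['b', 'a', 'c']
```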
- Python 3.10+

uv (recommended):

```bash
# from repo root
$ uv venv
$ uv pip install -e .
```

pip (alternative):
```bash
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e .
```

Create a `.env` if you plan to use OpenRouter:
```bash
$ cp .env.example .env
# Then set:
# OPENROUTER_API_KEY=...
#
# Optionally set:
# OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions
# MODEL_NAME=openai/gpt-4o-mini
```

Offline demo (no API calls):
```bash
uv run nqmp all --pairs 100 --client echo
```

OpenRouter run:
```bash
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-4o-mini
```

When `--out` is omitted, runs go to:
```text
results/{client}-{model}-pairs{N}-{YYYYMMDD_HHMMSS}/
```
Artifacts include:

- `dataset.jsonl`
- `predictions.jsonl`
- `run.log` (JSON lines, one per LLM call)
- `run_info.json`
- `metrics_{basename}.json`
- `operator_accuracy_{basename}.png`
- `report_{basename}.md`
- `report_{basename}.html`
- `correct_predictions_{basename}.jsonl`
- `incorrect_predictions_{basename}.jsonl`
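As a sketch of how the two headline metrics could be recomputed from a predictions file, the following reads one record per line and aggregates per pair. The record schema (`pair_id` and `correct` fields) is an assumption for illustration, not a documented contract of `predictions.jsonl`:

```python
import json
from collections import defaultdict

def score(path):
    """Recompute item accuracy and pair joint accuracy from a JSONL file.

    Assumes each line is a JSON object with a `pair_id` and a boolean
    `correct` field (hypothetical schema).
    """
    pairs = defaultdict(list)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            pairs[rec["pair_id"]].append(bool(rec["correct"]))
    items = [ok for members in pairs.values() for ok in members]
    item_acc = sum(items) / len(items)
    # A pair counts only if *both* of its items are correct.
    pair_joint = sum(all(m) for m in pairs.values()) / len(pairs)
    return item_acc, pair_joint
```

This also shows why pair joint accuracy is always at most item accuracy: every incorrect item invalidates its whole pair.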
```bash
# Generate only
uv run nqmp generate --pairs 50 --seed 7

# Run over a dataset
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client echo
# or
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client openrouter --model <provider/model>

# Report or re-report
uv run nqmp report --in results/<run-dir>
```

If a run times out or fails midway (e.g., on a transient 408/409/425/429/5xx error), you can resume without losing progress. Give the run a stable `--out` directory so you can resume it later:
```bash
# First attempt
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100

# If it fails or is interrupted, resume safely (skips already-completed items)
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100 --resume
```

The OpenRouter client retries on transient statuses 408/409/425/429/5xx with exponential backoff (base 0.8 s, cap 8 s, up to 4 retries). If an item still fails, the harness logs an `llm_error` and continues to the next item.
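The retry policy described here can be sketched as follows; the exact delay schedule (a plain base × 2^attempt with no jitter) and the callable-based structure are assumptions for illustration, not the client's actual code:

```python
import time

RETRYABLE = {408, 409, 425, 429}  # transient statuses; 5xx handled below

def is_retryable(status):
    return status in RETRYABLE or 500 <= status < 600

def backoff_delay(attempt, base=0.8, cap=8.0):
    # Exponential backoff with the documented parameters:
    # base * 2**attempt seconds, capped at 8 s.
    return min(cap, base * (2 ** attempt))

def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Call `send()` (returning a (status, body) tuple) with up to 4 retries."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return status, body  # still failing: caller logs the error and moves on
```

With these parameters the delays are 0.8 s, 1.6 s, 3.2 s, and 6.4 s between the five total attempts, all under the 8 s cap.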
- Dependencies: `requests`, `python-dotenv`, `pandas`, `matplotlib`, `tabulate`, `ruff`
- Lint: `ruff check` and `ruff format --check`
- Tests: `pytest -q`
- `src/nqmp_bench/generator.py`: dataset generators (boolean and id-list operators)
- `src/nqmp_bench/harness.py`: run loop, logging, and resume
- `src/nqmp_bench/client.py`: OpenRouter client + echo stub
- `src/nqmp_bench/grader.py`: normalization and grading logic
- `src/nqmp_bench/report.py`: metrics, plots, and reports
- `src/nqmp_bench/cli.py`: CLI entry points
MIT