NQMP — Negation & Quantifier Minimal Pairs Bench

NQMP is a tiny, deterministic LLM benchmark focused on logical sensitivity to small prompt flips (e.g., all ↔ any, at least ↔ at most, insert/remove not, and ↔ or). It generates micro-contexts, poses minimal pairs of questions, queries an LLM, and grades for pairwise consistency.
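To make this concrete, here is a hypothetical minimal pair written as a Python sketch. The field names and wording are illustrative only and do not reflect the actual dataset schema.

# Illustrative only: a hypothetical minimal pair (not the real dataset schema).
pair = {
    "context": "Flags: a=true, b=false, c=true.",
    "item_a": {"question": "Are all of the flags true? Answer yes or no.", "gold": "no"},
    "item_b": {"question": "Are any of the flags true? Answer yes or no.", "gold": "yes"},
}
# The two items differ only in the quantifier (all vs. any); a pair counts as
# correct only if the model answers both items correctly.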


Why NQMP?

  • Targets a common failure mode: models read the words but miss the operator change.
  • Minimal setup: small synthetic contexts; exact-match grading; transparent artifacts.
  • Reproducible: seedable generation, strict prompts, and self-contained evaluation.

Leaderboard

The leaderboard below updates automatically when reports are generated. It is sorted by pair joint accuracy (both items in a pair must be correct), then by item accuracy.

| timestamp | client | model | pairs | seed | item_accuracy | pair_joint_accuracy | report | run_dir |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20250910_091655 | openrouter | google/gemini-2.5-pro | 100 | 42 | 0.990 | 0.980 | md · html · chart · dir | results/openrouter-google-gemini-2.5-pro-pairs100-20250910_085034 |
| 20250910_084619 | openrouter | google/gemini-2.5-flash | 100 | 42 | 0.870 | 0.770 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-pairs100-20250910_084409 |
| 20250910_083514 | openrouter | openai/gpt-4o-mini | 100 | 42 | 0.775 | 0.640 | md · html · chart · dir | results/openrouter-openai-gpt-4o-mini-pairs100-20250910_083236 |
| 20250910_084052 | openrouter | google/gemini-2.5-flash-lite | 100 | 42 | 0.760 | 0.620 | md · html · chart · dir | results/openrouter-google-gemini-2.5-flash-lite-pairs100-20250910_083901 |
| 20250910_082714 | echo | echo | 100 | 42 | 0.360 | 0.160 | md · html · chart · dir | results/echo-unknown-pairs100-20250910_082714 |
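The relationship between the two accuracy columns can be sketched in a few lines of Python. This is illustrative only and assumes each graded item is reduced to a (pair_id, is_correct) tuple; it is not the project's internal representation.

from collections import defaultdict

def accuracy_metrics(items):
    # items: iterable of (pair_id, is_correct) tuples
    by_pair = defaultdict(list)
    for pair_id, ok in items:
        by_pair[pair_id].append(ok)
    total = sum(len(oks) for oks in by_pair.values())
    item_accuracy = sum(ok for oks in by_pair.values() for ok in oks) / total
    pair_joint_accuracy = sum(all(oks) for oks in by_pair.values()) / len(by_pair)
    return item_accuracy, pair_joint_accuracy

print(accuracy_metrics([(0, True), (0, True), (1, True), (1, False)]))  # (0.75, 0.5)

Because a pair is credited only when both of its items are correct, pair joint accuracy can never exceed item accuracy; the gap between the two columns is the signal of pairwise inconsistency.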

Requirements

  • Python 3.10+

Install

uv (recommended):

# from repo root
$ uv venv
$ uv pip install -e .

pip (alternative):

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e .

Configure

Create a .env file if you plan to use OpenRouter:

$ cp .env.example .env
# Then set:
# OPENROUTER_API_KEY=...
#
# Optionally set:
# OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions
# MODEL_NAME=openai/gpt-4o-mini
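As a rough sketch of how these variables are consumed (python-dotenv is a listed dependency; the actual client code in src/nqmp_bench/client.py may read them differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
api_key = os.environ["OPENROUTER_API_KEY"]  # required for --client openrouter
base_url = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1/chat/completions")
model = os.getenv("MODEL_NAME", "openai/gpt-4o-mini")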

Quickstart

Offline demo (no API calls):

uv run nqmp all --pairs 100 --client echo

OpenRouter run:

uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-4o-mini

Outputs

When --out is omitted, runs go to:

results/{client}-{model}-pairs{N}-{YYYYMMDD_HHMMSS}/

Artifacts include:

  • dataset.jsonl
  • predictions.jsonl
  • run.log (JSON lines, one per LLM call)
  • run_info.json
  • metrics_{basename}.json
  • operator_accuracy_{basename}.png
  • report_{basename}.md
  • report_{basename}.html
  • correct_predictions_{basename}.jsonl
  • incorrect_predictions_{basename}.jsonl
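Assuming the .jsonl artifacts are standard JSON Lines (pandas is a listed dependency), a run can be inspected ad hoc as below; exact column names depend on the schema, so this snippet only loads and previews the records:

import pandas as pd

run_dir = "results/echo-unknown-pairs100-20250910_082714"  # any run directory
dataset = pd.read_json(f"{run_dir}/dataset.jsonl", lines=True)
preds = pd.read_json(f"{run_dir}/predictions.jsonl", lines=True)
print(dataset.head())
print(preds.head())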

CLI

# Generate only
uv run nqmp generate --pairs 50 --seed 7

# Run over a dataset
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client echo
# or
uv run nqmp run --in results/<dataset-dir>/dataset.jsonl --client openrouter --model <provider/model>

# Report or re-report
uv run nqmp report --in results/<run-dir>

Resuming Runs

If a run times out or fails midway (e.g., on a transient 408/409/425/429/5xx), you can resume it without losing progress. Give the run a stable --out directory so it can be picked up again:

# First attempt
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100

# If it fails or is interrupted, resume safely (already-completed items are skipped)
uv run nqmp all --pairs 100 --client openrouter --model openai/gpt-5-nano --out results/openrouter-openai-gpt-5-nano-pairs100 --resume

Retry Behavior

The OpenRouter client retries on transient statuses 408/409/425/429/5xx with exponential backoff (base 0.8, cap 8s, up to 4 retries). If an item still fails, the harness logs an llm_error and continues to the next item.
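The policy described above amounts to something like the following sketch; the real implementation lives in src/nqmp_bench/client.py and may differ in details:

import time
import requests

RETRYABLE = {408, 409, 425, 429}

def post_with_retries(url, payload, headers, max_retries=4, base=0.8, cap=8.0):
    last = None
    for attempt in range(max_retries + 1):
        last = requests.post(url, json=payload, headers=headers, timeout=60)
        transient = last.status_code in RETRYABLE or last.status_code >= 500
        if not transient:
            last.raise_for_status()  # non-retryable 4xx errors surface immediately
            return last
        if attempt < max_retries:
            time.sleep(min(cap, base * 2 ** attempt))  # 0.8, 1.6, 3.2, 6.4 seconds
    last.raise_for_status()  # retries exhausted; the harness logs the error and moves on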

Development

  • Dependencies: requests, python-dotenv, pandas, matplotlib, tabulate, ruff.
  • Lint: ruff check and ruff format --check
  • Tests: pytest -q

Project Layout

  • src/nqmp_bench/generator.py: dataset generators (boolean and id-list operators)
  • src/nqmp_bench/harness.py: run loop, logging, and resume
  • src/nqmp_bench/client.py: OpenRouter client + echo stub
  • src/nqmp_bench/grader.py: normalization and grading logic (see the sketch after this list)
  • src/nqmp_bench/report.py: metrics, plots, and reports
  • src/nqmp_bench/cli.py: CLI entry points
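For intuition, grading amounts to exact match after light normalization. The following is an illustrative sketch only; the authoritative rules are in src/nqmp_bench/grader.py:

def normalize(text: str) -> str:
    # illustrative normalization: trim whitespace, drop a trailing period, lowercase
    return text.strip().rstrip(".").lower()

def grade(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

assert grade(" Yes. ", "yes")
assert not grade("no", "yes")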

License

MIT
