CLAUDE.md — llmqt
Project overview
llmqt (LLM Query Tester) is a single-file Python CLI that batch-tests multiple LLM models
against a set of queries. Results are written as Markdown files with per-query stats and
optional reasoning sections.
Structure
llmqt/
  llmqt.py                  # entire implementation, single module
  pyproject.toml            # build/install config; declares `llmqt` console script
  example_test.yaml         # MUST be kept up to date with every config format change
  example_system_prompt.md  # system prompt used by example_test.yaml
  README.md
  CLAUDE.md
  .gitignore
Installation
pip install -e .
Registers the llmqt entry point from pyproject.toml so the command works from any directory.
CLI signature
llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]
- First argument: path to a .md file containing the system prompt (resolved from CWD)
- Remaining arguments: one or more test config files (YAML or JSON)
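The signature above can be sketched with argparse. This is a minimal illustration, not the code in llmqt.py: the helper name `parse_args` and the attribute names `system_prompt` and `configs` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse an llmqt-style command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="llmqt")
    # First positional: the system prompt file.
    parser.add_argument("system_prompt", help="path to the .md system prompt file")
    # One or more config files; nargs="+" rejects an empty list.
    parser.add_argument("configs", nargs="+", help="YAML or JSON test config files")
    return parser.parse_args(argv)


args = parse_args(["prompt.md", "test1.yaml", "test2.json"])
print(args.system_prompt, args.configs)
```

Using `nargs="+"` makes argparse itself report an error when no config file is given.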
Environment variables
| Variable | Required | Purpose |
|---|---|---|
| OPENAI_API_KEY | Yes | API key |
| OPENAI_API_BASE | No | Custom base URL for OpenAI-compatible endpoints |
OPENAI_BASE_URL is also accepted as an alias for OPENAI_API_BASE.
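One way the variables above might be resolved is sketched here. The function name `resolve_api_settings` is hypothetical, and giving OPENAI_API_BASE precedence over the OPENAI_BASE_URL alias when both are set is an assumption, not something this document specifies.

```python
import os


def resolve_api_settings(env=None):
    """Return (api_key, base_url) from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise SystemExit("OPENAI_API_KEY is required")
    # Assumption: OPENAI_API_BASE wins if both variables are set.
    base_url = env.get("OPENAI_API_BASE") or env.get("OPENAI_BASE_URL")
    return api_key, base_url  # base_url is None when neither is set
```

A `base_url` of None would leave the client on its default endpoint.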
Config file format (YAML or JSON)
IMPORTANT: whenever the config format changes, update example_test.yaml to reflect it.
The system prompt is not part of the config file — it is passed as the first CLI argument.
YAML example
models:
- gpt-4o-mini
- gpt-4o
queries:
- "First query text"
- "Second query text"
JSON equivalent
{
"models": ["gpt-4o-mini", "gpt-4o"],
"queries": ["First query text", "Second query text"]
}
Field reference
| Field | Type | Description |
|---|---|---|
| models | list of strings | Model names; any OpenAI-compatible identifier |
| queries | list of strings | Queries sent to each model in listed order |
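A loader for this format could look roughly like the following. It is a sketch under assumptions: `load_config` is a hypothetical name, choosing the parser by file extension is a guess at the behavior, and the validation messages are illustrative. The YAML import is deferred so that JSON configs work without pyyaml installed.

```python
import json
from pathlib import Path


def load_config(path):
    """Load a YAML or JSON test config and check its two fields (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() in (".yaml", ".yml"):
        import yaml  # lazy import: only needed for YAML configs
        data = yaml.safe_load(text)
    else:
        data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError(f"{path}: top level must be a mapping")
    for field in ("models", "queries"):
        value = data.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{path}: '{field}' must be a list of strings")
    return data
```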
Execution logic
for each config file:
    for each model:
        for each query:
            POST to API (with timing), wait for response
        write <config_stem>/<model_name>.md (in CWD)
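The inner two loops can be sketched without any network access by passing the API call in as a function. Here `run_config` and `send` are hypothetical names; `send(model, query)` stands in for the real request and returns the response text.

```python
import time


def run_config(models, queries, send):
    """Run every query against every model, timing each call (hypothetical helper)."""
    results = {}
    for model in models:
        rows = []
        for query in queries:
            start = time.perf_counter()
            answer = send(model, query)  # blocks until the response arrives
            elapsed = time.perf_counter() - start
            rows.append({"query": query, "answer": answer, "elapsed": elapsed})
        # At this point one Markdown file per model would be written.
        results[model] = rows
    return results
```

Because queries run sequentially per model, the recorded elapsed time for each query covers only that query's request.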
Output directory is always relative to the current working directory, not the config file
location. This lets the user run llmqt ~/configs/prompt.md ~/configs/test1.yaml from any
writable directory and have outputs land there.
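As a small illustration of that rule, the output directory can be derived from the config file's stem and joined to the working directory. The helper name `output_dir_for` and its `cwd` parameter are assumptions added for testability.

```python
from pathlib import Path


def output_dir_for(config_path, cwd=None):
    """Return <cwd>/<config_stem>, ignoring the config file's own directory."""
    base = Path.cwd() if cwd is None else Path(cwd)
    return base / Path(config_path).stem


print(output_dir_for("~/configs/test1.yaml", cwd="/tmp/run"))
```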
Filename sanitization
Model names are sanitized for filesystem safety: characters outside [A-Za-z0-9._- ] are
replaced with _. E.g. anthropic/claude-3 → anthropic_claude-3.md.
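A minimal sketch of this rule with a regular expression follows. It assumes the hyphen and the space in the character class are both literal allowed characters; `sanitize_model_name` is a hypothetical name.

```python
import re

# Anything that is not a letter, digit, dot, underscore, hyphen, or space.
_UNSAFE = re.compile(r"[^A-Za-z0-9._\- ]")


def sanitize_model_name(name):
    """Replace filesystem-unsafe characters with underscores."""
    return _UNSAFE.sub("_", name)


print(sanitize_model_name("anthropic/claude-3") + ".md")  # anthropic_claude-3.md
```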
Reasoning detection
Checked in this order:
1. message.reasoning_content attribute (DeepSeek API / some OpenAI-compatible endpoints)
2. <think>...</think> tags in the response content (DeepSeek R1, QwQ open-source models)
If reasoning is found it is stripped from the answer and rendered in a separate section.
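The two checks can be sketched as a single function that returns the reasoning and the cleaned answer. The name `split_reasoning` is hypothetical, and handling only the first <think> block is a simplifying assumption.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content, reasoning_content=None):
    """Return (reasoning or None, answer) following the order described above."""
    # 1. A dedicated reasoning_content field takes priority.
    if reasoning_content:
        return reasoning_content.strip(), content.strip()
    # 2. Otherwise look for an inline <think>...</think> block.
    match = _THINK.search(content)
    if match:
        reasoning = match.group(1).strip()
        answer = (content[:match.start()] + content[match.end():]).strip()
        return reasoning, answer
    return None, content.strip()


print(split_reasoning("<think>plan the reply</think>\nHello!"))
```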
Output format per model file
# <model name>
**Config:** `test1.yaml`
## Statistics
| Query | Elapsed | Prompt tok | Completion tok | Total tok | tok/s |
|-------|---------|------------|----------------|-----------|-------|
| 1 | 1.2s | 45 | 120 | 165 | 100.0 |
| Total | 1.2s | 45 | 120 | 165 | 100.0 |
---
## Query 1
> <query text>
*1.2s · 120 completion tokens · 100.0 tok/s*
### Reasoning ← only present when reasoning was detected
<reasoning text>
### Response
<answer text>
---
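A statistics row in this layout could be produced as sketched below. The sample values suggest that tok/s is completion tokens divided by elapsed seconds (120 / 1.2 = 100.0); that reading, and the helper name `stats_row`, are assumptions.

```python
def stats_row(label, elapsed, prompt_tok, completion_tok):
    """Format one row of the Statistics table (hypothetical helper)."""
    total = prompt_tok + completion_tok
    # Assumption: speed is measured on completion tokens only.
    rate = completion_tok / elapsed if elapsed > 0 else 0.0
    return (
        f"| {label} | {elapsed:.1f}s | {prompt_tok} | "
        f"{completion_tok} | {total} | {rate:.1f} |"
    )


print(stats_row(1, 1.2, 45, 120))
# | 1 | 1.2s | 45 | 120 | 165 | 100.0 |
```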
Dependencies
- openai >= 1.0.0: API client
- pyyaml >= 6.0: YAML parsing (imported lazily; JSON works without it)