Single-file Python CLI to batch-test multiple LLM models with predefined queries. Supports YAML/JSON config, reasoning detection (<think> tags and reasoning_content field), per-query token/speed stats, and graceful API error handling. Install with `pip install -e .` to get the `llmqt` command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
llmqt — LLM Query Tester
Batch-test multiple LLM models against a set of queries. Results are saved as nicely formatted Markdown files — one per model — including per-query stats and a summary table.
Install
pip install -e .
This puts the llmqt command on your PATH.
Setup
Export your API credentials:
export OPENAI_API_KEY=your_key_here
export OPENAI_API_BASE=https://your-endpoint/v1 # optional, for custom/local endpoints
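A minimal sketch of how these two variables might be read in Python. The helper name `load_api_settings` is hypothetical, and treating an unset `OPENAI_API_BASE` as "use the client's default endpoint" is an assumption rather than documented llmqt behaviour.

```python
import os


def load_api_settings(env=None):
    """Return (api_key, base_url); base_url is None when no custom endpoint is set."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early with a readable message instead of a later HTTP 401.
        raise RuntimeError("OPENAI_API_KEY is not set")
    # OPENAI_API_BASE is optional; an empty string is treated like "unset".
    base_url = env.get("OPENAI_API_BASE") or None
    return api_key, base_url
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.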
Usage
llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]
Examples:
llmqt prompt.md test1.yaml
llmqt prompt.md test1.yaml test2.yaml test3.json
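The command-line shape above (one system prompt, then one or more config files) could be parsed with `argparse` roughly as follows. This is an illustrative sketch, not llmqt's actual parser; `build_parser` and the argument names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for: llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]"""
    parser = argparse.ArgumentParser(prog="llmqt")
    parser.add_argument("system_prompt", help="Markdown file holding the system prompt")
    # nargs="+" requires at least one config file and collects the rest into a list.
    parser.add_argument("configs", nargs="+", help="YAML or JSON test configs")
    return parser


args = build_parser().parse_args(["prompt.md", "test1.yaml", "test2.yaml", "test3.json"])
print(args.system_prompt, args.configs)
```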
Outputs are written to ./<config_stem>/<model_name>.md in the current working directory.
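The output path rule can be sketched with `pathlib`. The function name `output_path` is hypothetical, and this sketch does not sanitise model names that contain path separators; how llmqt handles those is not stated here.

```python
from pathlib import Path


def output_path(config_file, model_name, cwd="."):
    """Map a config file and a model name to <cwd>/<config_stem>/<model_name>.md."""
    # Path.stem drops the directory and the final extension: "cfg/test1.yaml" -> "test1".
    stem = Path(config_file).stem
    return Path(cwd) / stem / f"{model_name}.md"
```

For example, `output_path("configs/test1.yaml", "gpt-4o-mini")` yields the relative path `test1/gpt-4o-mini.md`, independent of where the config file itself lives.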
Config file format
YAML (.yaml / .yml) and JSON (.json) are both supported.
models:
- gpt-4o-mini
- gpt-4o
queries:
- "What is the capital of France?"
- "Explain TCP vs UDP."
- "Write a Python prime-checker function."
See example_test.yaml and example_system_prompt.md.
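A hedged sketch of loading such a config by file extension. It assumes PyYAML (`yaml.safe_load`) for the YAML case; the function name `load_config` and the validation rules are illustrative, not necessarily what llmqt does.

```python
import json
from pathlib import Path


def load_config(path):
    """Load a .yaml/.yml/.json config and check that 'models' and 'queries' are lists."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix in (".yaml", ".yml"):
        import yaml  # PyYAML, assumed to be installed; imported only when needed
        data = yaml.safe_load(text)
    elif suffix == ".json":
        data = json.loads(text)
    else:
        raise ValueError(f"{path}: unsupported config extension {suffix!r}")
    if not isinstance(data, dict):
        raise ValueError(f"{path}: top level must be a mapping")
    for key in ("models", "queries"):
        if not isinstance(data.get(key), list) or not data[key]:
            raise ValueError(f"{path}: '{key}' must be a non-empty list")
    return data
```

The JSON equivalent of the YAML above is an object with the same two keys, `models` and `queries`, each holding a list of strings.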
Output format
For llmqt prompt.md test1.yaml with models gpt-4o-mini and gpt-4o:
test1/
  gpt-4o-mini.md
  gpt-4o.md
Each file contains:
- A statistics table (elapsed time, prompt/completion tokens, tok/s per query + totals)
- For each query: the query text, per-query stats, optional Reasoning section (if the model returns chain-of-thought), and the Response
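The per-query and total figures could be derived as in the sketch below. It assumes tok/s means completion tokens divided by elapsed wall-clock seconds; the README does not define the metric, and the function and key names here are hypothetical.

```python
def tokens_per_second(completion_tokens, elapsed_seconds):
    """Completion tokens per second; 0.0 when no time was measured."""
    if elapsed_seconds <= 0:
        return 0.0
    return completion_tokens / elapsed_seconds


def totals(rows):
    """Sum per-query stats; each row has 'elapsed', 'prompt_tokens', 'completion_tokens'."""
    elapsed = sum(r["elapsed"] for r in rows)
    prompt = sum(r["prompt_tokens"] for r in rows)
    completion = sum(r["completion_tokens"] for r in rows)
    return {
        "elapsed": elapsed,
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Overall speed is computed from the summed values, not averaged per query.
        "tok_per_s": tokens_per_second(completion, elapsed),
    }
```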
Reasoning detection
Reasoning content is extracted automatically from:
- The reasoning_content field on the message (DeepSeek API style)
- <think>...</think> tags in the response content (DeepSeek R1 / QwQ open-source style)
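The two detection paths can be sketched as a single function. This is an illustration under assumptions: the message is modelled as a plain dict, the `reasoning_content` field takes precedence over `<think>` tags, and the name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match across newlines, so several <think> blocks are found separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning, response); reasoning is None when none was detected."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning:
        # Dedicated field: the content is already the final answer.
        return reasoning.strip(), content.strip()
    blocks = THINK_RE.findall(content)
    if blocks:
        # Inline tags: lift the blocks out and keep the remainder as the response.
        joined = "\n\n".join(block.strip() for block in blocks)
        return joined, THINK_RE.sub("", content).strip()
    return None, content.strip()
```

An unterminated `<think>` tag (for example, from a truncated response) is left in the response text by this sketch.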
Execution order
for each config file:
    for each model:
        for each query → POST to API, wait for response
        write <config_stem>/<model>.md in CWD
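The loop order above might look like this in Python. The callables `load_config`, `ask`, and `render` are hypothetical stand-ins injected by the caller (config parsing, one blocking API request, and Markdown rendering); this sketches the control flow only and is not llmqt's actual code.

```python
from pathlib import Path


def run_all(config_files, load_config, ask, render, out_root="."):
    """Run every query for every model of every config, one request at a time."""
    written = []
    for config_file in config_files:
        config = load_config(config_file)
        out_dir = Path(out_root) / Path(config_file).stem
        out_dir.mkdir(parents=True, exist_ok=True)
        for model in config["models"]:
            # Queries run sequentially: each call returns before the next starts.
            results = [ask(model, query) for query in config["queries"]]
            target = out_dir / f"{model}.md"
            # One file per model, written once all of its queries have finished.
            target.write_text(render(model, results), encoding="utf-8")
            written.append(target)
    return written
```

Injecting the three callables keeps the loop testable with stubs, without network access.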