Single-file Python CLI to batch-test multiple LLM models with predefined queries. Supports YAML/JSON config, reasoning detection (<think> tags and reasoning_content field), per-query token/speed stats, and graceful API error handling. Install with `pip install -e .` to get the `llmqt` command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# llmqt — LLM Query Tester
Batch-test multiple LLM models against a set of queries. Results are saved as formatted Markdown files, one per model, each with per-query stats and a summary table.

## Install

```bash
pip install -e .
```

This puts the `llmqt` command on your PATH.

## Setup

Export your API credentials:

```bash
export OPENAI_API_KEY=your_key_here
export OPENAI_API_BASE=https://your-endpoint/v1 # optional, for custom/local endpoints
```

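For orientation, a script could pick up these two variables roughly as follows. This is a minimal sketch, not llmqt's actual code: the `load_credentials` helper and the `DEFAULT_API_BASE` fallback are assumptions made for illustration.

```python
import os

# Assumed fallback; the endpoint llmqt really uses when
# OPENAI_API_BASE is unset is not documented in this README.
DEFAULT_API_BASE = "https://api.openai.com/v1"


def load_credentials(env=None):
    """Return (api_key, api_base) read from an environment-style mapping."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The base URL is optional; strip a trailing slash so paths join cleanly.
    api_base = env.get("OPENAI_API_BASE") or DEFAULT_API_BASE
    return api_key, api_base.rstrip("/")
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.
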
## Usage

```bash
llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]
```

Examples:

```bash
llmqt prompt.md test1.yaml
llmqt prompt.md test1.yaml test2.yaml test3.json
```

Outputs are written to `./<config_stem>/<model_name>.md` in the current working directory, where `<config_stem>` is the config filename without its extension (`test1.yaml` becomes `test1/`).

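The path rule above can be sketched with `pathlib`. The `output_path` helper is hypothetical, and flattening `/` in model names is an assumption; this README does not say how llmqt handles such names.

```python
from pathlib import Path


def output_path(config_file, model_name, cwd="."):
    """Build <cwd>/<config_stem>/<model_name>.md for one config/model pair."""
    stem = Path(config_file).stem  # "test1.yaml" -> "test1"
    # Assumption: "/" in a model name is flattened so the result stays one file.
    safe_model = model_name.replace("/", "_")
    return Path(cwd) / stem / f"{safe_model}.md"


print(output_path("test1.yaml", "gpt-4o-mini"))
```
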
## Config file format

YAML (`.yaml` / `.yml`) and JSON (`.json`) are both supported.

```yaml
models:
  - gpt-4o-mini
  - gpt-4o

queries:
  - "What is the capital of France?"
  - "Explain TCP vs UDP."
  - "Write a Python prime-checker function."
```

See [example_test.yaml](example_test.yaml) and [example_system_prompt.md](example_system_prompt.md).

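One way to load either format is to dispatch on the file extension, as sketched below. The `load_config` helper and its validation rules are assumptions, not llmqt's real loader. The YAML branch relies on PyYAML (`yaml.safe_load`), imported lazily so JSON configs work without it.

```python
import json
from pathlib import Path


def load_config(path):
    """Load a config file, choosing the parser from the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix in (".yaml", ".yml"):
        import yaml  # PyYAML; only needed for YAML configs
        data = yaml.safe_load(text)
    elif suffix == ".json":
        data = json.loads(text)
    else:
        raise ValueError(f"Unsupported config extension: {suffix!r}")
    # Check the two keys shown in the example config above.
    for key in ("models", "queries"):
        ok = isinstance(data, dict) and isinstance(data.get(key), list) and data[key]
        if not ok:
            raise ValueError(f"{path}: '{key}' must be a non-empty list")
    return data
```
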
## Output format

For `llmqt prompt.md test1.yaml` with models `gpt-4o-mini` and `gpt-4o`:

```
test1/
  gpt-4o-mini.md
  gpt-4o.md
```

Each file contains:

- A **statistics table** (elapsed time, prompt and completion tokens, and tok/s for each query, plus totals)
- For each query: the query text, per-query stats, an optional **Reasoning** section (if the model returns chain-of-thought), and the **Response**

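As an illustration of the per-query numbers, the sketch below computes tok/s and formats one table row. It assumes tok/s means completion tokens divided by elapsed wall-clock seconds, which this README does not confirm. The column order and both function names are hypothetical too.

```python
def tokens_per_second(completion_tokens, elapsed_seconds):
    """Completion tokens per wall-clock second; 0.0 when no time elapsed."""
    if elapsed_seconds <= 0:
        return 0.0
    return completion_tokens / elapsed_seconds


def stats_row(index, elapsed_seconds, prompt_tokens, completion_tokens):
    """Format one Markdown table row for a single query."""
    speed = tokens_per_second(completion_tokens, elapsed_seconds)
    return (f"| {index} | {elapsed_seconds:.2f} | {prompt_tokens} "
            f"| {completion_tokens} | {speed:.1f} |")


print("| # | Time (s) | Prompt tok | Completion tok | tok/s |")
print("|---|---|---|---|---|")
print(stats_row(1, 2.0, 10, 50))
```
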
### Reasoning detection

Reasoning content is extracted automatically from:

- The `reasoning_content` field on the message (DeepSeek API style)
- `<think>...</think>` tags in the response content (DeepSeek R1 / QwQ open-source style)

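The two detection paths listed above might look like this in code. It is a sketch under assumptions: the message is modelled as a plain dict, `split_reasoning` is a hypothetical name, and giving `reasoning_content` priority over inline tags is a guess at the precedence.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning, response) for a chat message given as a dict."""
    content = message.get("content") or ""
    # Path 1: a dedicated reasoning_content field on the message.
    reasoning = message.get("reasoning_content")
    if reasoning:
        return reasoning.strip(), content.strip()
    # Path 2: inline <think>...</think> tags inside the content.
    parts = THINK_RE.findall(content)
    if parts:
        reasoning = "\n\n".join(p.strip() for p in parts)
        return reasoning, THINK_RE.sub("", content).strip()
    # No reasoning found: the whole content is the response.
    return None, content.strip()
```

When no reasoning is found the first element is `None`, which gives the caller a simple signal to omit the **Reasoning** section.
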
## Execution order

```
for each config file:
    for each model:
        for each query → POST to API, wait for response
        write <config_stem>/<model>.md in CWD
```

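The same loop order, written as a Python sketch. The `run_all` function, the `send_query(model, query)` callback, and the per-query section layout are all hypothetical stand-ins. The real tool issues an HTTP POST where `send_query` is called.

```python
from pathlib import Path


def run_all(configs, send_query, out_root="."):
    """Run every query for every model, config by config.

    configs maps a config filename to its parsed dict;
    send_query(model, query) returns the response text.
    """
    written = []
    for config_file, config in configs.items():
        out_dir = Path(out_root) / Path(config_file).stem
        out_dir.mkdir(parents=True, exist_ok=True)
        for model in config["models"]:
            sections = []
            for query in config["queries"]:
                # Sequential: each call blocks until its response arrives.
                answer = send_query(model, query)
                sections.append(f"## {query}\n\n{answer}\n")
            # One file per model, written after all of its queries finish.
            out_file = out_dir / f"{model}.md"
            out_file.write_text("\n".join(sections), encoding="utf-8")
            written.append(out_file)
    return written
```

Because the file is written only after a model's last query returns, an interrupted run in this sketch leaves no partial file for the model in progress.
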