# CLAUDE.md — llmqt

## Project overview

`llmqt` (LLM Query Tester) is a single-file Python CLI that batch-tests multiple LLMs
against a set of queries. Results are written as one Markdown file per model, with
per-query stats and an optional reasoning section for each query.

## Structure

```
llmqt/
  llmqt.py                  # entire implementation — single module
  pyproject.toml            # build/install config; declares `llmqt` console script
  example_test.yaml         # MUST be kept up to date with every config format change
  example_system_prompt.md  # system prompt used by example_test.yaml
  README.md
  CLAUDE.md
  .gitignore
```

## Installation

```bash
pip install -e .
```

This registers the `llmqt` console script declared in `pyproject.toml`, so the command works from any directory.

## CLI signature

```
llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]
```

- **First argument**: path to a `.md` file containing the system prompt (resolved from CWD)
- **Remaining arguments**: one or more test config files (YAML or JSON)
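
The signature above can be sketched with `argparse` as follows. This is a minimal illustration only: the function name `parse_args` and the use of `argparse` are assumptions, and the actual parsing code in `llmqt.py` may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse `llmqt <system_prompt.md> <config1.yaml> [config2.yaml ...]`."""
    parser = argparse.ArgumentParser(prog="llmqt")
    # First positional argument: the system prompt file.
    parser.add_argument("system_prompt", help="path to the .md system prompt")
    # One or more config files; nargs="+" rejects an empty list.
    parser.add_argument("configs", nargs="+", help="YAML or JSON test configs")
    return parser.parse_args(argv)


args = parse_args(["prompt.md", "test1.yaml", "test2.json"])
print(args.system_prompt, args.configs)
```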

## Environment variables

| Variable         | Required | Purpose                                              |
|------------------|----------|------------------------------------------------------|
| `OPENAI_API_KEY` | Yes      | API key                                              |
| `OPENAI_API_BASE`| No       | Custom base URL for OpenAI-compatible endpoints      |

`OPENAI_BASE_URL` is also accepted as an alias for `OPENAI_API_BASE`.
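
One possible way to resolve these variables is sketched below. The helper name `resolve_api_settings` is hypothetical, and the precedence of `OPENAI_API_BASE` over `OPENAI_BASE_URL` when both are set is an assumption, not something this document specifies.

```python
import os


def resolve_api_settings(env=None):
    """Return (api_key, base_url); base_url is None when no override is set."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise SystemExit("OPENAI_API_KEY is not set")
    # Assumption: OPENAI_API_BASE wins if both variables are present.
    base_url = env.get("OPENAI_API_BASE") or env.get("OPENAI_BASE_URL")
    return api_key, base_url
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.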

## Config file format (YAML or JSON)

**IMPORTANT: whenever the config format changes, update `example_test.yaml` to reflect it.**

The system prompt is **not** part of the config file — it is passed as the first CLI argument.

### YAML example

```yaml
models:
  - gpt-4o-mini
  - gpt-4o

queries:
  - "First query text"
  - "Second query text"
```

### JSON equivalent

```json
{
  "models": ["gpt-4o-mini", "gpt-4o"],
  "queries": ["First query text", "Second query text"]
}
```

### Field reference

| Field     | Type            | Description                                          |
|-----------|-----------------|------------------------------------------------------|
| `models`  | list of strings | Model names; any OpenAI-compatible identifier        |
| `queries` | list of strings | Queries sent to each model in listed order           |
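
A loader for this format might look like the sketch below. The names `parse_config` and `load_config` are hypothetical, and choosing the parser by file extension is an assumption; the validation simply mirrors the field reference above.

```python
import json
from pathlib import Path


def parse_config(text, suffix):
    """Parse config text as JSON or YAML and check the two documented fields."""
    if suffix.lower() == ".json":
        data = json.loads(text)
    else:
        import yaml  # lazy import, so JSON-only runs do not need PyYAML
        data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("config must be a mapping with 'models' and 'queries'")
    for field in ("models", "queries"):
        value = data.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{field}' must be a list of strings")
    return data


def load_config(path):
    """Read a config file and parse it according to its extension."""
    path = Path(path)
    return parse_config(path.read_text(encoding="utf-8"), path.suffix)
```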

## Execution logic

```
for each config file:
  for each model:
    for each query:
      POST to API (with timing), wait for response
    write <config_stem>/<model_name>.md  (in CWD)
```

Output directory is always relative to the **current working directory**, not the config file
location. This lets the user run `llmqt ~/configs/prompt.md ~/configs/test1.yaml` from any
writable directory and have outputs land there.
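
The path rule can be illustrated with `pathlib`. This is a sketch under assumptions: `output_path` is a hypothetical helper, and it expects a model name that has already been made filesystem-safe.

```python
from pathlib import Path


def output_path(config_path, safe_model_name, cwd=None):
    """Build <cwd>/<config_stem>/<model_name>.md, ignoring the config's own directory."""
    base = Path.cwd() if cwd is None else Path(cwd)
    # Only the stem of the config file is used: "test1.yaml" -> "test1".
    return base / Path(config_path).stem / f"{safe_model_name}.md"


target = output_path("/home/user/configs/test1.yaml", "gpt-4o", cwd="/tmp/run")
# The parent directory may not exist yet; a writer would create it first, e.g.
# target.parent.mkdir(parents=True, exist_ok=True)
print(target)
```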

## Filename sanitization

Model names are sanitized for filesystem safety: every character other than an ASCII letter,
a digit, `.`, `_`, `-`, or a space is replaced with `_`. E.g. `anthropic/claude-3` → `anthropic_claude-3.md`.
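
As a sketch, the rule maps to a single regular-expression substitution. The function name `sanitize_model_name` is hypothetical. Note that the hyphen is placed last in the character class so the regex engine treats it as a literal rather than as a range operator.

```python
import re

# Anything that is not a letter, digit, '.', '_', space, or '-' is unsafe.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._ -]")


def sanitize_model_name(name):
    """Replace every unsafe character with an underscore."""
    return _UNSAFE_CHARS.sub("_", name)


print(sanitize_model_name("anthropic/claude-3") + ".md")  # anthropic_claude-3.md
```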

## Reasoning detection

Reasoning is detected by checking the following, in this order:
1. `message.reasoning_content` attribute (DeepSeek API / some OpenAI-compatible endpoints)
2. `<think>...</think>` tags in the response content (DeepSeek R1, QwQ open-source models)

If reasoning is found, it is stripped from the answer and rendered in a separate section.
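
The two checks could be implemented roughly as below. This is a sketch, not the code in `llmqt.py`: `split_reasoning` is a hypothetical name, the message is assumed to expose `content` and (optionally) `reasoning_content` attributes, and only the first `<think>` block is extracted.

```python
import re
from types import SimpleNamespace

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message):
    """Return (reasoning or None, answer) for a chat completion message."""
    content = getattr(message, "content", None) or ""
    # Check 1: a dedicated attribute supplied by the endpoint.
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning:
        return reasoning.strip(), content.strip()
    # Check 2: inline <think>...</think> tags inside the content itself.
    match = _THINK_RE.search(content)
    if match:
        answer = content[:match.start()] + content[match.end():]
        return match.group(1).strip(), answer.strip()
    return None, content.strip()


# Stand-in message object for illustration; a real client returns its own type.
demo = SimpleNamespace(content="<think>2 + 2 = 4</think>The answer is 4.")
reasoning, answer = split_reasoning(demo)
```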

## Output format per model file

```markdown
# <model name>

**Config:** `test1.yaml`

## Statistics

| Query | Elapsed | Prompt tok | Completion tok | Total tok | tok/s |
|-------|---------|------------|----------------|-----------|-------|
| 1     | 1.2s    | 45         | 120            | 165       | 100.0 |
| Total | 1.2s    | 45         | 120            | 165       | 100.0 |

---

## Query 1

> <query text>

*1.2s · 120 completion tokens · 100.0 tok/s*

### Reasoning        ← only present when reasoning was detected

<reasoning text>

### Response

<answer text>

---
```
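
The sample numbers above are consistent with `tok/s` being completion tokens divided by elapsed seconds (120 / 1.2 = 100.0). Assuming that definition, one row of the statistics table could be rendered as in this sketch; both function names are hypothetical, and column padding is omitted.

```python
def tokens_per_second(completion_tokens, elapsed_s):
    """Completion throughput; returns 0.0 when no time was measured."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0


def stats_row(label, elapsed_s, prompt_tok, completion_tok):
    """Format one Markdown table row for the Statistics section."""
    total = prompt_tok + completion_tok
    rate = tokens_per_second(completion_tok, elapsed_s)
    return (
        f"| {label} | {elapsed_s:.1f}s | {prompt_tok} | "
        f"{completion_tok} | {total} | {rate:.1f} |"
    )


print(stats_row(1, 1.2, 45, 120))
```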

## Dependencies

- `openai >= 1.0.0` — API client
- `pyyaml >= 6.0` — YAML parsing (imported lazily; JSON works without it)