```mermaid
flowchart TD
    A([document directory]) --> S[Scan & chunk all files\nSQLite state init]
    S --> R1
    subgraph Round1["Round 1 - Discovery"]
        R1[Build JSONL\nlanguage + schema per doc]
        R1 -->|submit batch job| API1[Batch API]
        API1 -->|poll until complete| P1[Parse responses\nstore language + schema]
    end
    P1 --> R2
    subgraph Round2["Round 2 - Processing"]
        R2[Build JSONL\ntitle + summary + rephrase + extraction per chunk]
        R2 -->|submit batch job| API2[Batch API]
        API2 -->|poll until complete| P2[Parse responses\nstore per-chunk graphs]
    end
    P2 --> R3
    subgraph Round3["Round 3 - Finalisation"]
        R3[Build JSONL\nconsolidation summaries + community labels]
        R3 -->|submit batch job| API3[Batch API]
        API3 -->|poll until complete| P3[Parse responses\nbuild final graph]
    end
    P3 --> OUT[Write graph.json + index.html\nper document]
    OUT -->|optional| R4[Round 4 — Cross-file consolidation]
```
# Batch processing

Both OpenAI and Google Gemini provide a Batch API that processes requests asynchronously at roughly 50 % of real-time pricing. Knwler ships a dedicated batch processor for each provider. Point either one at a directory of documents and it runs the complete knowledge-graph pipeline — chunking, schema discovery, extraction, consolidation, community labelling — and writes per-document `graph.json` + `index.html` files when done.
| | OpenAI Batch | Gemini Batch |
|---|---|---|
| Command | `batch-openai run` | `batch-gemini run` |
| API key env var | `OPENAI_API_KEY` | `GEMINI_API_KEY` / `GOOGLE_API_KEY` |
| State database | `batch.db` | `batch_gemini.db` |
| Extra dependency | — | `pip install google-genai` |
| Cost vs real-time | ~50 % | ~50 % |
| SLA | 24 h | 24 h |
## How it works
Both processors share the same three-round architecture. All LLM calls within a round are batched into a single API job, which is submitted and polled until complete before the next round begins.
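Each batched round is ultimately a JSONL file with one request per line. The sketch below shows the OpenAI Batch API request shape (`custom_id`, `method`, `url`, `body`); the `custom_id` scheme and the prompt contents are illustrative, not Knwler's actual internal schema.

```python
import json

def build_batch_line(custom_id: str, model: str, prompt: str) -> str:
    """Serialise one chat request in OpenAI Batch API JSONL format."""
    request = {
        "custom_id": custom_id,          # echoed back in the response file
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(request)

# One line per request; the whole round becomes a single uploaded file.
jsonl = "\n".join(
    build_batch_line(f"doc-{i}", "gpt-4o-mini", f"Detect the language of chunk {i}")
    for i in range(3)
)
```

The `custom_id` is what lets the processor match each asynchronous response back to the document or chunk that produced the request.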
Each round is resumable — if the process is interrupted, re-running the same command picks up exactly where it left off using the SQLite state database.
## OpenAI Batch Processing

### Setup
```bash
export OPENAI_API_KEY="your-key-here"
```

No extra dependencies are needed beyond knwler’s standard install.
### Commands

Run (or resume) the full pipeline:

```bash
python main.py batch-openai run \
  --input ./documents \
  --output ./results
```

Check the status of a running or completed pipeline:

```bash
python main.py batch-openai status \
  --input ./documents \
  --output ./results
```

Run cross-file consolidation on an already-processed output directory:

```bash
python main.py batch-openai consolidate \
  --input ./documents \
  --output ./results
```

### Options
| Flag | Default | Description |
|---|---|---|
| `--input` / `-i` | — | Directory of source documents |
| `--output` / `-o` | — | Output directory |
| `--discovery-model` | `gpt-4o-mini` | Model for schema/language discovery |
| `--extraction-model` | `gpt-4o-mini` | Model for extraction and consolidation |
| `--template` | `default` | HTML report template |
| `--consolidate` | off | Also run a 4th round to merge all document graphs |
### Examples

```bash
# Use larger models for discovery and extraction
python main.py batch-openai run \
  -i ./docs -o ./out \
  --discovery-model gpt-4o \
  --extraction-model gpt-4o-mini

# Process and consolidate all documents in one go
python main.py batch-openai run \
  -i ./pdfs -o ./batching \
  --consolidate

# Consolidate after the fact
python main.py batch-openai consolidate \
  -i ./pdfs -o ./batching
```

### API limits
| Limit | Value |
|---|---|
| Requests per batch | 50 000 |
| Batch file size | 200 MB |
| Token budget per batch | 2 000 000 (with 10 % safety buffer) |
OpenAI commits to completing batch jobs within 24 hours; in practice most finish in minutes.
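When a round exceeds these limits, the requests have to be split into multiple batch jobs. A minimal sketch of that packing logic, using the limits from the table above (the greedy helper itself is illustrative, not Knwler's actual code):

```python
def split_into_batches(requests, max_requests=50_000,
                       token_budget=2_000_000, safety=0.10):
    """Greedily pack (request_id, token_count) pairs into batches that stay
    under the per-batch request cap and the token budget minus a 10 % buffer."""
    effective_budget = int(token_budget * (1 - safety))
    batches, current, current_tokens = [], [], 0
    for req_id, tokens in requests:
        # Start a new batch when adding this request would break a limit.
        if current and (len(current) >= max_requests
                        or current_tokens + tokens > effective_budget):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req_id)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

# Five 600k-token requests fit 3 per batch under the 1.8M effective budget.
out = split_into_batches([(f"r{i}", 600_000) for i in range(5)])
```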
## Gemini Batch Processing

### Setup
Install the Google AI SDK:

```bash
pip install google-genai
```

Export your API key:

```bash
export GEMINI_API_KEY="your-key-here"
# or
export GOOGLE_API_KEY="your-key-here"
```

### Commands
Run (or resume) the full pipeline:

```bash
python main.py batch-gemini run \
  --input ./documents \
  --output ./results
```

Check the status of a running or completed pipeline:

```bash
python main.py batch-gemini status \
  --input ./documents \
  --output ./results
```

Run cross-file consolidation on an already-processed output directory:

```bash
python main.py batch-gemini consolidate \
  --input ./documents \
  --output ./results
```

### Options
| Flag | Default | Description |
|---|---|---|
| `--input` / `-i` | — | Directory of source documents |
| `--output` / `-o` | — | Output directory |
| `--discovery-model` | `gemini-3.1-flash-lite-preview` | Model for schema/language discovery |
| `--extraction-model` | `gemini-3.1-flash-lite-preview` | Model for extraction and consolidation |
| `--template` | `default` | HTML report template |
| `--consolidate` | off | Also run a 4th round to merge all document graphs |
### Examples

```bash
# Use a larger model
python main.py batch-gemini run \
  -i ./docs -o ./out \
  --discovery-model gemini-3-flash-preview \
  --extraction-model gemini-3-flash-preview

# Process and consolidate in one shot
python main.py batch-gemini run \
  -i ./pdfs -o ./batching \
  --consolidate

# Consolidate a completed run
python main.py batch-gemini consolidate \
  -i ./pdfs -o ./batching
```

### How Gemini batches are submitted
The Gemini processor uses the `google-genai` SDK and the Gemini Batch API. For each round it:

1. Serialises all prompts as a JSONL file (one JSON object per request).
2. Uploads the file via the Gemini File API.
3. Creates a batch job referencing the uploaded file.
4. Polls with exponential backoff (initial 30 s, max 5 min) until the job reaches a terminal state.
5. Downloads and parses the output JSONL.

Terminal states: `JOB_STATE_SUCCEEDED`, `JOB_STATE_FAILED`, `JOB_STATE_CANCELLED`, `JOB_STATE_EXPIRED`.
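The polling step can be sketched as follows. `get_state` stands in for whatever call fetches the current job state (with the real SDK, something like reading the state off `client.batches.get(name=...)`); the helper names here are illustrative.

```python
import itertools

TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED", "JOB_STATE_EXPIRED",
}

def poll_delays(initial=30.0, maximum=300.0, factor=2.0):
    """Successive wait times: exponential backoff from 30 s, capped at 5 min."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)

def wait_for_job(get_state, sleep):
    """Poll `get_state()` until it returns a terminal state string,
    sleeping between polls according to the backoff schedule."""
    for delay in poll_delays():
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(delay)
```

Capping the delay at five minutes keeps long-running jobs from being polled too rarely while still backing off quickly for the common fast-completion case.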
### API limits
| Limit | Value |
|---|---|
| Batch file size | 2 GB |
## Supported file types

Both processors accept: `.pdf`, `.txt`, `.md`, `.text`, `.markdown`
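Discovering those files is a simple directory walk. A sketch (the function name is hypothetical, and whether Knwler matches extensions case-insensitively or recurses into subdirectories is an assumption):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".text", ".markdown"}

def find_documents(root):
    """Recursively collect supported document files under `root`,
    sorted for a deterministic processing order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```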
## State database & resumability

Each processor writes a SQLite database to the output directory (`batch.db` for OpenAI, `batch_gemini.db` for Gemini). It tracks:
- Per-document: text content, chunks, language, schema, per-chunk extraction results, consolidated graph.
- Per-round: batch job ID / job name, status, request count, timestamps.
If the process is killed or a round fails, simply re-run the same command — it will skip completed rounds and resume from the first incomplete one.
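The resume logic reduces to "find the first round not marked complete". A minimal sketch with an illustrative schema (the real `batch.db` layout is internal to Knwler and almost certainly richer than this):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS rounds (
    round_no INTEGER PRIMARY KEY,
    job_id   TEXT,
    status   TEXT NOT NULL DEFAULT 'pending'  -- pending | running | complete
)"""

def first_incomplete_round(db, total_rounds=3):
    """Return the first round not yet marked complete, or None if all are done."""
    db.execute(SCHEMA)
    db.executemany(
        "INSERT OR IGNORE INTO rounds (round_no) VALUES (?)",
        [(n,) for n in range(1, total_rounds + 1)],
    )
    row = db.execute(
        "SELECT MIN(round_no) FROM rounds WHERE status != 'complete'"
    ).fetchone()
    return row[0]

def mark_complete(db, round_no, job_id):
    """Record a finished round together with its provider-side batch job ID."""
    db.execute(
        "UPDATE rounds SET status = 'complete', job_id = ? WHERE round_no = ?",
        (job_id, round_no),
    )
```

Because completed rounds are skipped on restart, an interrupted run never re-submits (or re-pays for) a batch job that already finished.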
> [!IMPORTANT]
> The database is designed for crash recovery, not incremental updates. If you add new documents or want to reprocess from scratch, delete the output directory first.
## Cross-file consolidation (Round 4)

When `--consolidate` is passed (or the `consolidate` sub-command is used), a fourth batch round merges all per-document graphs into a single `consolidated_graph.json`:
- All entity and relation descriptions across all documents are aggregated.
- Duplicate descriptions are summarised via a batch LLM call.
- A final merged graph with community detection is written to the output directory.
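The aggregation step (before any LLM summarisation) can be sketched as a plain merge. The key names `entities`, `name`, and `descriptions` are assumptions about the `graph.json` layout, not its documented schema:

```python
from collections import defaultdict

def aggregate_descriptions(graphs):
    """Merge per-document graphs: collect every description seen for each
    entity name, dropping exact duplicates while preserving order."""
    merged = defaultdict(list)
    for graph in graphs:
        for entity in graph["entities"]:
            for desc in entity["descriptions"]:
                if desc not in merged[entity["name"]]:
                    merged[entity["name"]].append(desc)
    return dict(merged)
```

Entities that accumulate many distinct descriptions are exactly the ones whose description lists get summarised by the batched LLM call in this round.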
You can also consolidate after the fact using knwler’s standard consolidation command:
```bash
python main.py consolidate --dir ./results --output ./merged
```

## Choosing between real-time and batch
| Scenario | Recommendation |
|---|---|
| Single document, interactive use | `extract` (real-time) |
| A few documents | `extract` (real-time, parallel) |
| Tens or hundreds of documents | `batch-openai run` or `batch-gemini run` |
| Cost is the primary concern | Batch (either provider) |
| Lowest latency required | Real-time |