# Batch processing

Both OpenAI and Google Gemini provide a Batch API that processes requests asynchronously at roughly 50% of real-time pricing. Knwler ships a dedicated batch processor for each provider. Give either one a directory of documents and it will run the complete knowledge-graph pipeline (chunking, schema discovery, extraction, consolidation, community labelling) and write per-document `graph.json` and `index.html` files when done.

| | OpenAI Batch | Gemini Batch |
|---|---|---|
| Command | `batch-openai run` | `batch-gemini run` |
| API key env var | `OPENAI_API_KEY` | `GEMINI_API_KEY` / `GOOGLE_API_KEY` |
| State database | `batch.db` | `batch_gemini.db` |
| Extra dependency | none | `pip install google-genai` |
| Cost vs real-time | ~50% | ~50% |
| SLA | 24 h | 24 h |

## How it works

Both processors share the same three-round architecture. All LLM calls within a round are batched into a single API job, which is submitted and polled until complete before the next round begins.

```mermaid
flowchart TD
    A([document directory]) --> S[Scan and chunk all files\nSQLite state init]

    S --> R1

    subgraph "Round 1 - Discovery"
        R1[Build JSONL\nlanguage + schema per doc]
        R1 -->|submit batch job| API1[Batch API]
        API1 -->|poll until complete| P1[Parse responses\nstore language + schema]
    end

    P1 --> R2

    subgraph "Round 2 - Processing"
        R2[Build JSONL\ntitle + summary + rephrase + extraction per chunk]
        R2 -->|submit batch job| API2[Batch API]
        API2 -->|poll until complete| P2[Parse responses\nstore per-chunk graphs]
    end

    P2 --> R3

    subgraph "Round 3 - Finalisation"
        R3[Build JSONL\nconsolidation summaries + community labels]
        R3 -->|submit batch job| API3[Batch API]
        API3 -->|poll until complete| P3[Parse responses\nbuild final graph]
    end

    P3 --> OUT[Write graph.json + index.html\nper document]
    OUT -->|optional| R4[Round 4 - Cross-file consolidation]
```

Each round is resumable — if the process is interrupted, re-running the same command picks up exactly where it left off using the SQLite state database.
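Conceptually, each round in the diagram reduces to the same submit-and-poll loop. A minimal sketch of that loop follows; the function names `build_jsonl`, `submit`, `poll`, and `parse` are placeholders for illustration, not knwler's actual internal API:

```python
import time

def run_round(build_jsonl, submit, poll, parse, interval=30):
    """Run one batch round: build requests, submit one job, poll, parse."""
    requests = build_jsonl()      # one JSON object per LLM call in this round
    job_id = submit(requests)     # all calls go into a single batch job
    while True:
        status = poll(job_id)
        if status in ("completed", "failed", "cancelled", "expired"):
            break
        time.sleep(interval)
    if status != "completed":
        raise RuntimeError(f"batch job {job_id} ended in state {status!r}")
    return parse(job_id)          # results are stored before the next round starts
```

Because each round only starts after the previous one is fully parsed and persisted, a crash between rounds loses no work.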


## OpenAI Batch Processing

### Setup

```bash
export OPENAI_API_KEY="your-key-here"
```

No extra dependencies are needed beyond knwler’s standard install.

### Commands

Run (or resume) the full pipeline:

```bash
python main.py batch-openai run \
    --input ./documents \
    --output ./results
```

Check the status of a running or completed pipeline:

```bash
python main.py batch-openai status \
    --input ./documents \
    --output ./results
```

Run cross-file consolidation on an already-processed output directory:

```bash
python main.py batch-openai consolidate \
    --input ./documents \
    --output ./results
```

### Options

| Flag | Default | Description |
|---|---|---|
| `--input` / `-i` | | Directory of source documents |
| `--output` / `-o` | | Output directory |
| `--discovery-model` | `gpt-4o-mini` | Model for schema/language discovery |
| `--extraction-model` | `gpt-4o-mini` | Model for extraction and consolidation |
| `--template` | `default` | HTML report template |
| `--consolidate` | off | Also run a 4th round to merge all document graphs |

### Examples

```bash
# Use larger models for discovery and extraction
python main.py batch-openai run \
    -i ./docs -o ./out \
    --discovery-model gpt-4o \
    --extraction-model gpt-4o-mini

# Process and consolidate all documents in one go
python main.py batch-openai run \
    -i ./pdfs -o ./batching \
    --consolidate

# Consolidate after the fact
python main.py batch-openai consolidate \
    -i ./pdfs -o ./batching
```

### API limits

| Limit | Value |
|---|---|
| Requests per batch | 50,000 |
| Batch file size | 200 MB |
| Token budget per batch | 2,000,000 (with 10% safety buffer) |

OpenAI commits to completing batch jobs within 24 hours; in practice most finish in minutes.
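Corpora that exceed these limits must be split across several batch jobs. A sketch of greedy packing under the request-count and token limits, applying the 10% safety buffer from the table above (illustrative only; the real processor also enforces the 200 MB file-size cap):

```python
def split_into_batches(requests, max_requests=50_000,
                       token_budget=2_000_000, safety=0.10):
    """Greedily pack (request, token_count) pairs into batches that
    respect the per-batch request-count and token limits."""
    budget = int(token_budget * (1 - safety))  # apply the safety buffer
    batches, current, used = [], [], 0
    for req, tokens in requests:
        # Start a new batch if adding this request would exceed a limit.
        if current and (len(current) >= max_requests or used + tokens > budget):
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

With the buffer applied, three requests of one million tokens each would land in three separate jobs, since any two together exceed the effective 1.8-million-token budget.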


## Gemini Batch Processing

### Setup

Install the Google AI SDK:

```bash
pip install google-genai
```

Export your API key:

```bash
export GEMINI_API_KEY="your-key-here"
# or
export GOOGLE_API_KEY="your-key-here"
```

### Commands

Run (or resume) the full pipeline:

```bash
python main.py batch-gemini run \
    --input ./documents \
    --output ./results
```

Check the status of a running or completed pipeline:

```bash
python main.py batch-gemini status \
    --input ./documents \
    --output ./results
```

Run cross-file consolidation on an already-processed output directory:

```bash
python main.py batch-gemini consolidate \
    --input ./documents \
    --output ./results
```

### Options

| Flag | Default | Description |
|---|---|---|
| `--input` / `-i` | | Directory of source documents |
| `--output` / `-o` | | Output directory |
| `--discovery-model` | `gemini-3.1-flash-lite-preview` | Model for schema/language discovery |
| `--extraction-model` | `gemini-3.1-flash-lite-preview` | Model for extraction and consolidation |
| `--template` | `default` | HTML report template |
| `--consolidate` | off | Also run a 4th round to merge all document graphs |

### Examples

```bash
# Use a larger model
python main.py batch-gemini run \
    -i ./docs -o ./out \
    --discovery-model gemini-3-flash-preview \
    --extraction-model gemini-3-flash-preview

# Process and consolidate in one shot
python main.py batch-gemini run \
    -i ./pdfs -o ./batching \
    --consolidate

# Consolidate a completed run
python main.py batch-gemini consolidate \
    -i ./pdfs -o ./batching
```

### How Gemini batches are submitted

The Gemini processor uses the `google-genai` SDK and the Gemini Batch API. For each round it:

1. Serialises all prompts to a JSONL file (one JSON object per request).
2. Uploads the file via the Gemini File API.
3. Creates a batch job referencing the uploaded file.
4. Polls with exponential backoff (initial 30 s, max 5 min) until the job reaches a terminal state.
5. Downloads and parses the output JSONL.

Terminal states: `JOB_STATE_SUCCEEDED`, `JOB_STATE_FAILED`, `JOB_STATE_CANCELLED`, `JOB_STATE_EXPIRED`.
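The backoff in step 4 can be sketched as follows, with `poll_job` standing in for the SDK's job-status call (the loop shape is illustrative, not knwler's exact code):

```python
import time

TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED", "JOB_STATE_EXPIRED",
}

def wait_for_job(poll_job, initial=30.0, maximum=300.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state, doubling the delay
    after each attempt from `initial` (30 s) up to `maximum` (5 min)."""
    delay = initial
    while True:
        state = poll_job()
        if state in TERMINAL_STATES:
            return state
        sleep(delay)
        delay = min(delay * 2, maximum)
```

Doubling with a cap keeps early polls responsive for jobs that finish in minutes while avoiding hammering the API during multi-hour runs.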

### API limits

| Limit | Value |
|---|---|
| Batch file size | 2 GB |

## Supported file types

Both processors accept: `.pdf`, `.txt`, `.md`, `.text`, `.markdown`


## State database & resumability

Each processor writes a SQLite database to the output directory (`batch.db` for OpenAI, `batch_gemini.db` for Gemini). It tracks:

- Per-document: text content, chunks, language, schema, per-chunk extraction results, consolidated graph.
- Per-round: batch job ID / job name, status, request count, timestamps.

If the process is killed or a round fails, simply re-run the same command — it will skip completed rounds and resume from the first incomplete one.

> [!IMPORTANT]
> The database is designed for crash recovery, not incremental updates. If you add new documents or want to reprocess from scratch, delete the output directory first.


## Cross-file consolidation (Round 4)

When `--consolidate` is passed (or the `consolidate` sub-command is used), a fourth batch round merges all per-document graphs into a single `consolidated_graph.json`:

1. All entity and relation descriptions across all documents are aggregated.
2. Duplicate descriptions are summarised via a batch LLM call.
3. A final merged graph with community detection is written to the output directory.
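Steps 1 and 2 can be sketched as grouping descriptions by entity name, then selecting the entities that need a summarisation call (the graph structure shown here is illustrative, not knwler's exact on-disk format):

```python
from collections import defaultdict

def aggregate_descriptions(graphs):
    """Collect every description for each entity name across all
    per-document graphs."""
    merged = defaultdict(list)
    for graph in graphs:
        for entity in graph["entities"]:
            merged[entity["name"]].append(entity["description"])
    return dict(merged)

def needs_summary(merged):
    """Entities that appear in more than one document accumulate
    multiple descriptions and get a batched LLM summarisation call."""
    return {name: descs for name, descs in merged.items() if len(descs) > 1}
```

Only the entities with multiple descriptions generate LLM requests in Round 4, which keeps the consolidation batch small relative to the extraction rounds.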

You can also consolidate after the fact using knwler’s standard consolidation command:

```bash
python main.py consolidate --dir ./results --output ./merged
```

## Choosing between real-time and batch

| Scenario | Recommendation |
|---|---|
| Single document, interactive use | `extract` (real-time) |
| A few documents | `extract` (real-time, parallel) |
| Tens or hundreds of documents | `batch-openai run` or `batch-gemini run` |
| Cost is the primary concern | Batch (either provider) |
| Lowest latency required | Real-time |