Under the hood

This page walks through the full processing pipeline knwler runs when you give it a PDF or an HTML page — from raw bytes to an interactive knowledge-graph report.


Overview

flowchart TD
    A([PDF / HTML / URL]) --> B[Fetch & Cache]
    B --> C[Text Chunking]
    C --> D[Language Detection]
    C --> E[Schema Discovery]
    D & E --> F[Parallel Chunk Extraction]
    F --> G[Consolidation]
    G --> H[Enrichments\ntitle · summary · rephrase]
    G --> I[Community Detection]
    H & I --> J[graph.json]
    J --> K[HTML Report\nindex.html]
    J --> L[GML Export\ngraph.gml]


1. Entry Point

Processing is triggered from the CLI:

python main.py extract -f document.pdf --backend openai --max-tokens 400

main.py routes to knwler/cli.py, which dispatches to knwler/cli_extract.py. The programmatic API entry point is extract_file() in knwler/api.py.


2. Fetch & Collect

Module: knwler/collect/webpage.py → WebpageCollector

The source is normalised and fetched depending on its type:

Source                 Method                                    Output
Local PDF              fitz.open() text extraction               Plain text
Local text / Markdown  Path.read_text()                          Plain text
URL → PDF              HTTP GET, bytes cached                    Bytes
URL → HTML page        HTTP GET → BeautifulSoup → markdownify    Markdown text
Wikipedia URL          WikipediaCollector                        Markdown + metadata

Every fetch is hashed and stored under ~/.knwler/cache/ so repeated runs skip the network entirely.

flowchart LR
    subgraph Input
        A1[Local PDF]
        A2[Local MD/TXT]
        A3[Web URL]
        A4[Wikipedia URL]
    end
    subgraph WebpageCollector
        B1[fitz PDF extraction]
        B2[Path.read_text]
        B3[HTTP GET → markdownify]
        B4[WikipediaCollector]
        C[(~/.knwler/cache/)]
    end
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    B1 & B2 & B3 & B4 --> C
    C --> D([raw text])
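The cache step above boils down to hashing the source identifier and using the digest as a file name. A minimal sketch of that idea (the on-disk layout and hash choice are assumptions, not the exact implementation):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".knwler" / "cache"  # cache root from this page

def cache_path(source: str) -> Path:
    """Map a URL or file path to a stable cache file name."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return CACHE_DIR / digest

def fetch_cached(source: str, fetch) -> bytes:
    """Return cached bytes if present, otherwise fetch and store."""
    path = cache_path(source)
    if path.exists():
        return path.read_bytes()            # cache hit: no network at all
    data = fetch(source)                    # cache miss: do the real fetch
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Because the key is derived only from the source string, a second run over the same document resolves to the same file and skips the network.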


3. Text Chunking

Module: knwler/chunking.py → chunk_text()

The raw text is tokenised using tiktoken (cl100k_base encoder) and cut into overlapping windows:

  • Window size: config.max_tokens (default 400 tokens)
  • Overlap: config.overlap_tokens (default 50 tokens)
  • Sentence-boundary awareness: breaks at ., !, or ? near the end of each window to avoid mid-sentence splits

flowchart LR
    T([raw text]) --> TK[tiktoken tokenise]
    TK --> W["sliding window\n400 tokens / 50 overlap"]
    W --> SB[sentence boundary\nadjustment]
    SB --> CH(["[chunk_0, chunk_1, …, chunk_n]"])
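The sliding-window arithmetic can be illustrated with a toy tokeniser (a whitespace split stands in for tiktoken's cl100k_base, and the sentence-boundary nudge is omitted; only the window/overlap mechanics are shown):

```python
def chunk_tokens(tokens, window=400, overlap=50):
    """Cut a token list into overlapping windows.

    Stride = window - overlap, so consecutive chunks share `overlap` tokens.
    (Toy version: the real chunker also nudges cut points to . ! ? ends.)
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return chunks

# Toy tokeniser: whitespace split instead of tiktoken
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, window=4, overlap=1)
```

With window=4 and overlap=1 the stride is 3, so each chunk repeats the last token of its predecessor, mirroring the 400/50 defaults at scale.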


4. Language Detection & Schema Discovery

Module: knwler/discovery.py

Two LLM calls run on a small sample of the text before the expensive per-chunk extraction begins.

sequenceDiagram
    participant D as discovery.py
    participant L as LLM
    D->>L: detect_language(sample)
    L-->>D: "en" (ISO 639-1)
    D->>L: discover_schema(sample from start+mid+end)
    L-->>D: Schema { entity_types, relation_types, reasoning }

Schema discovery prompt (condensed):

Analyze this text and identify the best entity types and relation types
for a knowledge graph.

TEXT SAMPLE: [~4000 chars from beginning, middle, end]

Guidelines:
- Use snake_case for type names
- Focus on types that appear multiple times

Return JSON:
{
  "entity_types": ["Person", "Organization", ...],
  "relation_types": ["works_at", "founded_by", ...],
  "reasoning": "Brief explanation"
}

The resulting Schema drives all subsequent extraction — the LLM is told exactly which types to look for.
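The JSON the prompt asks for maps naturally onto a small container type. A sketch of parsing the LLM reply (the field names match the prompt above; the dataclass layout itself is an assumption):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Schema:
    entity_types: list = field(default_factory=list)
    relation_types: list = field(default_factory=list)
    reasoning: str = ""

def parse_schema(raw: str) -> Schema:
    """Parse the JSON object the discovery prompt asks the LLM to return."""
    data = json.loads(raw)
    return Schema(
        entity_types=data.get("entity_types", []),
        relation_types=data.get("relation_types", []),
        reasoning=data.get("reasoning", ""),
    )

reply = '{"entity_types": ["Person"], "relation_types": ["works_at"], "reasoning": "..."}'
schema = parse_schema(reply)
```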


5. Parallel Chunk Extraction

Module: knwler/extraction.py → extract_all()

An async LLM call is made for every chunk; up to config.max_concurrent (default 8) calls run in parallel, gated by a semaphore.

flowchart TD
    CH(["[chunk_0 … chunk_n]"]) --> S{semaphore\nmax 8 concurrent}
    S --> E0[extract chunk_0]
    S --> E1[extract chunk_1]
    S --> EN[extract chunk_n]
    E0 & E1 & EN --> P[(partial.json\nincremental save)]
    P --> R(["[ExtractionResult_0 … ExtractionResult_n]"])

Extraction prompt (condensed):

Extract a knowledge graph from the text below.

ENTITY TYPES: [Person, Organization, ...]
RELATION TYPES: [works_at, founded_by, ...]

RULES:
- Only extract what is explicitly stated in the text
- Each entity: name, type, 1-2 sentence description
- Each relation: source, source_type, target, target_type,
                 type, description, strength (1-10)

TEXT: "..."

Return JSON:
{
  "entities": [{"name": "...", "type": "...", "description": "..."}],
  "relations": [{"source": "...", "source_type": "...",
                 "target": "...", "target_type": "...",
                 "type": "...", "description": "...", "strength": 8}]
}

Each chunk produces an ExtractionResult with its own entity and relation lists. Partial results are saved to output.partial.json during processing so progress survives interruptions.
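The concurrency pattern is plain asyncio: a semaphore caps how many calls are in flight while gather preserves chunk order. A sketch with a stubbed LLM call (the real code also writes the partial file after each completion):

```python
import asyncio

MAX_CONCURRENT = 8  # config.max_concurrent default

async def extract_chunk(idx: int, chunk: str) -> dict:
    """Stand-in for the per-chunk LLM extraction call."""
    await asyncio.sleep(0)            # pretend network latency
    return {"chunk": idx, "entities": [], "relations": []}

async def extract_all(chunks):
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(idx, chunk):
        async with sem:               # at most MAX_CONCURRENT in flight
            return await extract_chunk(idx, chunk)

    tasks = [guarded(i, c) for i, c in enumerate(chunks)]
    return await asyncio.gather(*tasks)   # results stay in chunk order

results = asyncio.run(extract_all(["chunk a", "chunk b", "chunk c"]))
```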

LLM Backends

Backend         Default model                    Endpoint
Ollama          local (e.g. qwen2.5:3b)          localhost:11434
OpenAI          gpt-4o-mini                      api.openai.com
Anthropic       claude-haiku-4-5-20251001        api.anthropic.com
Google Gemini   gemini-3.1-flash-lite-preview    generativelanguage.googleapis.com
GitHub Models   claude-opus-4.5                  models.inference.ai.azure.com
LM Studio       local (user-loaded)              localhost:1234

All LLM responses are cached by a hash of (prompt, model, temperature) — repeated runs with the same content are free.
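A cache key over (prompt, model, temperature) can be built by hashing a canonical serialisation of the triple. A sketch (the exact key format in knwler is an implementation detail):

```python
import hashlib
import json

def llm_cache_key(prompt: str, model: str, temperature: float) -> str:
    """Stable digest over the inputs that determine an LLM response."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,               # canonical ordering → stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs always map to the same key, so a rerun is a cache hit;
# changing any of the three inputs produces a different key.
k1 = llm_cache_key("Extract a graph...", "gpt-4o-mini", 0.0)
k2 = llm_cache_key("Extract a graph...", "gpt-4o-mini", 0.0)
assert k1 == k2
```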


6. Consolidation

Module: knwler/consolidation.py → consolidate_extracted_graphs()

The many per-chunk graphs are merged in four phases:

flowchart TD
    R(["[ExtractionResult_0 … n]"]) --> P1

    subgraph Consolidation
        P1["Phase 1 — Aggregate\ndeduplicate by (name, type)"]
        P2["Phase 2 — Summarise descriptions\nLLM merge, batch 20 items/call"]
        P3["Phase 3 — Build final graph\naverage strength values"]
        P4["Phase 4 — Filter noise\nremove zero-degree / trivial nodes"]
        P1 --> P2 --> P3 --> P4
    end

    P4 --> CG([Consolidated Graph])

Phase            What happens
1 — Aggregate    Entities deduplicated by (name.lower(), type.lower()); all descriptions and chunk IDs collected
2 — Summarise    Entities/relations that appeared in multiple chunks get one merged description via LLM (batched 20/call)
3 — Build        Final entity and relation lists assembled; relation strength averaged across chunks
4 — Filter       Removes stopwords, pure numbers, and entities with no graph connections
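Phase 1's dedupe-and-collect step can be sketched as a dictionary keyed by the lowercased (name, type) pair, accumulating every description and chunk ID for later summarisation (entity dict shape assumed from the extraction prompt):

```python
from collections import defaultdict

def aggregate_entities(results):
    """Merge per-chunk entity mentions keyed by (name.lower(), type.lower())."""
    merged = defaultdict(lambda: {"descriptions": [], "chunk_ids": []})
    for chunk_id, entities in results:
        for ent in entities:
            key = (ent["name"].lower(), ent["type"].lower())
            merged[key]["descriptions"].append(ent["description"])
            merged[key]["chunk_ids"].append(chunk_id)
    return dict(merged)

# Two chunks mention the same entity with different casing:
results = [
    (0, [{"name": "Alice Smith", "type": "Person", "description": "CEO."}]),
    (1, [{"name": "alice smith", "type": "person", "description": "Founder."}]),
]
merged = aggregate_entities(results)
```

Both mentions collapse onto one key, and the collected descriptions become the input for Phase 2's LLM merge.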

7. Enrichments

Module: knwler/extras.py

Three LLM tasks run concurrently after consolidation:

flowchart LR
    CG([Consolidated Graph]) --> T[Title extraction\nfirst 3 chunks]
    CG --> S[Summary extraction\n3–5 sentences]
    CG --> RP[Chunk rephrasing\nplainer language per chunk]
    T & S & RP --> KG([KnowledgeGraph])

  • Title — short document title inferred from the opening chunks.
  • Summary — 3–5 sentence overview of the document.
  • Rephrase — each chunk is rewritten in plainer language; used in the HTML report’s toggle (original ↔︎ simplified).

8. Community Detection & Labeling

Module: knwler/clustering.py → cluster_graph()

flowchart TD
    KG([KnowledgeGraph\nentities + relations]) --> NX[Build NetworkX graph\nnodes=entities · edges=relations\nweight=strength]
    NX --> LV[Louvain algorithm\ncommunity partitioning]
    LV --> CL["Communities\n[set_0, set_1, ..., set_k]"]
    CL --> LL[LLM label each community\n1–3 topics + description]
    LL --> OUT([entities tagged\nwith community_id])

  1. The consolidated graph is loaded into NetworkX (entities = nodes, relations = weighted edges).
  2. The Louvain algorithm (louvain_communities(), seed=0 for reproducibility) partitions nodes into disjoint communities.
  3. For each community, an LLM call returns 1–3 short topic labels and a sentence description.
  4. Each entity gains a community_id attribute.
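Steps 1, 2, and 4 can be sketched directly with NetworkX (toy entities and relations; the LLM labeling of step 3 is omitted):

```python
import networkx as nx

entities = [("Alice Smith", "Person"), ("Tech Corp", "Organization"),
            ("Bob Lee", "Person"), ("Acme Inc", "Organization")]
relations = [("Alice Smith", "Tech Corp", 8), ("Bob Lee", "Acme Inc", 6),
             ("Alice Smith", "Bob Lee", 2)]

# entities = nodes, relations = weighted edges (strength becomes weight)
G = nx.Graph()
for name, etype in entities:
    G.add_node(name, type=etype)
for src, dst, strength in relations:
    G.add_edge(src, dst, weight=strength)

# seed=0 keeps the partition reproducible across runs
communities = nx.community.louvain_communities(G, weight="weight", seed=0)

# tag each entity with its community_id, as the pipeline does
for cid, members in enumerate(communities):
    for name in members:
        G.nodes[name]["community_id"] = cid
```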

9. Output

graph.json schema

{
  "id":       UUID,
  "title":    "...",
  "summary":  "...",
  "url":      "source URL (if fetched)",
  "language": "en",
  "schema": {
    "entity_types":   [...],
    "relation_types": [...],
    "reasoning":      "..."
  },
  "graph": {
    "entities": [
      { "name": "Alice Smith", "type": "Person",
        "description": "...", "chunk_ids": [...], "community_id": 0 }
    ],
    "relations": [
      { "source": "Alice Smith", "source_type": "Person",
        "target": "Tech Corp", "target_type": "Organization",
        "type": "works_at", "description": "...",
        "strength": 8.5, "chunk_ids": [...] }
    ],
    "communities": [
      { "id": 0, "topics": ["Leadership", "Management"],
        "description": "...", "members": ["Alice Smith::Person", ...] }
    ]
  },
  "chunks": [
    { "id": UUID, "chunk_idx": 0,
      "text": "original text",
      "rephrase": "simplified version",
      "entities": [...], "relations": [...] }
  ]
}

Output files

File         Contents
graph.json   Canonical structured knowledge graph
index.html   Self-contained interactive report
graph.gml    NetworkX export for Gephi / yEd
log.html     Rich formatted processing log
log.txt      Plain-text processing log
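The graph.gml export is a direct NetworkX serialisation of the graph section. A sketch of rebuilding a graph from graph.json-shaped data and writing GML (the helper name and inline data are illustrative):

```python
import os
import tempfile
import networkx as nx

def graph_json_to_gml(graph: dict, out_path: str) -> None:
    """Rebuild a NetworkX graph from the 'graph' section and write GML."""
    G = nx.Graph()
    for ent in graph["entities"]:
        G.add_node(ent["name"], type=ent["type"])
    for rel in graph["relations"]:
        G.add_edge(rel["source"], rel["target"],
                   type=rel["type"], weight=rel["strength"])
    nx.write_gml(G, out_path)

# Minimal data shaped like the graph.json schema above
data = {
    "entities": [{"name": "Alice Smith", "type": "Person"},
                 {"name": "Tech Corp", "type": "Organization"}],
    "relations": [{"source": "Alice Smith", "target": "Tech Corp",
                   "type": "works_at", "strength": 8.5}],
}
out_path = os.path.join(tempfile.gettempdir(), "graph.gml")
graph_json_to_gml(data, out_path)
```

The resulting file opens directly in Gephi or yEd, with relation strength preserved as the standard GML edge weight.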

HTML Report

The HTML report (knwler/export.py + Jinja2 templates in knwler/templates/) is fully self-contained — it loads no external resources; all CSS and JavaScript are inlined.

Features:

  • Interactive network visualisation with a degree-threshold slider
  • Entity index with type badges and links to source chunks
  • Community / topic overview with member links
  • Chunk viewer with original ↔︎ simplified toggle
  • Dark / light theme switch
  • Entity names in chunk text auto-link to the entity detail panel


Complete Data Flow

flowchart TD
    A([PDF / HTML / URL]) -->|fetch & cache| B([raw text])
    B -->|tiktoken chunking\n400 tok / 50 overlap| C(["[chunk_0 … chunk_n]"])
    C -->|LLM sample| D([Schema\nentity & relation types])
    C -->|LLM sample| E([Language code])
    D & E & C -->|8 concurrent LLM calls per chunk| F(["[ExtractionResult_0 … n]"])
    F -->|aggregate · LLM summarise · filter| G([Consolidated Graph])
    G -->|LLM: title + summary + rephrase| H([Enriched Graph])
    H -->|NetworkX Louvain + LLM labels| I([Communities])
    H & I -->|assemble| J([graph.json])
    J -->|Jinja2 render| K([index.html])
    J -->|NetworkX write| L([graph.gml])