Under the hood

This page walks through the full processing pipeline knwler runs when you give it a PDF or an HTML page — from raw bytes to an interactive knowledge-graph report.


Overview

flowchart TD
    A([PDF / HTML / URL]) --> B[Fetch & Cache]
    B --> C[Text Chunking]
    C --> D[Language Detection]
    C --> E[Schema Discovery]
    D & E --> F[Parallel Chunk Extraction]
    F --> G[Consolidation]
    G --> H[Enrichments\ntitle · summary · rephrase]
    G --> I[Community Detection]
    H & I --> J[graph.json]
    J --> K[HTML Report\nindex.html]
    J --> L[GML Export\ngraph.gml]


1. Entry Point

Processing is triggered from the CLI:

python main.py extract -f document.pdf --backend openai --max-tokens 400

main.py routes to knwler/cli.py, which dispatches to knwler/cli_extract.py. The programmatic API entry point is extract_file() in knwler/api.py.


2. Fetch & Collect

Module: knwler/collect/webpage.py → WebpageCollector

The source is normalised and fetched depending on its type:

Source                 Method                                    Output
Local PDF              fitz.open() text extraction               Plain text
Local text / Markdown  Path.read_text()                          Plain text
URL → PDF              HTTP GET, bytes cached                    Bytes
URL → HTML page        HTTP GET → BeautifulSoup → markdownify    Markdown text
Wikipedia URL          WikipediaCollector                        Markdown + metadata

Every fetch is hashed and stored under ~/.knwler/cache/ so repeated runs skip the network entirely.

flowchart LR
    subgraph Input
        A1[Local PDF]
        A2[Local MD/TXT]
        A3[Web URL]
        A4[Wikipedia URL]
    end
    subgraph WebpageCollector
        B1[fitz PDF extraction]
        B2[Path.read_text]
        B3[HTTP GET → markdownify]
        B4[WikipediaCollector]
        C[(~/.knwler/cache/)]
    end
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    B1 & B2 & B3 & B4 --> C
    C --> D([raw text])
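The cache step above boils down to hashing the source identifier and using the digest as a file name. A minimal sketch of that idea (the on-disk layout and hash choice are assumptions, not the exact implementation):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".knwler" / "cache"  # cache root from this page

def cache_path(source: str) -> Path:
    """Map a URL or file path to a stable cache file name."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return CACHE_DIR / digest

def fetch_cached(source: str, fetch) -> bytes:
    """Return cached bytes if present, otherwise fetch and store."""
    path = cache_path(source)
    if path.exists():
        return path.read_bytes()            # cache hit: no network at all
    data = fetch(source)                    # cache miss: do the real fetch
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Because the key is derived only from the source string, a second run over the same document resolves to the same file and skips the network.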


3. Text Chunking

Module: knwler/chunking.py → chunk_text()

The raw text is tokenised using tiktoken (cl100k_base encoder) and cut into overlapping windows:

  • Window size: config.max_tokens (default 400 tokens)
  • Overlap: config.overlap_tokens (default 50 tokens)
  • Sentence-boundary awareness: breaks at ., !, or ? near the end of each window to avoid mid-sentence splits

flowchart LR
    T([raw text]) --> TK[tiktoken tokenise]
    TK --> W["sliding window\n400 tokens / 50 overlap"]
    W --> SB[sentence boundary\nadjustment]
    SB --> CH(["[chunk_0, chunk_1, …, chunk_n]"])
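The sliding-window arithmetic can be illustrated with a toy tokeniser (a whitespace split stands in for tiktoken's cl100k_base, and the sentence-boundary nudge is omitted; only the window/overlap mechanics are shown):

```python
def chunk_tokens(tokens, window=400, overlap=50):
    """Cut a token list into overlapping windows.

    Stride = window - overlap, so consecutive chunks share `overlap` tokens.
    (Toy version: the real chunker also nudges cut points to . ! ? ends.)
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return chunks

# Toy tokeniser: whitespace split instead of tiktoken
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, window=4, overlap=1)
```

With window=4 and overlap=1 the stride is 3, so each chunk repeats the last token of its predecessor, mirroring the 400/50 defaults at scale.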


4. Language Detection & Schema Discovery

Module: knwler/discovery.py

Two LLM calls run on a small sample of the text before the expensive per-chunk extraction begins.

sequenceDiagram
    participant D as discovery.py
    participant L as LLM
    D->>L: detect_language(sample)
    L-->>D: "en" (ISO 639-1)
    D->>L: discover_schema(sample from start+mid+end)
    L-->>D: Schema { entity_types, relation_types, reasoning }

Schema discovery prompt (condensed):

Analyze this text and identify the best entity types and relation types
for a knowledge graph.

TEXT SAMPLE: [~4000 chars from beginning, middle, end]

Guidelines:
- Use snake_case for type names
- Focus on types that appear multiple times

Return JSON:
{
  "entity_types": ["Person", "Organization", ...],
  "relation_types": ["works_at", "founded_by", ...],
  "reasoning": "Brief explanation"
}

The resulting Schema drives all subsequent extraction — the LLM is told exactly which types to look for.
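The JSON the prompt asks for maps naturally onto a small container type. A sketch of parsing the LLM reply (the field names match the prompt above; the dataclass layout itself is an assumption):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Schema:
    entity_types: list = field(default_factory=list)
    relation_types: list = field(default_factory=list)
    reasoning: str = ""

def parse_schema(raw: str) -> Schema:
    """Parse the JSON object the discovery prompt asks the LLM to return."""
    data = json.loads(raw)
    return Schema(
        entity_types=data.get("entity_types", []),
        relation_types=data.get("relation_types", []),
        reasoning=data.get("reasoning", ""),
    )

reply = '{"entity_types": ["Person"], "relation_types": ["works_at"], "reasoning": "..."}'
schema = parse_schema(reply)
```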


5. Parallel Chunk Extraction

Module: knwler/extraction.py → extract_all()

An async LLM call is made for every chunk; up to config.max_concurrent (default 8) calls run in parallel, gated by a semaphore.

flowchart TD
    CH(["[chunk_0 … chunk_n]"]) --> S{semaphore\nmax 8 concurrent}
    S --> E0[extract chunk_0]
    S --> E1[extract chunk_1]
    S --> EN[extract chunk_n]
    E0 & E1 & EN --> P[(partial.json\nincremental save)]
    P --> R(["[ExtractionResult_0 … ExtractionResult_n]"])

Extraction prompt (condensed):

Extract a knowledge graph from the text below.

ENTITY TYPES: [Person, Organization, ...]
RELATION TYPES: [works_at, founded_by, ...]

RULES:
- Only extract what is explicitly stated in the text
- Each entity: name, type, 1-2 sentence description
- Each relation: source, source_type, target, target_type,
                 type, description, strength (1-10)

TEXT: "..."

Return JSON:
{
  "entities": [{"name": "...", "type": "...", "description": "..."}],
  "relations": [{"source": "...", "source_type": "...",
                 "target": "...", "target_type": "...",
                 "type": "...", "description": "...", "strength": 8}]
}

Each chunk produces an ExtractionResult with its own entity and relation lists. Partial results are saved to output.partial.json during processing so progress survives interruptions.
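The concurrency pattern is plain asyncio: a semaphore caps how many calls are in flight while gather preserves chunk order. A sketch with a stubbed LLM call (the real code also writes the partial file after each completion):

```python
import asyncio

MAX_CONCURRENT = 8  # config.max_concurrent default

async def extract_chunk(idx: int, chunk: str) -> dict:
    """Stand-in for the per-chunk LLM extraction call."""
    await asyncio.sleep(0)            # pretend network latency
    return {"chunk": idx, "entities": [], "relations": []}

async def extract_all(chunks):
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(idx, chunk):
        async with sem:               # at most MAX_CONCURRENT in flight
            return await extract_chunk(idx, chunk)

    tasks = [guarded(i, c) for i, c in enumerate(chunks)]
    return await asyncio.gather(*tasks)   # results stay in chunk order

results = asyncio.run(extract_all(["chunk a", "chunk b", "chunk c"]))
```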

LLM Backends

Backend         Default model                    Endpoint
Ollama          local (e.g. qwen2.5:3b)          localhost:11434
OpenAI          gpt-4o-mini                      api.openai.com
Anthropic       claude-haiku-4-5-20251001        api.anthropic.com
Google Gemini   gemini-3.1-flash-lite-preview    generativelanguage.googleapis.com
GitHub Models   claude-opus-4.5                  models.inference.ai.azure.com
LM Studio       local (user-loaded)              localhost:1234

All LLM responses are cached by a hash of (prompt, model, temperature) — repeated runs with the same content are free.
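A cache key over (prompt, model, temperature) can be built by hashing a canonical serialisation of the triple. A sketch (the exact key format in knwler is an implementation detail):

```python
import hashlib
import json

def llm_cache_key(prompt: str, model: str, temperature: float) -> str:
    """Stable digest over the inputs that determine an LLM response."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,               # canonical ordering → stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs always map to the same key, so a rerun is a cache hit;
# changing any of the three inputs produces a different key.
k1 = llm_cache_key("Extract a graph...", "gpt-4o-mini", 0.0)
k2 = llm_cache_key("Extract a graph...", "gpt-4o-mini", 0.0)
assert k1 == k2
```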


6. Consolidation

Module: knwler/consolidation.py → consolidate_extracted_graphs()

The many per-chunk graphs are merged in four phases:

flowchart TD
    R(["[ExtractionResult_0 … n]"]) --> P1

    subgraph Consolidation
        P1["Phase 1 — Aggregate\ndeduplicate by (name, type)"]
        P2["Phase 2 — Summarise descriptions\nLLM merge, batch 20 items/call"]
        P3["Phase 3 — Build final graph\naverage strength values"]
        P4["Phase 4 — Filter noise\nremove zero-degree / trivial nodes"]
        P1 --> P2 --> P3 --> P4
    end

    P4 --> CG([Consolidated Graph])

Phase            What happens
1 — Aggregate    Entities deduplicated by (name.lower(), type.lower()); all descriptions and chunk IDs collected
2 — Summarise    Entities/relations that appeared in multiple chunks get one merged description via LLM (batched 20/call)
3 — Build        Final entity and relation lists assembled; relation strength averaged across chunks
4 — Filter       Removes stopwords, pure numbers, and entities with no graph connections
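Phase 1's dedupe-and-collect step can be sketched as a dictionary keyed by the lowercased (name, type) pair, accumulating every description and chunk ID for later summarisation (entity dict shape assumed from the extraction prompt):

```python
from collections import defaultdict

def aggregate_entities(results):
    """Merge per-chunk entity mentions keyed by (name.lower(), type.lower())."""
    merged = defaultdict(lambda: {"descriptions": [], "chunk_ids": []})
    for chunk_id, entities in results:
        for ent in entities:
            key = (ent["name"].lower(), ent["type"].lower())
            merged[key]["descriptions"].append(ent["description"])
            merged[key]["chunk_ids"].append(chunk_id)
    return dict(merged)

# Two chunks mention the same entity with different casing:
results = [
    (0, [{"name": "Alice Smith", "type": "Person", "description": "CEO."}]),
    (1, [{"name": "alice smith", "type": "person", "description": "Founder."}]),
]
merged = aggregate_entities(results)
```

Both mentions collapse onto one key, and the collected descriptions become the input for Phase 2's LLM merge.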

7. Enrichments

Module: knwler/extras.py

Three LLM tasks run concurrently after consolidation:

flowchart LR
    CG([Consolidated Graph]) --> T[Title extraction\nfirst 3 chunks]
    CG --> S[Summary extraction\n3–5 sentences]
    CG --> RP[Chunk rephrasing\nplainer language per chunk]
    T & S & RP --> KG([KnowledgeGraph])

  • Title — short document title inferred from the opening chunks.
  • Summary — 3–5 sentence overview of the document.
  • Rephrase — each chunk is rewritten in plainer language; used in the HTML report’s toggle (original ↔︎ simplified).

8. Community Detection & Labeling

Module: knwler/clustering.py → cluster_graph()

flowchart TD
    KG([KnowledgeGraph\nentities + relations]) --> NX[Build NetworkX graph\nnodes=entities · edges=relations\nweight=strength]
    NX --> LV[Louvain algorithm\ncommunity partitioning]
    LV --> CL["Communities\n[set_0, set_1, ..., set_k]"]
    CL --> LL[LLM label each community\n1–3 topics + description]
    LL --> OUT([entities tagged\nwith community_id])

  1. The consolidated graph is loaded into NetworkX (entities = nodes, relations = weighted edges).
  2. The Louvain algorithm (louvain_communities(), seed=0 for reproducibility) partitions nodes into disjoint communities.
  3. For each community, an LLM call returns 1–3 short topic labels and a sentence description.
  4. Each entity gains a community_id attribute.
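Steps 1, 2, and 4 can be sketched directly with NetworkX (toy entities and relations; the LLM labeling of step 3 is omitted):

```python
import networkx as nx

entities = [("Alice Smith", "Person"), ("Tech Corp", "Organization"),
            ("Bob Lee", "Person"), ("Acme Inc", "Organization")]
relations = [("Alice Smith", "Tech Corp", 8), ("Bob Lee", "Acme Inc", 6),
             ("Alice Smith", "Bob Lee", 2)]

# entities = nodes, relations = weighted edges (strength becomes weight)
G = nx.Graph()
for name, etype in entities:
    G.add_node(name, type=etype)
for src, dst, strength in relations:
    G.add_edge(src, dst, weight=strength)

# seed=0 keeps the partition reproducible across runs
communities = nx.community.louvain_communities(G, weight="weight", seed=0)

# tag each entity with its community_id, as the pipeline does
for cid, members in enumerate(communities):
    for name in members:
        G.nodes[name]["community_id"] = cid
```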

9. Output

graph.json schema

{
  "id":       UUID,
  "title":    "...",
  "summary":  "...",
  "url":      "source URL (if fetched)",
  "language": "en",
  "schema": {
    "entity_types":   [...],
    "relation_types": [...],
    "reasoning":      "..."
  },
  "graph": {
    "entities": [
      { "name": "Alice Smith", "type": "Person",
        "description": "...", "chunk_ids": [...], "community_id": 0 }
    ],
    "relations": [
      { "source": "Alice Smith", "source_type": "Person",
        "target": "Tech Corp", "target_type": "Organization",
        "type": "works_at", "description": "...",
        "strength": 8.5, "chunk_ids": [...] }
    ],
    "communities": [
      { "id": 0, "topics": ["Leadership", "Management"],
        "description": "...", "members": ["Alice Smith::Person", ...] }
    ]
  },
  "chunks": [
    { "id": UUID, "chunk_idx": 0,
      "text": "original text",
      "rephrase": "simplified version",
      "entities": [...], "relations": [...] }
  ]
}

Output files

File         Contents
graph.json   Canonical structured knowledge graph
index.html   Self-contained interactive report
graph.gml    NetworkX export for Gephi / yEd
log.html     Rich formatted processing log
log.txt      Plain-text processing log
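The graph.gml export is a direct NetworkX serialisation of the graph section. A sketch of rebuilding a graph from graph.json-shaped data and writing GML (the helper name and inline data are illustrative):

```python
import os
import tempfile
import networkx as nx

def graph_json_to_gml(graph: dict, out_path: str) -> None:
    """Rebuild a NetworkX graph from the 'graph' section and write GML."""
    G = nx.Graph()
    for ent in graph["entities"]:
        G.add_node(ent["name"], type=ent["type"])
    for rel in graph["relations"]:
        G.add_edge(rel["source"], rel["target"],
                   type=rel["type"], weight=rel["strength"])
    nx.write_gml(G, out_path)

# Minimal data shaped like the graph.json schema above
data = {
    "entities": [{"name": "Alice Smith", "type": "Person"},
                 {"name": "Tech Corp", "type": "Organization"}],
    "relations": [{"source": "Alice Smith", "target": "Tech Corp",
                   "type": "works_at", "strength": 8.5}],
}
out_path = os.path.join(tempfile.gettempdir(), "graph.gml")
graph_json_to_gml(data, out_path)
```

The resulting file opens directly in Gephi or yEd, with relation strength preserved as the standard GML edge weight.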

HTML Report

The HTML report (knwler/export.py + Jinja2 templates in knwler/templates/) is fully self-contained — it loads no external resources; all CSS and JavaScript are inlined.

Features:

  • Interactive network visualisation with a degree-threshold slider
  • Entity index with type badges and links to source chunks
  • Community / topic overview with member links
  • Chunk viewer with original ↔︎ simplified toggle
  • Dark / light theme switch
  • Entity names in chunk text auto-link to the entity detail panel


Complete Data Flow

flowchart TD
    A([PDF / HTML / URL]) -->|fetch & cache| B([raw text])
    B -->|tiktoken chunking\n400 tok / 50 overlap| C(["[chunk_0 … chunk_n]"])
    C -->|LLM sample| D([Schema\nentity & relation types])
    C -->|LLM sample| E([Language code])
    D & E & C -->|8 concurrent LLM calls per chunk| F(["[ExtractionResult_0 … n]"])
    F -->|aggregate · LLM summarise · filter| G([Consolidated Graph])
    G -->|LLM: title + summary + rephrase| H([Enriched Graph])
    H -->|NetworkX Louvain + LLM labels| I([Communities])
    H & I -->|assemble| J([graph.json])
    J -->|Jinja2 render| K([index.html])
    J -->|NetworkX write| L([graph.gml])