```mermaid
flowchart TD
    A([PDF / HTML / URL]) --> B[Fetch & Cache]
    B --> C[Text Chunking]
    C --> D[Language Detection]
    C --> E[Schema Discovery]
    D & E --> F[Parallel Chunk Extraction]
    F --> G[Consolidation]
    G --> H[Enrichments\ntitle · summary · rephrase]
    G --> I[Community Detection]
    H & I --> J[graph.json]
    J --> K[HTML Report\nindex.html]
    J --> L[GML Export\ngraph.gml]
```
# Under the hood

This page walks through the full processing pipeline knwler runs when you give it a PDF or an HTML page — from raw bytes to an interactive knowledge-graph report.

## Overview
## 1. Entry Point

Processing is triggered from the CLI:

```bash
python main.py extract -f document.pdf --backend openai --max-tokens 400
```

`main.py` routes to `knwler/cli.py`, which dispatches to `knwler/cli_extract.py`. The programmatic API entry point is `extract_file()` in `knwler/api.py`.
## 2. Fetch & Collect

Module: `knwler/collect/webpage.py` — `WebpageCollector`
The source is normalised and fetched depending on its type:
| Source | Method | Output |
|---|---|---|
| Local PDF | `fitz.open()` text extraction | Plain text |
| Local text / Markdown | `Path.read_text()` | Plain text |
| URL → PDF | HTTP GET, bytes cached | Bytes |
| URL → HTML page | HTTP GET → BeautifulSoup → markdownify | Markdown text |
| Wikipedia URL | `WikipediaCollector` | Markdown + metadata |
Every fetch is hashed and stored under ~/.knwler/cache/ so repeated runs skip the network entirely.
```mermaid
flowchart LR
    subgraph Input
        A1[Local PDF]
        A2[Local MD/TXT]
        A3[Web URL]
        A4[Wikipedia URL]
    end
    subgraph WebpageCollector
        B1[fitz PDF extraction]
        B2[Path.read_text]
        B3[HTTP GET → markdownify]
        B4[WikipediaCollector]
        C[(~/.knwler/cache/)]
    end
    A1 --> B1
    A2 --> B2
    A3 --> B3
    A4 --> B4
    B1 & B2 & B3 & B4 --> C
    C --> D([raw text])
```
## 3. Text Chunking

Module: `knwler/chunking.py` — `chunk_text()`
The raw text is tokenised with tiktoken (`cl100k_base` encoder) and cut into overlapping windows:

- Window size: `config.max_tokens` (default 400 tokens)
- Overlap: `config.overlap_tokens` (default 50 tokens)
- Sentence-boundary awareness: breaks at `.`, `!`, or `?` near the end of each window to avoid mid-sentence splits
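The windowing logic can be sketched without the real tokeniser. In the sketch below a plain word list stands in for tiktoken token ids, and the look-back distance of 30 tokens is an assumption, not knwler's actual value:

```python
def chunk_tokens(tokens, max_tokens=400, overlap=50):
    """Cut a token list into overlapping windows, preferring to end
    each window just after a sentence terminator ('.', '!' or '?')."""
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        if end < len(tokens):
            # Scan backwards a short distance for a sentence-ending token.
            for i in range(end - 1, max(start, end - 30), -1):
                if tokens[i].endswith((".", "!", "?")):
                    end = i + 1
                    break
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        # Next window starts `overlap` tokens before this one ended
        # (guarded so we always make forward progress).
        start = max(end - overlap, start + 1)
    return chunks
```

The overlap means entities mentioned near a window edge appear in two chunks, so the extractor sees them with context on both sides.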
```mermaid
flowchart LR
    T([raw text]) --> TK[tiktoken tokenise]
    TK --> W["sliding window\n400 tokens / 50 overlap"]
    W --> SB[sentence boundary\nadjustment]
    SB --> CH(["[chunk_0, chunk_1, …, chunk_n]"])
```
## 4. Language Detection & Schema Discovery

Module: `knwler/discovery.py`
Two LLM calls run on a small sample of the text before the expensive per-chunk extraction begins.
```mermaid
sequenceDiagram
    participant D as discovery.py
    participant L as LLM
    D->>L: detect_language(sample)
    L-->>D: "en" (ISO 639-1)
    D->>L: discover_schema(sample from start+mid+end)
    L-->>D: Schema { entity_types, relation_types, reasoning }
```
Schema discovery prompt (condensed):
```text
Analyze this text and identify the best entity types and relation types
for a knowledge graph.

TEXT SAMPLE: [~4000 chars from beginning, middle, end]

Guidelines:
- Use snake_case for type names
- Focus on types that appear multiple times

Return JSON:
{
  "entity_types": ["Person", "Organization", ...],
  "relation_types": ["works_at", "founded_by", ...],
  "reasoning": "Brief explanation"
}
```
The resulting Schema drives all subsequent extraction — the LLM is told exactly which types to look for.
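Turning that reply into a usable `Schema` is a small parse-and-validate step; a sketch (the dataclass fields follow the JSON shape above, but this is not knwler's actual implementation):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Schema:
    entity_types: list = field(default_factory=list)
    relation_types: list = field(default_factory=list)
    reasoning: str = ""

def parse_schema(raw: str) -> Schema:
    """Parse the schema-discovery reply, tolerating missing or odd keys."""
    data = json.loads(raw)
    return Schema(
        entity_types=[t for t in data.get("entity_types", []) if isinstance(t, str)],
        relation_types=[t for t in data.get("relation_types", []) if isinstance(t, str)],
        reasoning=str(data.get("reasoning", "")),
    )
```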
## 5. Parallel Chunk Extraction

Module: `knwler/extraction.py` — `extract_all()`
For every chunk an async LLM call is made. Up to config.max_concurrent (default 8) run in parallel, controlled by a semaphore.
```mermaid
flowchart TD
    CH(["[chunk_0 … chunk_n]"]) --> S{semaphore\nmax 8 concurrent}
    S --> E0[extract chunk_0]
    S --> E1[extract chunk_1]
    S --> EN[extract chunk_n]
    E0 & E1 & EN --> P[(partial.json\nincremental save)]
    P --> R(["[ExtractionResult_0 … ExtractionResult_n]"])
```
Extraction prompt (condensed):
```text
Extract a knowledge graph from the text below.

ENTITY TYPES: [Person, Organization, ...]
RELATION TYPES: [works_at, founded_by, ...]

RULES:
- Only extract what is explicitly stated in the text
- Each entity: name, type, 1-2 sentence description
- Each relation: source, source_type, target, target_type,
  type, description, strength (1-10)

TEXT: "..."

Return JSON:
{
  "entities": [{"name": "...", "type": "...", "description": "..."}],
  "relations": [{"source": "...", "source_type": "...",
                 "target": "...", "target_type": "...",
                 "type": "...", "description": "...", "strength": 8}]
}
```
Each chunk produces an ExtractionResult with its own entity and relation lists. Partial results are saved to output.partial.json during processing so progress survives interruptions.
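The semaphore fan-out described above is a standard asyncio pattern; a self-contained sketch (an async stub stands in for the real LLM call, and the incremental-save step is omitted):

```python
import asyncio

async def extract_all(chunks, extract_one, max_concurrent=8):
    """Run extract_one over every chunk, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(idx, chunk):
        async with sem:  # blocks once max_concurrent calls are in flight
            return idx, await extract_one(chunk)

    pairs = await asyncio.gather(*(bounded(i, c) for i, c in enumerate(chunks)))
    # gather preserves submission order; carrying the index makes it explicit
    return [result for _, result in sorted(pairs)]
```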
### LLM Backends
| Backend | Default model | Endpoint |
|---|---|---|
| Ollama | local (e.g. `qwen2.5:3b`) | `localhost:11434` |
| OpenAI | `gpt-4o-mini` | `api.openai.com` |
| Anthropic | `claude-haiku-4-5-20251001` | `api.anthropic.com` |
| Google Gemini | `gemini-3.1-flash-lite-preview` | `generativelanguage.googleapis.com` |
| GitHub Models | `claude-opus-4.5` | `models.inference.ai.azure.com` |
| LM Studio | local (user-loaded) | `localhost:1234` |
All LLM responses are cached by a hash of (prompt, model, temperature) — repeated runs with the same content are free.
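A stable key over those three fields is easy to reproduce; a sketch (the exact key format knwler uses is not documented here):

```python
import hashlib
import json

def llm_cache_key(prompt: str, model: str, temperature: float) -> str:
    # json.dumps with sort_keys gives a stable, unambiguous serialisation,
    # so the same (prompt, model, temperature) triple always hashes alike.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing any of the three inputs changes the key, so a temperature tweak or model switch correctly bypasses stale cache entries.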
## 6. Consolidation

Module: `knwler/consolidation.py` — `consolidate_extracted_graphs()`
The many per-chunk graphs are merged in four phases:
```mermaid
flowchart TD
    R(["[ExtractionResult_0 … n]"]) --> P1
    subgraph Consolidation
        P1["Phase 1 — Aggregate\ndeduplicate by (name, type)"]
        P2["Phase 2 — Summarise descriptions\nLLM merge, batch 20 items/call"]
        P3["Phase 3 — Build final graph\naverage strength values"]
        P4["Phase 4 — Filter noise\nremove zero-degree / trivial nodes"]
        P1 --> P2 --> P3 --> P4
    end
    P4 --> CG([Consolidated Graph])
```
| Phase | What happens |
|---|---|
| 1 — Aggregate | Entities deduplicated by (name.lower(), type.lower()); all descriptions and chunk IDs collected |
| 2 — Summarise | Entities/relations that appeared in multiple chunks get one merged description via LLM (batched 20/call) |
| 3 — Build | Final entity and relation lists assembled; relation strength averaged across chunks |
| 4 — Filter | Removes stopwords, pure numbers, and entities with no graph connections |
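Phases 1 and 3 reduce to keyed merges; a sketch of the dedup key and the strength averaging (dict shapes are assumed from the table above, not taken from knwler's code):

```python
from collections import defaultdict

def merge_entities(entities):
    """Deduplicate entities by (name.lower(), type.lower()),
    collecting every description and chunk id seen for that key."""
    merged = {}
    for e in entities:
        key = (e["name"].lower(), e["type"].lower())
        slot = merged.setdefault(key, {**e, "descriptions": [], "chunk_ids": []})
        slot["descriptions"].append(e["description"])
        slot["chunk_ids"].extend(e.get("chunk_ids", []))
    return list(merged.values())

def average_strengths(relations):
    """Average the strength of each (source, type, target) relation
    across all chunks that reported it."""
    buckets = defaultdict(list)
    for r in relations:
        buckets[(r["source"], r["type"], r["target"])].append(r["strength"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```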
## 7. Enrichments

Module: `knwler/extras.py`
Three LLM tasks run concurrently after consolidation:
```mermaid
flowchart LR
    CG([Consolidated Graph]) --> T[Title extraction\nfirst 3 chunks]
    CG --> S[Summary extraction\n3–5 sentences]
    CG --> RP[Chunk rephrasing\nplainer language per chunk]
    T & S & RP --> KG([KnowledgeGraph])
```
- Title — short document title inferred from the opening chunks.
- Summary — 3–5 sentence overview of the document.
- Rephrase — each chunk is rewritten in plainer language; used in the HTML report’s toggle (original ↔︎ simplified).
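Running the three enrichments concurrently is a single `asyncio.gather`; a sketch with hypothetical task callables standing in for the real LLM-backed ones:

```python
import asyncio

async def enrich(graph, get_title, get_summary, rephrase_chunks):
    """Start title, summary and rephrase tasks at once; wait for all three."""
    title, summary, rephrased = await asyncio.gather(
        get_title(graph),
        get_summary(graph),
        rephrase_chunks(graph),
    )
    return {"title": title, "summary": summary, "rephrased": rephrased}
```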
## 8. Community Detection & Labeling

Module: `knwler/clustering.py` — `cluster_graph()`
```mermaid
flowchart TD
    KG([KnowledgeGraph\nentities + relations]) --> NX[Build NetworkX graph\nnodes=entities · edges=relations\nweight=strength]
    NX --> LV[Louvain algorithm\ncommunity partitioning]
    LV --> CL["Communities\n[set_0, set_1, ..., set_k]"]
    CL --> LL[LLM label each community\n1–3 topics + description]
    LL --> OUT([entities tagged\nwith community_id])
```
- The consolidated graph is loaded into NetworkX (entities = nodes, relations = weighted edges).
- The Louvain algorithm (`louvain_communities()`, `seed=0` for reproducibility) partitions nodes into disjoint communities.
- For each community, an LLM call returns 1–3 short topic labels and a one-sentence description.
- Each entity gains a `community_id` attribute.
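The clustering step maps directly onto NetworkX; a minimal sketch, assuming entities are node names and relations are `(source, target, strength)` triples with strength used as edge weight (knwler's real node keys include the entity type):

```python
import networkx as nx

def cluster_entities(entities, relations, seed=0):
    """Partition entities into Louvain communities; returns {entity: community_id}."""
    G = nx.Graph()
    G.add_nodes_from(entities)
    for src, dst, strength in relations:
        G.add_edge(src, dst, weight=strength)
    communities = nx.community.louvain_communities(G, weight="weight", seed=seed)
    # Tag every entity with the index of the community it landed in.
    return {node: i for i, group in enumerate(communities) for node in group}
```

Fixing `seed` makes the (otherwise randomised) Louvain partition reproducible across runs, which is why the pipeline pins it to 0.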
## 9. Output

### `graph.json` schema
```jsonc
{
  "id": UUID,
  "title": "...",
  "summary": "...",
  "url": "source URL (if fetched)",
  "language": "en",
  "schema": {
    "entity_types": [...],
    "relation_types": [...],
    "reasoning": "..."
  },
  "graph": {
    "entities": [
      { "name": "Alice Smith", "type": "Person",
        "description": "...", "chunk_ids": [...], "community_id": 0 }
    ],
    "relations": [
      { "source": "Alice Smith", "source_type": "Person",
        "target": "Tech Corp", "target_type": "Organization",
        "type": "works_at", "description": "...",
        "strength": 8.5, "chunk_ids": [...] }
    ],
    "communities": [
      { "id": 0, "topics": ["Leadership", "Management"],
        "description": "...", "members": ["Alice Smith::Person", ...] }
    ]
  },
  "chunks": [
    { "id": UUID, "chunk_idx": 0,
      "text": "original text",
      "rephrase": "simplified version",
      "entities": [...], "relations": [...] }
  ]
}
```
### Output files
| File | Contents |
|---|---|
| `graph.json` | Canonical structured knowledge graph |
| `index.html` | Self-contained interactive report |
| `graph.gml` | NetworkX export for Gephi / yEd |
| `log.html` | Rich formatted processing log |
| `log.txt` | Plain-text processing log |
### HTML Report

The HTML report (`knwler/export.py` plus Jinja2 templates in `knwler/templates/`) is fully self-contained: no external dependencies, all CSS and JavaScript inlined.

Features:

- Interactive network visualisation with a degree-threshold slider
- Entity index with type badges and links to source chunks
- Community / topic overview with member links
- Chunk viewer with original ↔︎ simplified toggle
- Dark / light theme switch
- Entity names in chunk text auto-link to the entity detail panel
## Complete Data Flow
```mermaid
flowchart TD
    A([PDF / HTML / URL]) -->|fetch & cache| B([raw text])
    B -->|tiktoken chunking\n400 tok / 50 overlap| C(["[chunk_0 … chunk_n]"])
    C -->|LLM sample| D([Schema\nentity & relation types])
    C -->|LLM sample| E([Language code])
    D & E & C -->|8 concurrent LLM calls per chunk| F(["[ExtractionResult_0 … n]"])
    F -->|aggregate · LLM summarise · filter| G([Consolidated Graph])
    G -->|LLM: title + summary + rephrase| H([Enriched Graph])
    H -->|NetworkX Louvain + LLM labels| I([Communities])
    H & I -->|assemble| J([graph.json])
    J -->|Jinja2 render| K([index.html])
    J -->|NetworkX write| L([graph.gml])
```