# Features
A complete reference of everything Knwler can do, organized by topic.
## Table of Contents
- Document Ingestion
- LLM Backends
- Automatic Schema Discovery
- Entity Disambiguation
- Knowledge Graph Extraction
- Multi-Document Consolidation
- Community Detection & Topic Assignment
- Multilingual Support
- HTML Report Export
- Export Formats
- Graph Database Integration
- Graph Analytics
- Batch Processing
- Intelligent Caching
- Data Fetching
- Python API
- CLI
- Benchmarking
- Installation
## Document Ingestion
Knwler can ingest documents from multiple sources:
| Source | How |
|---|---|
| Local PDF | --file document.pdf |
| Local text / Markdown | --file document.md |
| Remote PDF (URL) | --file https://example.com/report.pdf |
| Web page (URL) | knwler fetch url https://example.com/page |
| Wikipedia article | knwler fetch wiki "quantum computing" |
| Directory of files | --directory ./pdfs/ |
Long documents are automatically split into overlapping token-based chunks before extraction. The chunk size and overlap are configurable via --max-tokens (default 400) and --overlap-tokens (default 50).
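For example, to use larger chunks with more overlap (the values here are illustrative):

```bash
knwler extract -f long_report.pdf --max-tokens 800 --overlap-tokens 100
```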
See Fetching Data and CLI.
## LLM Backends
Knwler supports five backends. Switch between them with --backend:
| Backend | Type | Default extraction model | Default discovery model | Docs |
|---|---|---|---|---|
| Ollama | Local | qwen2.5:3b | qwen2.5:14b | Ollama |
| LM Studio | Local | glm-4.7-flash | glm-4.7-flash | LM Studio |
| OpenAI | Cloud | gpt-4o-mini | gpt-4o | OpenAI |
| Anthropic | Cloud | claude-haiku-4-5-20251001 | claude-sonnet-4-6 | Anthropic |
| Google Gemini | Cloud | gemini-3.1-flash-lite-preview | gemini-3.1-flash-lite-preview | Models |
The pipeline uses two separate model slots — one for fast chunk-level extraction (--extraction-model) and one for higher-quality discovery and summarization tasks (--discovery-model). Both can be overridden independently on the command line or in Config.
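For example, pairing a fast local extraction model with a larger discovery model:

```bash
knwler extract -f report.pdf --backend ollama \
  --extraction-model qwen2.5:3b --discovery-model qwen2.5:14b
```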
Ollama and LM Studio run fully offline — no data ever leaves your machine.
Tip: Smaller models (3B–14B) consistently outperform larger ones for structured entity/relation extraction. Thinking mode is disabled by default; it adds latency without improving graph quality.
See Models & Providers.
## Automatic Schema Discovery
Before extracting entities, Knwler analyzes a sample of the document and infers the optimal entity types and relation types for that specific document. No manual ontology engineering is required.
```python
from knwler.discovery import discover_schema

schema = await discover_schema(text, config)
# schema.entity_types   → ["Person", "Organization", "Legislation", ...]
# schema.relation_types → ["enacted_by", "amended_by", "applies_to", ...]
```

You can also supply your own schema and bypass discovery entirely.
## Entity Disambiguation
When the same name refers to different real-world things (e.g. Apple the company vs. apple the fruit), Knwler identifies nodes by name + type as a composite key. Disambiguated entities are flagged with type badges in the exported HTML report so readers immediately see the distinction.
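The idea behind the composite key can be sketched in a few lines of plain Python (illustrative only, not Knwler's actual implementation):

```python
# Two "Apple" mentions with different types remain distinct nodes;
# repeated (name, type) pairs merge into a single node.
entities = [
    {"name": "Apple", "type": "Organization", "description": "Tech company"},
    {"name": "Apple", "type": "Fruit", "description": "Edible fruit"},
    {"name": "Apple", "type": "Organization", "description": "Maker of the iPhone"},
]

nodes = {}
for e in entities:
    key = (e["name"], e["type"])  # composite key: name + type
    if key in nodes:
        nodes[key]["description"] += "; " + e["description"]  # naive merge
    else:
        nodes[key] = dict(e)

print(len(nodes))  # 2: the company and the fruit stay separate
```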
## Knowledge Graph Extraction
The extraction pipeline:
1. Chunk the document into overlapping token windows.
2. Detect the language of the document (ISO 639-1 code).
3. Discover the schema — infer entity and relation types.
4. Extract a small subgraph from every chunk in parallel (async).
5. Consolidate the per-chunk graphs into a single unified graph.
6. (Optional) Rephrase chunks and generate a title and summary.
Each entity carries a name, type, description, and importance score. Each relation carries source, target, type, description, and strength.
The final graph.json is the canonical output used by all downstream commands.
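An illustrative fragment of what this implies for graph.json (the field names follow the description above, but the top-level key names are assumptions and the real layout may differ):

```json
{
  "entities": [
    {"name": "European Commission", "type": "Organization",
     "description": "Executive body of the EU", "importance": 0.9}
  ],
  "relations": [
    {"source": "GDPR", "target": "European Commission", "type": "enacted_by",
     "description": "The regulation was enacted by the Commission", "strength": 0.8}
  ]
}
```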
## Multi-Document Consolidation
When processing more than one document, Knwler can merge all resulting knowledge graphs into a single unified graph. Entity descriptions from multiple sources are intelligently summarized by an LLM so that the merged description is coherent rather than duplicated.
```bash
# Consolidate all graphs in a results directory
knwler consolidate --input ./results/

# Process a directory and consolidate in one step
knwler extract --directory ./pdfs/ --consolidate --backend openai
```

Multi-document graphs carry a document node linking entities to the source file they came from.
See CLI.
## Community Detection & Topic Assignment
After the graph is assembled, Knwler runs the Louvain community-detection algorithm to discover clusters of densely connected entities. An LLM then labels each community with a human-readable topic name.
The result is an instant thematic map of the document — “Tax Liability”, “Regulatory Bodies”, “Supply Chain Risks”, etc. — without any manual tagging.
Clusters are listed in the HTML report and are accessible in graph.json under clusters.
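Reading them back is a one-liner (a minimal sketch, assuming the clusters key described above):

```python
import json

with open("results/graph.json") as f:
    graph = json.load(f)

for cluster in graph["clusters"]:
    print(cluster)  # cluster structure as stored by Knwler
```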
## Multilingual Support
Language is auto-detected on every run using an LLM call. All extraction prompts and the HTML report UI are then rendered in the detected language.
Supported languages in v1.0:
- English
- German
- French
- Spanish
- Dutch
- Italian
- Portuguese
- Simplified Chinese
All language strings (prompts, UI labels, console output) live in a single languages.json file. Adding a new language requires only adding a translated entry to that file.
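The exact key layout of languages.json isn't reproduced here, but a new language would be one more top-level entry along these lines (the language code, keys, and strings below are all hypothetical):

```json
{
  "da": {
    "ui_graph_view": "Grafvisning",
    "prompt_extract_entities": "Udtræk entiteter og relationer fra teksten..."
  }
}
```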
## HTML Report Export
Every extraction produces a self-contained HTML file (index.html) that can be opened in any browser without a server or extra dependencies:
- Interactive network visualization with a degree-threshold slider to reduce visual clutter
- Dark / light theme toggle
- Entity index with type badges and source-chunk links
- Topic / cluster overview
- Rephrased text chunks with Markdown rendered correctly
- Document title and summary
### Templates
Three built-in templates ship with Knwler:
| Template | Description |
|---|---|
| default | Standard report with network visualization |
| columns | 3-column layout without graph visualization |
| graph_analysis | Graph-visualization-focused with custom layout |
Templates are Jinja2 files — fully customizable for your brand or use case. A blank template provides the bare scaffolding as a starting point.
```bash
knwler extract -f document.pdf --template columns
```

Re-rendering with a different template uses the cache and requires zero additional LLM calls.
See Templates and HTML Export.
## Export Formats
| Format | File | Use |
|---|---|---|
| JSON | graph.json | Canonical output; used by all downstream commands |
| GML | graph.gml | Open in yEd, Gephi, or any GML-compatible tool |
| GraphML | via graph convert | XML-based; compatible with graph analysis libraries |
| JSONLD / RDF | via graph convert --format jsonld | Triple stores: GraphDB, StarDog, AWS Neptune |
| HTML | index.html | Self-contained interactive report |
Convert between formats with:
```bash
knwler graph convert --format gml
knwler graph convert --format graphml
knwler graph convert --format jsonld
```

JSONLD output can be re-serialized to Turtle, N-Triples, or any other RDF format using rdflib.
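For example, using rdflib to turn the JSONLD export into Turtle (the file paths are assumptions; adjust them to where graph convert wrote its output):

```python
from rdflib import Graph

g = Graph()
g.parse("results/graph.jsonld", format="json-ld")  # json-ld is built into rdflib >= 6.0
g.serialize(destination="results/graph.ttl", format="turtle")
print(f"{len(g)} triples written")
```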
See JSONLD / RDF and Visualization.
## Graph Database Integration
Import scripts and documented workflows are provided for:
| Target | Method | Docs |
|---|---|---|
| Neo4j | integrations/neo4j_import.py — creates constraints and indexes | Neo4j |
| SurrealDB | integrations/surreal_import.py | SurrealDB |
| GraphDB | Import JSONLD into any triple store | GraphDB |
| AWS Neptune | SPARQL, OpenCypher, or bulk loading via integrations/aws_import.py | Neptune |
```bash
uv run integrations/neo4j_import.py ./results/graph.json
uv run integrations/surreal_import.py ./results/graph.json
knwler graph convert --format jsonld   # then load into GraphDB / Neptune
```

## Graph Analytics
knwler graph analyze runs a suite of graph-theoretic analyses on a graph.json file and produces a comprehensive HTML report:
- Most important entities (by degree centrality, betweenness, etc.)
- Most significant chunks
- Community / cluster structure
- Key statistics (node count, edge count, density, etc.)
```bash
knwler graph analyze ./results/graph.json
```

The report is generated from a Jinja2 template and is fully customizable.
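If you'd rather compute such statistics yourself, graph.json loads cleanly into networkx (a sketch; the entity/relation key names are assumptions based on the fields described earlier):

```python
import json
import networkx as nx

data = json.load(open("results/graph.json"))

G = nx.DiGraph()
for e in data["entities"]:
    G.add_node(e["name"], type=e["type"])
for r in data["relations"]:
    G.add_edge(r["source"], r["target"], type=r["type"])

print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("density:", nx.density(G))
# Five most central entities by degree centrality
top = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:5]
print("most central:", top)
```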
See Graph Analytics.
## Batch Processing
For large document sets, use the OpenAI or Gemini batch APIs to reduce cost and processing time:
- Performs the full extraction pipeline and packages calls into batch requests.
- Submits batches, polls for completion (default: every 10 minutes), and downloads results automatically.
- Uses a SQLite database to track state — interrupted runs resume exactly where they left off.
- An optional --consolidate flag merges all results at the end.
```bash
# Start or resume a batch job
knwler batch run --input ./documents/ --output ./results/ --backend openai

# Check job status
knwler batch run --input ./documents/ --output ./results/ --status

# Consolidate results after the batch completes
knwler batch consolidate --input ./documents/ --output ./results/
```

See Batch Processing.
## Intelligent Caching
Every LLM response, fetched document, parsed PDF, and Wikipedia article is hashed and stored locally under ~/.knwler/cache. The cache is organized into four namespaces:
| Namespace | Contents |
|---|---|
| llm | All LLM API responses (all backends) |
| documents | Fetched and parsed PDF / DOCX files |
| webpages | Fetched web pages |
| wikipedia | Fetched Wikipedia articles |
Re-running with a different template, export format, or backend configuration replays from cache — zero additional API calls and zero cost.
```bash
knwler cache clear        # clear everything
knwler cache clear --llm  # clear only LLM cache
```

See Caching.
## Data Fetching
Knwler can retrieve documents directly from the web without a manual download step:
```bash
# Fetch a remote PDF and run extraction in one step
knwler fetch url https://example.com/report.pdf --parse

# Fetch a webpage and extract from it
knwler fetch url https://example.com/article --parse

# Fetch a Wikipedia article by topic name
knwler fetch wiki "quantum computing"

# Fetch and open in browser after extraction
knwler fetch wiki "quantum computing" --open
```

All fetched content is cached so subsequent runs are instantaneous.
See Fetching Data.
## Python API
The full pipeline is exposed as an async Python library for embedding in your own applications or notebooks.
```python
import asyncio

from knwler.api import extract_graph  # high-level one-call API
from knwler.chunking import chunk_text
from knwler.config import Config
from knwler.discovery import detect_language, discover_schema
from knwler.extras import rephrase_chunks, extract_title, extract_summary
from knwler.extraction import extract_chunk, extract_all
from knwler.consolidation import consolidate_extracted_graphs

async def main():
    text = open("my_document.md").read()
    config = Config(backend="openai", openai_api_key="sk-...")

    chunks = chunk_text(text, config)
    lang = await detect_language(text, config)
    schema = await discover_schema(text, config)
    title = await extract_title(chunks, config)
    graphs = await extract_all(chunks, schema, config)
    consolidated, elapsed = await consolidate_extracted_graphs(graphs, config)

asyncio.run(main())
```

Key classes and modules:
| Symbol | Description |
|---|---|
| Config | Central configuration dataclass accepted by every API function |
| chunk_text() | Split text into overlapping token-based chunks |
| detect_language() | Return ISO 639-1 code for the document's language |
| discover_schema() | Infer entity and relation types from a document sample |
| extract_all() | Async extraction of all chunks in parallel |
| consolidate_extracted_graphs() | Merge per-chunk graphs into one unified graph |
| extract_title() | Generate a document title from chunks |
| extract_summary() | Generate a document summary from chunks |
| rephrase_chunks() | Rephrase raw chunks for readability |
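The extract_graph helper imported above wraps the whole sequence in one call; a minimal sketch, assuming it takes the document text and a Config (the exact signature isn't documented here):

```python
import asyncio
from knwler.api import extract_graph
from knwler.config import Config

async def main():
    text = open("my_document.md").read()
    graph = await extract_graph(text, Config(backend="ollama"))  # assumed signature

asyncio.run(main())
```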
See API documentation.
## CLI
The CLI is organized into subcommands. When installed via pipx or pip, use knwler; when running from source, use uv run main.py.
| Command | Description |
|---|---|
| extract | Extract a knowledge graph from a file, URL, or directory (default) |
| fetch url | Fetch and optionally parse a URL (PDF or web page) |
| fetch wiki | Fetch and optionally parse a Wikipedia article |
| consolidate | Merge multiple knowledge graphs into one |
| graph convert | Convert graph.json to GML, GraphML, JSONLD, etc. |
| graph analyze | Produce an analytical report of the knowledge graph |
| batch run | Batch process a directory using the OpenAI or Gemini batch API |
| batch consolidate | Consolidate results from a batch run |
| cache clear | Clear cached LLM responses, documents, or Wikipedia articles |
| demo | Run a quick demo extraction on the Human Rights Declaration |
| info | Show version and configuration info |
Commonly used extract flags:
| Flag | Default | Description |
|---|---|---|
| --file / -f | — | Input file path or URL |
| --directory / --dir | — | Process all files in a directory |
| --backend | ollama | LLM backend to use |
| --extraction-model | backend default | Override the extraction model |
| --discovery-model | backend default | Override the discovery model |
| --output | ./results/ | Output directory |
| --overwrite | false | Overwrite existing output instead of creating a new versioned directory |
| --template | default | HTML report template |
| --max-tokens | 400 | Chunk size in tokens |
| --html-only | false | Re-render the HTML report from an existing graph.json |
| --consolidate | false | Consolidate after processing a directory |
See CLI Reference.
## Benchmarking
The built-in benchmark suite lets you compare extraction quality and speed across any combination of backends and models:
```bash
uv run main.py benchmark run
```

The benchmark grid (benchmark/grid.json) is fully configurable. Results are scored using the Knowledge Yield Score (KYS):
\[ \text{KYS} = \sqrt{\frac{\text{graph\_size}}{\max(\text{graph\_size})} \cdot \frac{\min(\text{total\_time})}{\text{total\_time}}} \]
KYS is a geometric mean of normalized quality (graph size) and normalized speed — it cannot be gamed by excelling at only one dimension. Results are rendered as a sortable HTML report.
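A quick worked example of the formula (the run data below is made up):

```python
import math

# (graph_size, total_time_seconds) per benchmark run
runs = {"model-a": (120, 30.0), "model-b": (200, 90.0), "model-c": (150, 45.0)}

max_size = max(size for size, _ in runs.values())
min_time = min(t for _, t in runs.values())

for name, (size, t) in runs.items():
    kys = math.sqrt((size / max_size) * (min_time / t))
    print(f"{name}: KYS = {kys:.3f}")
# model-a scores highest (0.775): a decent graph at the fastest time beats
# the largest graph produced three times slower (0.577)
```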
See Benchmark.
## Installation
Knwler requires Python 3.12.
```bash
# Recommended: isolated global install
pipx install knwler

# As a project dependency
uv add knwler
pip install knwler

# From source
git clone https://github.com/Orbifold/knwler
cd knwler
uv sync               # core dependencies only
uv sync --all-groups  # include Neo4j, SurrealDB, data-collection extras
```

Optional dependency groups:
| Group | Contents |
|---|---|
| neo4j | Neo4j Python driver |
| surrealdb | SurrealDB Python driver |
| collect | Document and web-page collection utilities |
See Setup & Installation and pipx.