Features

A complete reference of everything Knwler can do, organized by topic.


Table of Contents

  1. Document Ingestion
  2. LLM Backends
  3. Automatic Schema Discovery
  4. Entity Disambiguation
  5. Knowledge Graph Extraction
  6. Multi-Document Consolidation
  7. Community Detection & Topic Assignment
  8. Multilingual Support
  9. HTML Report Export
  10. Export Formats
  11. Graph Database Integration
  12. Graph Analytics
  13. Batch Processing
  14. Intelligent Caching
  15. Data Fetching
  16. Python API
  17. CLI
  18. Benchmarking
  19. Installation

Document Ingestion

Knwler can ingest documents from multiple sources:

Source How
Local PDF --file document.pdf
Local text / Markdown --file document.md
Remote PDF (URL) --file https://example.com/report.pdf
Web page (URL) knwler fetch url https://example.com/page
Wikipedia article knwler fetch wiki "quantum computing"
Directory of files --directory ./pdfs/

Long documents are automatically split into overlapping token-based chunks before extraction. The chunk size and overlap are configurable via --max-tokens (default 400) and --overlap-tokens (default 50).
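The overlapping-window idea can be sketched in a few lines (a simplified stand-in that slides over a pre-tokenized list; the real pipeline counts model tokens via chunk_text):

```python
def chunk_tokens(tokens, max_tokens=400, overlap_tokens=50):
    """Split a token list into overlapping windows of max_tokens,
    where consecutive windows share overlap_tokens tokens."""
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
# The tail of each chunk repeats at the head of the next one, so
# entities straddling a chunk boundary are still seen in full context.
```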

See Fetching Data and CLI.


LLM Backends

Knwler supports five backends. Switch between them with --backend:

Backend Type Default extraction model Default discovery model Docs
Ollama Local qwen2.5:3b qwen2.5:14b Ollama
LM Studio Local glm-4.7-flash glm-4.7-flash LM Studio
OpenAI Cloud gpt-4o-mini gpt-4o OpenAI
Anthropic Cloud claude-haiku-4-5-20251001 claude-sonnet-4-6 Anthropic
Google Gemini Cloud gemini-3.1-flash-lite-preview gemini-3.1-flash-lite-preview Models

The pipeline uses two separate model slots — one for fast chunk-level extraction (--extraction-model) and one for higher-quality discovery and summarization tasks (--discovery-model). Both can be overridden independently on the command line or in Config.

Ollama and LM Studio run fully offline — no data ever leaves your machine.

Tip: Smaller models (3B–14B) consistently outperform larger ones for structured entity/relation extraction. Thinking mode is disabled by default; it adds latency without improving graph quality.

See Models & Providers.


Automatic Schema Discovery

Before extracting entities, Knwler analyzes a sample of the document and infers the optimal entity types and relation types for that specific document. No manual ontology engineering is required.

from knwler.discovery import discover_schema

schema = await discover_schema(text, config)
# schema.entity_types  → ["Person", "Organization", "Legislation", ...]
# schema.relation_types → ["enacted_by", "amended_by", "applies_to", ...]

You can also supply your own schema and bypass discovery entirely.
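A user-supplied schema is just the same two lists. The Schema class below is a hypothetical stand-in for illustration; the real class lives in knwler and may differ in detail:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the object returned by discover_schema;
# the attribute names match those shown in the snippet above.
@dataclass
class Schema:
    entity_types: list[str] = field(default_factory=list)
    relation_types: list[str] = field(default_factory=list)

# A hand-written schema for legal documents, bypassing discovery.
legal_schema = Schema(
    entity_types=["Person", "Organization", "Legislation"],
    relation_types=["enacted_by", "amended_by", "applies_to"],
)
```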


Entity Disambiguation

When the same name refers to different real-world things (e.g. Apple the company vs. apple the fruit), Knwler identifies nodes by name + type as a composite key. Disambiguated entities are flagged with type badges in the exported HTML report so readers immediately see the distinction.
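The composite-key idea can be illustrated with plain dictionaries (a sketch, not Knwler's internal representation):

```python
nodes = {}

def add_entity(name, entity_type, description):
    # (name, type) is the composite key, so "Apple"/"Organization"
    # and "Apple"/"Food" remain distinct nodes.
    nodes[(name, entity_type)] = {
        "name": name, "type": entity_type, "description": description,
    }

add_entity("Apple", "Organization", "Consumer electronics company")
add_entity("Apple", "Food", "Edible fruit of the apple tree")
```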


Knowledge Graph Extraction

The extraction pipeline:

  1. Chunk the document into overlapping token windows.
  2. Detect language of the document (ISO 639-1 code).
  3. Discover schema — infer entity and relation types.
  4. Extract a small subgraph from every chunk in parallel (async).
  5. Consolidate the per-chunk graphs into a single unified graph.
  6. (Optional) Rephrase chunks and generate a title and summary.

Each entity carries a name, type, description, and importance score. Each relation carries source, target, type, description, and strength.

The final graph.json is the canonical output used by all downstream commands.
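Assuming graph.json mirrors the fields listed above (the exact key names are an assumption for illustration), working with it downstream is plain JSON:

```python
import json

# Hypothetical graph.json payload using the entity and relation
# fields described above.
graph_json = """
{
  "entities": [
    {"name": "EU AI Act", "type": "Legislation",
     "description": "EU regulation on AI systems", "importance": 0.9},
    {"name": "Providers", "type": "Organization",
     "description": "Companies placing AI systems on the market",
     "importance": 0.6}
  ],
  "relations": [
    {"source": "Providers", "target": "EU AI Act", "type": "applies_to",
     "description": "Obligations for providers", "strength": 0.8}
  ]
}
"""
graph = json.loads(graph_json)
top = max(graph["entities"], key=lambda e: e["importance"])
```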


Multi-Document Consolidation

When processing more than one document, Knwler can merge all resulting knowledge graphs into a single unified graph. Entity descriptions from multiple sources are intelligently summarized by an LLM so that the merged description is coherent rather than duplicated.

# Consolidate all graphs in a results directory
knwler consolidate --input ./results/

# Process a directory and consolidate in one step
knwler extract --directory ./pdfs/ --consolidate --backend openai

Multi-document graphs carry document nodes that link each entity to the source file it came from.


See CLI.


Community Detection & Topic Assignment

After the graph is assembled, Knwler runs the Louvain community-detection algorithm to discover clusters of densely connected entities. An LLM then labels each community with a human-readable topic name.

The result is an instant thematic map of the document — “Tax Liability”, “Regulatory Bodies”, “Supply Chain Risks”, etc. — without any manual tagging.

Clusters are listed in the HTML report and are accessible in graph.json under clusters.
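Knwler's clustering uses Louvain (available in networkx as louvain_communities). As a dependency-free illustration of the same idea, here is a toy label-propagation pass, a simpler community-detection technique, run on two triangles joined by a bridge:

```python
def propagate_labels(adj, max_rounds=20):
    """Toy synchronous label propagation: every node repeatedly adopts
    the most frequent label among its neighbours (ties broken by the
    smallest label) until labels stabilize."""
    labels = {n: n for n in adj}
    for _ in range(max_rounds):
        new = {}
        for node, neighbours in adj.items():
            counts = {}
            for m in neighbours:
                counts[labels[m]] = counts.get(labels[m], 0) + 1
            top = max(counts.values())
            new[node] = min(l for l, c in counts.items() if c == top)
        if new == labels:
            break
        labels = new
    return labels

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels = propagate_labels(adj)
communities = {}
for node, label in labels.items():
    communities.setdefault(label, set()).add(node)
```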


Multilingual Support

Language is auto-detected on every run using an LLM call. All extraction prompts and the HTML report UI are then rendered in the detected language.

Supported languages in v1.0:

  • English
  • German
  • French
  • Spanish
  • Dutch
  • Italian
  • Portuguese
  • Simplified Chinese

All language strings (prompts, UI labels, console output) live in a single languages.json file. Adding a new language requires only adding a translated entry to that file.

See Language & Localization.


HTML Report Export

Every extraction produces a self-contained HTML file (index.html) that can be opened in any browser without a server or extra dependencies:

  • Interactive network visualization with a degree-threshold slider to reduce visual clutter
  • Dark / light theme toggle
  • Entity index with type badges and source-chunk links
  • Topic / cluster overview
  • Rephrased text chunks with Markdown rendered correctly
  • Document title and summary

Templates

Three built-in templates ship with Knwler:

Template Description
default Standard report with network visualization
columns 3-column layout without graph visualization
graph_analysis Graph-visualization-focused with custom layout

Templates are Jinja2 files — fully customizable for your brand or use case. A blank template provides the bare scaffolding as a starting point.

knwler extract -f document.pdf --template columns

Re-rendering with a different template uses the cache and requires zero additional LLM calls.

See Templates and HTML Export.


Export Formats

Format File Use
JSON graph.json Canonical output; used by all downstream commands
GML graph.gml Open in yEd, Gephi, or any GML-compatible tool
GraphML via graph convert XML-based; compatible with graph analysis libraries
JSONLD / RDF via graph convert --format jsonld Triple stores: GraphDB, StarDog, AWS Neptune
HTML index.html Self-contained interactive report

Convert between formats with:

knwler graph convert --format gml
knwler graph convert --format graphml
knwler graph convert --format jsonld

JSONLD output can be re-serialized to Turtle, N-Triples, or any other RDF format using rdflib.
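With rdflib that is a parse with format="json-ld" followed by serialize(format="turtle"). As a dependency-free sketch of what such a conversion does, here is a tiny flat JSON-LD node object turned into N-Triples lines by hand (real JSON-LD needs rdflib's full expansion algorithm):

```python
import json

# A minimal, flat JSON-LD node object: absolute IRIs only, no nesting.
doc = json.loads("""
{
  "@id": "http://example.org/entity/EUAIAct",
  "http://schema.org/name": "EU AI Act",
  "http://schema.org/about": {"@id": "http://example.org/topic/AI"}
}
""")

subject = doc.pop("@id")
triples = []
for predicate, obj in doc.items():
    if isinstance(obj, dict) and "@id" in obj:
        # Object is another resource: emit an IRI object.
        triples.append(f"<{subject}> <{predicate}> <{obj['@id']}> .")
    else:
        # Object is a plain value: emit a literal.
        triples.append(f'<{subject}> <{predicate}> "{obj}" .')
```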

See JSONLD / RDF and Visualization.


Graph Database Integration

Import scripts and documented workflows are provided for:

Target Method Docs
Neo4j integrations/neo4j_import.py — creates constraints and indexes Neo4j
SurrealDB integrations/surreal_import.py SurrealDB
GraphDB Import JSONLD into any triple store GraphDB
AWS Neptune SPARQL, OpenCypher, or bulk loading via integrations/aws_import.py Neptune

uv run integrations/neo4j_import.py ./results/graph.json
uv run integrations/surreal_import.py ./results/graph.json
knwler graph convert --format jsonld   # then load into GraphDB / Neptune


Graph Analytics

knwler graph analyze runs a suite of graph-theoretic analyses on a graph.json file and produces a comprehensive HTML report:

  • Most important entities (by degree centrality, betweenness, etc.)
  • Most significant chunks
  • Community / cluster structure
  • Key statistics (node count, edge count, density, etc.)

knwler graph analyze ./results/graph.json
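Degree centrality, for instance, is just normalized node degree. A stdlib sketch over an adjacency list (the real command uses a full graph-analysis stack and made-up entity names are for illustration):

```python
def degree_centrality(adj):
    """Fraction of the other nodes each node is connected to."""
    n = len(adj)
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}

adj = {
    "EU AI Act": ["European Parliament", "High-risk systems", "Providers"],
    "European Parliament": ["EU AI Act"],
    "High-risk systems": ["EU AI Act", "Providers"],
    "Providers": ["EU AI Act", "High-risk systems"],
}
centrality = degree_centrality(adj)
most_central = max(centrality, key=centrality.get)
```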

The report is generated from a Jinja2 template and is fully customizable.

See Graph Analytics.


Batch Processing

For large document sets, use the OpenAI or Gemini batch APIs to reduce cost and processing time:

  • Performs the full extraction pipeline and packages calls into batch requests.
  • Submits batches, polls for completion (default: every 10 minutes), and downloads results automatically.
  • Uses a SQLite database to track state — interrupted runs resume exactly where they left off.
  • Optional --consolidate flag merges all results at the end.

# Start or resume a batch job
knwler batch run --input ./documents/ --output ./results/ --backend openai

# Check job status
knwler batch run --input ./documents/ --output ./results/ --status

# Consolidate results after the batch completes
knwler batch consolidate --input ./documents/ --output ./results/
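The resume behavior can be sketched with sqlite3 (hypothetical table layout; Knwler's actual state schema is internal):

```python
import sqlite3

# Hypothetical job-state table tracking one row per document.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (doc TEXT PRIMARY KEY, status TEXT)")
con.executemany("INSERT INTO jobs VALUES (?, ?)", [
    ("a.pdf", "done"), ("b.pdf", "submitted"), ("c.pdf", "pending"),
])
con.commit()

# On restart, only unfinished documents are picked up again, so an
# interrupted run resumes exactly where it left off.
pending = [row[0] for row in
           con.execute("SELECT doc FROM jobs WHERE status != 'done'")]
```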

See Batch Processing.


Intelligent Caching

Every LLM response, fetched document, parsed PDF, and Wikipedia article is hashed and stored locally under ~/.knwler/cache. The cache is organized into four namespaces:

Namespace Contents
llm All LLM API responses (all backends)
documents Fetched and parsed PDF / DOCX files
webpages Fetched web pages
wikipedia Fetched Wikipedia articles

Re-running with a different template, export format, or backend configuration replays from cache — zero additional API calls and zero cost.
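The cache-key idea can be sketched with hashlib (namespace layout per the table above; the exact key and file format are assumptions):

```python
import hashlib
from pathlib import Path

def cache_path(namespace: str, payload: str,
               root: Path = Path.home() / ".knwler" / "cache") -> Path:
    """Derive a deterministic cache file path from the request payload."""
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return root / namespace / f"{digest}.json"

p1 = cache_path("llm", "prompt: extract entities from chunk 1")
p2 = cache_path("llm", "prompt: extract entities from chunk 1")
# Identical payloads map to the same file, so reruns hit the cache.
```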

knwler cache clear          # clear everything
knwler cache clear --llm    # clear only LLM cache

See Caching.


Data Fetching

Knwler can retrieve documents directly from the web without a manual download step:

# Fetch and parse a remote PDF in one step
knwler fetch url https://example.com/report.pdf --parse

# Fetch and parse a web page
knwler fetch url https://example.com/article --parse

# Fetch a Wikipedia article by topic name
knwler fetch wiki "quantum computing"

# Fetch and open in browser after extraction
knwler fetch wiki "quantum computing" --open

All fetched content is cached so subsequent runs are instantaneous.

See Fetching Data.


Python API

The full pipeline is exposed as an async Python library for embedding in your own applications or notebooks.

import asyncio
from knwler.api import extract_graph          # high-level one-call API
from knwler.chunking import chunk_text
from knwler.config import Config
from knwler.discovery import detect_language, discover_schema
from knwler.extras import rephrase_chunks, extract_title, extract_summary
from knwler.extraction import extract_chunk, extract_all
from knwler.consolidation import consolidate_extracted_graphs

async def main():
    with open("my_document.md", encoding="utf-8") as fh:
        text = fh.read()
    config = Config(backend="openai", openai_api_key="sk-...")

    chunks = chunk_text(text, config)
    lang   = await detect_language(text, config)
    schema = await discover_schema(text, config)
    title  = await extract_title(chunks, config)
    graphs = await extract_all(chunks, schema, config)
    consolidated, elapsed = await consolidate_extracted_graphs(graphs, config)

asyncio.run(main())

Key classes and modules:

Symbol Description
Config Central configuration dataclass accepted by every API function
chunk_text() Split text into overlapping token-based chunks
detect_language() Return ISO 639-1 code for the document’s language
discover_schema() Infer entity and relation types from a document sample
extract_all() Async extraction of all chunks in parallel
consolidate_extracted_graphs() Merge per-chunk graphs into one unified graph
extract_title() Generate a document title from chunks
extract_summary() Generate a document summary from chunks
rephrase_chunks() Rephrase raw chunks for readability

See API documentation.


CLI

The CLI is organized into subcommands. When installed via pipx or pip, use knwler; when running from source, use uv run main.py.

Command Description
extract Extract a knowledge graph from a file, URL, or directory (default)
fetch url Fetch and optionally parse a URL (PDF or web page)
fetch wiki Fetch and optionally parse a Wikipedia article
consolidate Merge multiple knowledge graphs into one
graph convert Convert graph.json to GML, GraphML, JSONLD, etc.
graph analyze Produce an analytical report of the knowledge graph
batch run Batch process a directory using OpenAI or Gemini batch API
batch consolidate Consolidate results from a batch run
cache clear Clear cached LLM responses, documents, or Wikipedia articles
demo Run a quick demo extraction on the Human Rights Declaration
info Show version and configuration info

Commonly used extract flags:

Flag Default Description
--file / -f — Input file path or URL
--directory / --dir — Process all files in a directory
--backend ollama LLM backend to use
--extraction-model backend default Override the extraction model
--discovery-model backend default Override the discovery model
--output ./results/ Output directory
--overwrite false Overwrite existing output instead of creating a new versioned directory
--template default HTML report template
--max-tokens 400 Chunk size in tokens
--html-only false Re-render the HTML report from an existing graph.json
--consolidate false Consolidate after processing a directory

See CLI Reference.


Benchmarking

The built-in benchmark suite lets you compare extraction quality and speed across any combination of backends and models:

uv run main.py benchmark run

The benchmark grid (benchmark/grid.json) is fully configurable. Results are scored using the Knowledge Yield Score (KYS):

\[ \text{KYS} = \sqrt{\frac{\text{graph\_size}}{\max(\text{graph\_size})} \cdot \frac{\min(\text{total\_time})}{\text{total\_time}}} \]

KYS is a geometric mean of normalized quality (graph size) and normalized speed — it cannot be gamed by excelling at only one dimension. Results are rendered as a sortable HTML report.
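Applied to a small result set (the numbers below are made up for illustration), the score works out like this:

```python
import math

# Hypothetical benchmark rows: (model, graph_size, total_time_seconds)
results = [
    ("model-a", 120, 30.0),
    ("model-b", 200, 90.0),
    ("model-c", 150, 45.0),
]

max_size = max(size for _, size, _ in results)
min_time = min(t for _, _, t in results)

def kys(size, t):
    """Geometric mean of normalized graph size and normalized speed."""
    return math.sqrt((size / max_size) * (min_time / t))

scores = {name: kys(size, t) for name, size, t in results}
# model-b has the biggest graph but is slow; model-a is fast with a
# smaller graph and wins overall, showing one dimension can't be gamed.
```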

See Benchmark.


Installation

Knwler requires Python 3.12.

# Recommended: isolated global install
pipx install knwler

# As a project dependency
uv add knwler
pip install knwler

# From source
git clone https://github.com/Orbifold/knwler
cd knwler
uv sync                    # core dependencies only
uv sync --all-groups       # include Neo4j, SurrealDB, data-collection extras

Optional dependency groups:

Group Contents
neo4j Neo4j Python driver
surrealdb SurrealDB Python driver
collect Document and web-page collection utilities

See Setup & Installation and pipx.