Knwler is an enterprise document intelligence platform that uses large language models (LLMs) to extract structured knowledge graphs from documents. It identifies entities, relationships, and topics in PDFs, text files, and Markdown, and produces interactive reports and exports for graph analytics platforms.

Can Knwler run fully on-premise?

Yes. Knwler supports fully air-gapped operation via Ollama with open-weight models. No data leaves your infrastructure, making it suitable for classified and regulated environments with strict data sovereignty requirements.

What languages does Knwler support?

Knwler auto-detects document language and supports English, German, French, Spanish, and Dutch out of the box. Additional languages can be added by extending a single configuration file.

What export formats are available?

Knwler exports to JSON, GML, GraphML, and interactive HTML. These exports can be imported directly into Neo4j, Gephi, yEd, Memgraph, or SurrealDB, or used to generate vector embeddings for semantic search.

How much does it cost to process a document?

With a cloud LLM (OpenAI GPT-4o), processing costs approximately $0.20 per 20-page document. Running on-premise with local models incurs no per-document fees after initial setup. Because LLM responses are cached, re-runs incur no additional LLM cost.
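
For budgeting, the figure of $0.20 per 20-page document works out to roughly $0.01 per page. The sketch below extrapolates that rate to a larger corpus. It assumes cost scales linearly with page count, which the figure above does not guarantee (actual cost depends on token counts and provider pricing), and the function name is illustrative only.

```python
# Rough cloud-cost estimate based on the quoted figure of about
# $0.20 per 20-page document. Linear scaling is an assumption.
COST_PER_DOCUMENT_USD = 0.20
PAGES_PER_DOCUMENT = 20


def estimate_cloud_cost(total_pages):
    """Return an approximate cloud LLM cost in USD for `total_pages` pages."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return total_pages * COST_PER_DOCUMENT_USD / PAGES_PER_DOCUMENT
```

Under this assumption, a 1,000-page corpus would come to about $10 for a first run.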

Graph JSON Format

Knwler writes its extraction results to a JSON file that you can use for various downstream tasks. The format is deliberately simple and is structured as follows.

Top-level fields

Field	Type	Description
`id`	string (UUID)	Unique identifier for this graph document.
`title`	string	Title of the source document.
`summary`	string	LLM-generated summary of the full document.
`url`	string	URL or file path of the source document.
`language`	string	ISO 639-1 language code of the source text (e.g. `en`).
`schema`	object	Discovered entity/relation schema — see Schema.
`stats`	array	One entry per extraction run — see Stats.
`graph`	object	Consolidated graph of entities, relations, and communities — see Graph.
`chunks`	array	Source text chunks with per-chunk extraction results — see Chunks.

{
  "id": "8a2f11fe-28f8-4e93-9757-332fefefae88",
  "title": "Universal Declaration of Human Rights",
  "summary": "The UDHR proclaims that all human beings possess inherent dignity...",
  "url": "https://knwler.com/pdfs/HumanRights.pdf",
  "language": "en",
  "schema": { ... },
  "stats": [ ... ],
  "graph": { ... },
  "chunks": [ ... ]
}
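
A minimal sketch of reading such a file in Python, assuming the export contains all nine top-level fields from the table above. The helper names `parse_graph_document` and `describe` are illustrative and not part of Knwler.

```python
import json

# Top-level fields listed in the table above.
TOP_LEVEL_FIELDS = ("id", "title", "summary", "url", "language",
                    "schema", "stats", "graph", "chunks")


def parse_graph_document(text):
    """Parse the JSON text of an export and check its top-level fields."""
    doc = json.loads(text)
    missing = [field for field in TOP_LEVEL_FIELDS if field not in doc]
    if missing:
        raise ValueError("missing top-level fields: " + ", ".join(missing))
    return doc


def describe(doc):
    """Return a one-line summary of a parsed graph document."""
    graph = doc["graph"]
    return (f'{doc["title"]} [{doc["language"]}]: '
            f'{len(graph["entities"])} entities, '
            f'{len(graph["relations"])} relations, '
            f'{len(doc["chunks"])} chunks')
```

To use it, read the exported file as UTF-8 text, pass the contents to `parse_graph_document`, and print the result of `describe`.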

Schema

The schema object captures the entity and relation taxonomy that Knwler discovered (or was given) for this document.

Field	Type	Description
`entity_types`	string[]	List of entity type labels used across the graph.
`relation_types`	string[]	List of relation type labels used across the graph.
`reasoning`	string	Explanation of why this particular schema was chosen.

"schema": {
  "entity_types": ["human_right", "freedom", "legal_document", "article", ...],
  "relation_types": ["grants_right_to", "protects", "limits", ...],
  "reasoning": "The UDHR text centers on human rights and freedoms..."
}
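
Because the schema lists the type labels used across the graph, it can serve as a quick sanity check on an export. The sketch below reports any entity or relation type that appears in the graph but not in the schema. It assumes `doc` is the parsed JSON document, and `undeclared_types` is a hypothetical helper name.

```python
def undeclared_types(doc):
    """Return (entity_types, relation_types) used in the graph but absent from the schema."""
    schema = doc["schema"]
    graph = doc["graph"]
    declared_entities = set(schema["entity_types"])
    declared_relations = set(schema["relation_types"])
    used_entities = {entity["type"] for entity in graph["entities"]}
    used_relations = {relation["type"] for relation in graph["relations"]}
    return (sorted(used_entities - declared_entities),
            sorted(used_relations - declared_relations))
```

Two empty lists mean every type label in the graph is covered by the schema.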

Graph

The graph object contains three arrays that together represent the consolidated knowledge graph.

"graph": {
  "entities": [ ... ],
  "relations": [ ... ],
  "communities": [ ... ]
}

Entities

Each item in graph.entities represents a named concept extracted from the document.

Field	Type	Description
`id`	string (UUID)	Unique identifier for this entity.
`name`	string	The entity’s canonical name.
`type`	string	Entity type from the schema (e.g. `legal_document`, `person`).
`description`	string	LLM-generated description of the entity.
`chunk_ids`	string[]	IDs of the chunks where this entity appears.
`community_id`	integer	ID of the community this entity belongs to.

{
  "id": "5cca91e1-f3ae-48fd-b6d5-e2e807b45df3",
  "name": "Universal Declaration of Human Rights",
  "type": "legal_document",
  "description": "A proclamation by the General Assembly establishing a common standard...",
  "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
  "community_id": 0
}
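
Downstream code often needs to look entities up by name or list them by type. The following sketch builds both indexes from `graph.entities`. It assumes that the pair of `name` and `type` identifies an entity uniquely within the consolidated graph, which this format description does not state explicitly. `index_entities` is an illustrative name.

```python
from collections import defaultdict


def index_entities(entities):
    """Index entities by (name, type) and group entity names by type."""
    by_key = {}
    by_type = defaultdict(list)
    for entity in entities:
        # Assumption: (name, type) is unique; a later duplicate would overwrite.
        by_key[(entity["name"], entity["type"])] = entity
        by_type[entity["type"]].append(entity["name"])
    return by_key, dict(by_type)
```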

Relations

Each item in graph.relations represents a directed relationship between two entities.

Field	Type	Description
`id`	string (UUID)	Unique identifier for this relation.
`source`	string	Name of the source entity.
`source_type`	string	Type of the source entity.
`target`	string	Name of the target entity.
`target_type`	string	Type of the target entity.
`type`	string	Relation type from the schema (e.g. `protects`, `is_part_of`).
`description`	string	LLM-generated description of this relationship.
`strength`	number	Confidence/salience score, typically 1–10.
`chunk_ids`	string[]	IDs of the chunks where this relation was found.

{
  "id": "cd593bcf-8b92-45eb-a5b2-4680440786e2",
  "source": "Universal Declaration of Human Rights",
  "source_type": "legal_document",
  "target": "human rights",
  "target_type": "human_right",
  "type": "protects",
  "description": "The Declaration protects and establishes human rights as a common standard.",
  "strength": 9.0,
  "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"]
}
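
Since relations refer to their endpoints by name and type rather than by entity `id`, a directed adjacency structure can be built from `graph.relations` alone. The sketch below does this with the standard library and optionally drops weak relations. The `min_strength` parameter and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(relations, min_strength=0.0):
    """Map each (source, source_type) to its outgoing (type, target_key, strength) edges."""
    adjacency = defaultdict(list)
    for relation in relations:
        if relation["strength"] < min_strength:
            continue  # skip relations below the chosen strength threshold
        source_key = (relation["source"], relation["source_type"])
        target_key = (relation["target"], relation["target_type"])
        adjacency[source_key].append(
            (relation["type"], target_key, relation["strength"]))
    return dict(adjacency)
```

Keying nodes on the name and type pair mirrors how relations identify their endpoints, so no lookup against entity UUIDs is needed.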

Communities

Communities are clusters of closely related entities discovered via graph community detection.

Field	Type	Description
`id`	integer	Community identifier (matches `community_id` on entities).
`topics`	string[]	Short topic labels summarising the community’s theme.
`description`	string	LLM-generated description of the community.
`members`	string[]	Member entity references in `"name::type"` format.

{
  "id": 0,
  "topics": ["Universal Human Rights", "UN Framework", "Fundamental Freedoms"],
  "description": "The foundational community establishing the UDHR...",
  "members": [
    "General Assembly::organization",
    "human rights::human_right",
    "rule of law::legal_principle"
  ]
}
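
Member references have to be split back into a name and a type before they can be matched against entities. A minimal sketch follows, assuming that type labels never contain `::` (so the split is made at the last occurrence, which tolerates names that do). Both helper names are hypothetical.

```python
def parse_member(ref):
    """Split a "name::type" member reference into (name, type)."""
    name, separator, entity_type = ref.rpartition("::")
    if not separator:
        raise ValueError(f"not a name::type reference: {ref!r}")
    return name, entity_type


def community_members(community):
    """Return the members of one community as (name, type) tuples."""
    return [parse_member(ref) for ref in community["members"]]
```

For the example above, `community_members` would yield pairs such as `("General Assembly", "organization")`.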

Chunks

The chunks array contains one entry per text segment that was processed during extraction. Each chunk preserves both the original text and the entities/relations found locally within it.

Field	Type	Description
`id`	string (UUID)	Unique identifier for this chunk (referenced by `chunk_ids` elsewhere).
`chunk_idx`	integer	Zero-based sequential index of the chunk.
`text`	string	The original source text segment.
`rephrase`	string	A plain-language rephrase of the text, generated by the LLM.
`entities`	array	Entities extracted from this chunk (same fields as graph entities, without `id` and `community_id`).
`relations`	array	Relations extracted from this chunk (same fields as graph relations, without `id`).

{
  "id": "9454ce7f-55db-478b-982f-ccf7defce449",
  "chunk_idx": 0,
  "text": "Universal Declaration of Human Rights\nPreamble\n...",
  "rephrase": "Why This Declaration Matters: All people have equal and basic human rights...",
  "entities": [ ... ],
  "relations": [ ... ]
}
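
The `chunk_ids` on entities and relations make it possible to trace a graph element back to the passages it came from. The sketch below resolves those IDs against the `chunks` array and returns short excerpts in document order. `supporting_passages` and `max_chars` are illustrative names, and unknown chunk IDs are silently skipped as a design choice of this sketch.

```python
def supporting_passages(doc, item, max_chars=200):
    """Return (chunk_idx, excerpt) pairs for the chunks referenced by an entity or relation."""
    chunks_by_id = {chunk["id"]: chunk for chunk in doc["chunks"]}
    found = [chunks_by_id[chunk_id]
             for chunk_id in item["chunk_ids"]
             if chunk_id in chunks_by_id]
    # Present the passages in the order they occur in the source document.
    found.sort(key=lambda chunk: chunk["chunk_idx"])
    return [(chunk["chunk_idx"], chunk["text"][:max_chars]) for chunk in found]
```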

Stats

The stats array holds one object per extraction run, recording timing, token usage, and configuration. Multiple entries accumulate if the document is re-processed.

Field	Type	Description
`num_chunks`	integer	Number of text chunks processed.
`schema_discovery_time`	number	Seconds spent on schema discovery.
`extraction_wall_time`	number	Wall-clock seconds for the extraction phase.
`consolidation_time`	number	Seconds spent merging per-chunk results.
`total_cpu_time`	number	Total CPU seconds consumed.
`total_time`	number	End-to-end elapsed seconds.
`parallelism`	number	Effective parallelism ratio during extraction.
`throughput_tps`	number	Tokens per second throughput.
`time`	object	Per-chunk timing stats: `min`, `max`, `avg`, `median`.
`tokens`	object	Per-chunk token stats: `min`, `max`, `avg`, `total`.
`entities`	object	Entity counts: `total`, `avg` per chunk.
`relations`	object	Relation counts: `total`, `avg` per chunk.
`community_detection_time`	number	Seconds spent on community detection.
`communities`	object	Community stats: `count`, `singletons`, `largest`, `size` (min/max/avg/median).
`run`	object	Execution context — see below.

The nested run object captures the configuration used:

Field	Type	Description
`timestamp`	string	ISO-style datetime of the run.
`file`	string	Path to the input file.
`backend`	string	LLM backend used (e.g. `anthropic`, `openai`, `ollama`).
`extraction_model`	string	Model used for chunk-level extraction.
`discovery_model`	string	Model used for schema discovery.
`max_tokens`	integer	Maximum tokens per chunk.
`overlap_tokens`	integer	Token overlap between consecutive chunks.
`max_concurrent`	integer	Maximum concurrent LLM requests.
`use_cache`	boolean	Whether the LLM response cache was enabled.
`no_discovery`	boolean	Whether schema discovery was skipped.
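
Since `stats` accumulates one entry per run, it can be flattened into a small table for comparing runs. The sketch below picks a handful of the fields described above. It uses `dict.get` throughout because this description does not say whether every field is present in every entry, and `summarise_runs` is an illustrative name.

```python
def summarise_runs(stats):
    """Return one flat summary row per extraction run."""
    rows = []
    for entry in stats:
        run = entry.get("run", {})
        rows.append({
            "timestamp": run.get("timestamp"),
            "backend": run.get("backend"),
            "extraction_model": run.get("extraction_model"),
            "num_chunks": entry.get("num_chunks"),
            "total_time": entry.get("total_time"),
            "total_tokens": entry.get("tokens", {}).get("total"),
        })
    return rows
```

Each row can then be printed or loaded into a spreadsheet to compare backends, models, and timings across re-processing runs.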