Graph JSON Format
Knwler writes its extraction results to a JSON file that you can use for various downstream tasks. The format is deliberately simple and is structured as follows.
Top-level fields
| Field | Type | Description |
|---|---|---|
| id | string (UUID) | Unique identifier for this graph document. |
| title | string | Title of the source document. |
| summary | string | LLM-generated summary of the full document. |
| url | string | URL or file path of the source document. |
| language | string | ISO 639-1 language code of the source text (e.g. en). |
| schema | object | Discovered entity/relation schema; see Schema. |
| stats | array | One entry per extraction run; see Stats. |
| graph | object | Consolidated graph of entities, relations, and communities; see Graph. |
| chunks | array | Source text chunks with per-chunk extraction results; see Chunks. |
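As a minimal sketch of how a consumer might sanity-check a file against the fields listed above, the snippet below verifies that each top-level field is present and has the expected JSON type. The helper name `check_top_level` and the in-memory `sample` are illustrative assumptions, not part of Knwler.

```python
import json

# Expected top-level fields and the Python type each JSON value decodes to,
# as listed in the table above.
TOP_LEVEL_FIELDS = {
    "id": str,
    "title": str,
    "summary": str,
    "url": str,
    "language": str,
    "schema": dict,
    "stats": list,
    "graph": dict,
    "chunks": list,
}


def check_top_level(doc):
    """Return a list of problems; an empty list means all fields look fine."""
    problems = []
    for field, expected in TOP_LEVEL_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(doc[field]).__name__}"
            )
    return problems


# Hypothetical stand-in for a real file; nested sections are left empty here.
sample = json.loads("""
{
  "id": "8a2f11fe-28f8-4e93-9757-332fefefae88",
  "title": "Universal Declaration of Human Rights",
  "summary": "The UDHR proclaims that all human beings possess inherent dignity...",
  "url": "https://knwler.com/pdfs/HumanRights.pdf",
  "language": "en",
  "schema": {},
  "stats": [],
  "graph": {},
  "chunks": []
}
""")

print(check_top_level(sample))  # []
```

For a file on disk, `json.load` on an open file handle would replace the `json.loads` call.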
{
"id": "8a2f11fe-28f8-4e93-9757-332fefefae88",
"title": "Universal Declaration of Human Rights",
"summary": "The UDHR proclaims that all human beings possess inherent dignity...",
"url": "https://knwler.com/pdfs/HumanRights.pdf",
"language": "en",
"schema": { ... },
"stats": [ ... ],
"graph": { ... },
"chunks": [ ... ]
}

Schema
The schema object captures the entity and relation taxonomy that Knwler discovered (or was given) for this document.
| Field | Type | Description |
|---|---|---|
| entity_types | string[] | List of entity type labels used across the graph. |
| relation_types | string[] | List of relation type labels used across the graph. |
| reasoning | string | Explanation of why this particular schema was chosen. |
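The format description does not state whether every label used in the graph is guaranteed to appear in the schema, so a consumer may want to check. The sketch below lists entity and relation types that occur in `graph.entities` and `graph.relations` but are absent from the schema lists. The function name `unknown_types` and the sample data are assumptions made for illustration.

```python
def unknown_types(doc):
    """Collect type labels used in the graph that the schema does not list."""
    schema = doc["schema"]
    graph = doc["graph"]
    used_entity_types = {entity["type"] for entity in graph["entities"]}
    used_relation_types = {relation["type"] for relation in graph["relations"]}
    return {
        "entity_types": sorted(used_entity_types - set(schema["entity_types"])),
        "relation_types": sorted(used_relation_types - set(schema["relation_types"])),
    }


# Hypothetical, heavily trimmed document: only the fields the check needs.
sample_doc = {
    "schema": {
        "entity_types": ["human_right", "legal_document"],
        "relation_types": ["protects"],
    },
    "graph": {
        "entities": [
            {"name": "Universal Declaration of Human Rights", "type": "legal_document"},
            {"name": "human rights", "type": "human_right"},
            {"name": "General Assembly", "type": "organization"},
        ],
        "relations": [
            {
                "source": "Universal Declaration of Human Rights",
                "target": "human rights",
                "type": "protects",
            },
        ],
    },
}

print(unknown_types(sample_doc))
# {'entity_types': ['organization'], 'relation_types': []}
```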
"schema": {
"entity_types": ["human_right", "freedom", "legal_document", "article", ...],
"relation_types": ["grants_right_to", "protects", "limits", ...],
"reasoning": "The UDHR text centers on human rights and freedoms..."
}

Graph
The graph object contains three arrays that together represent the consolidated knowledge graph.
"graph": {
"entities": [ ... ],
"relations": [ ... ],
"communities": [ ... ]
}

Entities
Each item in graph.entities represents a named concept extracted from the document.
| Field | Type | Description |
|---|---|---|
| id | string (UUID) | Unique identifier for this entity. |
| name | string | The entity's canonical name. |
| type | string | Entity type from the schema (e.g. legal_document, person). |
| description | string | LLM-generated description of the entity. |
| chunk_ids | string[] | IDs of the chunks where this entity appears. |
| community_id | integer | ID of the community this entity belongs to. |
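Most downstream tasks start by indexing the entity list. The sketch below builds a lookup by `id` and groups entity names by `type`. The helper `index_entities` is an illustrative name, and the second sample entity (including its all-zero id) is made up for the example.

```python
from collections import defaultdict


def index_entities(entities):
    """Return (entities keyed by id, entity names grouped by type)."""
    by_id = {}
    names_by_type = defaultdict(list)
    for entity in entities:
        by_id[entity["id"]] = entity
        names_by_type[entity["type"]].append(entity["name"])
    return by_id, dict(names_by_type)


entities = [
    {
        "id": "5cca91e1-f3ae-48fd-b6d5-e2e807b45df3",
        "name": "Universal Declaration of Human Rights",
        "type": "legal_document",
        "description": "A proclamation by the General Assembly establishing a common standard...",
        "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
        "community_id": 0,
    },
    {
        # Hypothetical entity with a placeholder id, for illustration only.
        "id": "00000000-0000-0000-0000-000000000001",
        "name": "human rights",
        "type": "human_right",
        "description": "Placeholder description.",
        "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
        "community_id": 0,
    },
]

by_id, names_by_type = index_entities(entities)
print(names_by_type)
# {'legal_document': ['Universal Declaration of Human Rights'], 'human_right': ['human rights']}
```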
{
"id": "5cca91e1-f3ae-48fd-b6d5-e2e807b45df3",
"name": "Universal Declaration of Human Rights",
"type": "legal_document",
"description": "A proclamation by the General Assembly establishing a common standard...",
"chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
"community_id": 0
}

Relations
Each item in graph.relations represents a directed relationship between two entities.
| Field | Type | Description |
|---|---|---|
| id | string (UUID) | Unique identifier for this relation. |
| source | string | Name of the source entity. |
| source_type | string | Type of the source entity. |
| target | string | Name of the target entity. |
| target_type | string | Type of the target entity. |
| type | string | Relation type from the schema (e.g. protects, is_part_of). |
| description | string | LLM-generated description of this relationship. |
| strength | number | Confidence/salience score, typically 1–10. |
| chunk_ids | string[] | IDs of the chunks where this relation was found. |
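Because relations reference their endpoints by name and type rather than by entity id, a natural node key is the `(name, type)` pair. The sketch below builds a directed adjacency map on that key and drops relations below a strength threshold. `build_adjacency` and `min_strength` are illustrative names, and the second sample relation is invented for the example.

```python
from collections import defaultdict


def build_adjacency(relations, min_strength=0.0):
    """Map (source name, source type) -> list of (relation type, target key, strength)."""
    adjacency = defaultdict(list)
    for relation in relations:
        if relation["strength"] < min_strength:
            continue  # skip weak relations
        source = (relation["source"], relation["source_type"])
        target = (relation["target"], relation["target_type"])
        adjacency[source].append((relation["type"], target, relation["strength"]))
    return dict(adjacency)


relations = [
    {
        "source": "Universal Declaration of Human Rights",
        "source_type": "legal_document",
        "target": "human rights",
        "target_type": "human_right",
        "type": "protects",
        "strength": 9.0,
    },
    {
        # Hypothetical weak relation, for illustration only.
        "source": "rule of law",
        "source_type": "legal_principle",
        "target": "human rights",
        "target_type": "human_right",
        "type": "protects",
        "strength": 3.0,
    },
]

adjacency = build_adjacency(relations, min_strength=5.0)
for source, edges in adjacency.items():
    for relation_type, target, strength in edges:
        print(f"{source[0]} -[{relation_type} {strength}]-> {target[0]}")
# Universal Declaration of Human Rights -[protects 9.0]-> human rights
```

The threshold of 5.0 is arbitrary; the right cut-off depends on how the strength scores are distributed in a given file.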
{
"id": "cd593bcf-8b92-45eb-a5b2-4680440786e2",
"source": "Universal Declaration of Human Rights",
"source_type": "legal_document",
"target": "human rights",
"target_type": "human_right",
"type": "protects",
"description": "The Declaration protects and establishes human rights as a common standard.",
"strength": 9.0,
"chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"]
}

Communities
Communities are clusters of closely related entities discovered via graph community detection.
| Field | Type | Description |
|---|---|---|
| id | integer | Community identifier (matches community_id on entities). |
| topics | string[] | Short topic labels summarising the community's theme. |
| description | string | LLM-generated description of the community. |
| members | string[] | Member entity references in "name::type" format. |
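Member references have to be split back into a name and a type before they can be matched against entities. The sketch below does that and reports members for which no entity with the same `community_id` is found. It assumes that type labels never contain `::` (so the split is taken at the last occurrence); both helper names and the sample entity list are illustrative.

```python
def parse_member(ref):
    """Split a "name::type" reference at the last "::"."""
    name, separator, entity_type = ref.rpartition("::")
    if not separator:
        raise ValueError(f"not a name::type reference: {ref!r}")
    return name, entity_type


def members_missing_from_entities(community, entities):
    """Members of the community with no matching entity carrying its community_id."""
    known = {
        (entity["name"], entity["type"])
        for entity in entities
        if entity["community_id"] == community["id"]
    }
    return [ref for ref in community["members"] if parse_member(ref) not in known]


community = {
    "id": 0,
    "members": [
        "General Assembly::organization",
        "human rights::human_right",
        "rule of law::legal_principle",
    ],
}

# Hypothetical, trimmed entity list; "rule of law" is deliberately left out.
entities = [
    {"name": "General Assembly", "type": "organization", "community_id": 0},
    {"name": "human rights", "type": "human_right", "community_id": 0},
]

print(parse_member("General Assembly::organization"))
# ('General Assembly', 'organization')
print(members_missing_from_entities(community, entities))
# ['rule of law::legal_principle']
```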
{
"id": 0,
"topics": ["Universal Human Rights", "UN Framework", "Fundamental Freedoms"],
"description": "The foundational community establishing the UDHR...",
"members": [
"General Assembly::organization",
"human rights::human_right",
"rule of law::legal_principle"
]
}

Chunks
The chunks array contains one entry per text segment that was processed during extraction. Each chunk preserves both the original text and the entities/relations found locally within it.
| Field | Type | Description |
|---|---|---|
| id | string (UUID) | Unique identifier for this chunk (referenced by chunk_ids elsewhere). |
| chunk_idx | integer | Zero-based sequential index of the chunk. |
| text | string | The original source text segment. |
| rephrase | string | A plain-language rephrase of the text, generated by the LLM. |
| entities | array | Entities extracted from this chunk (same fields as graph entities, without id and community_id). |
| relations | array | Relations extracted from this chunk (same fields as graph relations, without id). |
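Since entities and relations carry `chunk_ids`, chunks can serve as provenance. The sketch below resolves those ids to the source text, ordered by `chunk_idx`. The names `chunks_by_id` and `evidence_for` are assumptions, and the second chunk (with its placeholder id and text) is invented for the example.

```python
def chunks_by_id(chunks):
    """Index chunks by their id for quick lookup."""
    return {chunk["id"]: chunk for chunk in chunks}


def evidence_for(item, chunk_lookup):
    """Return the source text of every chunk an entity or relation references.

    Unknown chunk ids are skipped; results are ordered by chunk_idx.
    """
    found = [chunk_lookup[cid] for cid in item["chunk_ids"] if cid in chunk_lookup]
    return [chunk["text"] for chunk in sorted(found, key=lambda c: c["chunk_idx"])]


chunks = [
    {
        "id": "9454ce7f-55db-478b-982f-ccf7defce449",
        "chunk_idx": 0,
        "text": "Universal Declaration of Human Rights\nPreamble\n...",
    },
    {
        # Hypothetical second chunk with a placeholder id and text.
        "id": "00000000-0000-0000-0000-000000000002",
        "chunk_idx": 1,
        "text": "Placeholder text of the next segment.",
    },
]

entity = {
    "name": "Universal Declaration of Human Rights",
    "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
}

lookup = chunks_by_id(chunks)
for text in evidence_for(entity, lookup):
    print(text.splitlines()[0])
# Universal Declaration of Human Rights
```

Note that consecutive chunks may overlap (see overlap_tokens in the run configuration), so concatenating chunk texts is not guaranteed to reproduce the source document exactly.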
{
"id": "9454ce7f-55db-478b-982f-ccf7defce449",
"chunk_idx": 0,
"text": "Universal Declaration of Human Rights\nPreamble\n...",
"rephrase": "Why This Declaration Matters: All people have equal and basic human rights...",
"entities": [ ... ],
"relations": [ ... ]
}

Stats
The stats array holds one object per extraction run, recording timing, token usage, and configuration. Multiple entries accumulate if the document is re-processed.
| Field | Type | Description |
|---|---|---|
| num_chunks | integer | Number of text chunks processed. |
| schema_discovery_time | number | Seconds spent on schema discovery. |
| extraction_wall_time | number | Wall-clock seconds for the extraction phase. |
| consolidation_time | number | Seconds spent merging per-chunk results. |
| total_cpu_time | number | Total CPU seconds consumed. |
| total_time | number | End-to-end elapsed seconds. |
| parallelism | number | Effective parallelism ratio during extraction. |
| throughput_tps | number | Tokens per second throughput. |
| time | object | Per-chunk timing stats: min, max, avg, median. |
| tokens | object | Per-chunk token stats: min, max, avg, total. |
| entities | object | Entity counts: total, avg per chunk. |
| relations | object | Relation counts: total, avg per chunk. |
| community_detection_time | number | Seconds spent on community detection. |
| communities | object | Community stats: count, singletons, largest, size (min/max/avg/median). |
| run | object | Execution context; see the run object below. |
The nested run object captures the configuration used:
| Field | Type | Description |
|---|---|---|
| timestamp | string | ISO-style datetime of the run. |
| file | string | Path to the input file. |
| backend | string | LLM backend used (e.g. anthropic, openai, ollama). |
| extraction_model | string | Model used for chunk-level extraction. |
| discovery_model | string | Model used for schema discovery. |
| max_tokens | integer | Maximum tokens per chunk. |
| overlap_tokens | integer | Token overlap between consecutive chunks. |
| max_concurrent | integer | Maximum concurrent LLM requests. |
| use_cache | boolean | Whether the LLM response cache was enabled. |
| no_discovery | boolean | Whether schema discovery was skipped. |
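To compare runs of the same document, the stats entries and their nested run objects can be condensed into one row per run. The sketch below is a minimal example of that. The function name `summarize_runs`, the derived `seconds_per_chunk` value, and all sample numbers, timestamps, and the model name are assumptions made for illustration.

```python
def summarize_runs(stats):
    """Condense each stats entry into a small summary row."""
    rows = []
    for entry in stats:
        run = entry.get("run", {})
        num_chunks = entry.get("num_chunks")
        total_time = entry.get("total_time")
        # Avoid dividing by zero or by a missing value.
        if num_chunks and total_time is not None:
            seconds_per_chunk = total_time / num_chunks
        else:
            seconds_per_chunk = None
        rows.append(
            {
                "timestamp": run.get("timestamp"),
                "backend": run.get("backend"),
                "extraction_model": run.get("extraction_model"),
                "num_chunks": num_chunks,
                "seconds_per_chunk": seconds_per_chunk,
            }
        )
    return rows


# Hypothetical stats array with two runs; every value here is invented.
stats = [
    {
        "num_chunks": 4,
        "total_time": 12.0,
        "run": {
            "timestamp": "2025-01-01T00:00:00",
            "backend": "ollama",
            "extraction_model": "example-model",
        },
    },
    {
        "num_chunks": 8,
        "total_time": 16.0,
        "run": {
            "timestamp": "2025-01-02T00:00:00",
            "backend": "ollama",
            "extraction_model": "example-model",
        },
    },
]

for row in summarize_runs(stats):
    print(row["timestamp"], row["num_chunks"], row["seconds_per_chunk"])
# 2025-01-01T00:00:00 4 3.0
# 2025-01-02T00:00:00 8 2.0
```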