Graph JSON Format

Knwler exports a JSON file that you can use for various downstream tasks. The format is deliberately simple and is structured as follows.

Top-level fields

| Field | Type | Description |
| --- | --- | --- |
| id | string (UUID) | Unique identifier for this graph document. |
| title | string | Title of the source document. |
| summary | string | LLM-generated summary of the full document. |
| url | string | URL or file path of the source document. |
| language | string | ISO 639-1 language code of the source text (e.g. en). |
| schema | object | Discovered entity/relation schema; see Schema. |
| stats | array | One entry per extraction run; see Stats. |
| graph | object | Consolidated graph of entities, relations, and communities; see Graph. |
| chunks | array | Source text chunks with per-chunk extraction results; see Chunks. |

{
  "id": "8a2f11fe-28f8-4e93-9757-332fefefae88",
  "title": "Universal Declaration of Human Rights",
  "summary": "The UDHR proclaims that all human beings possess inherent dignity...",
  "url": "https://knwler.com/pdfs/HumanRights.pdf",
  "language": "en",
  "schema": { ... },
  "stats": [ ... ],
  "graph": { ... },
  "chunks": [ ... ]
}
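
As a minimal sketch of how a consumer might read this file, the snippet below loads the JSON and reports which of the documented top-level fields are absent. The helper names `load_graph` and `missing_keys` and the file name `graph.json` are illustrative assumptions and are not part of Knwler.

```python
import json
from pathlib import Path

# Top-level keys as listed in the table above.
EXPECTED_KEYS = {
    "id", "title", "summary", "url", "language",
    "schema", "stats", "graph", "chunks",
}


def load_graph(path):
    """Read a graph JSON file and return it as a dict (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def missing_keys(doc):
    """Return the documented top-level keys that are absent from doc, sorted."""
    return sorted(EXPECTED_KEYS - set(doc))


# Usage, assuming a file named graph.json exists in the working directory:
# doc = load_graph("graph.json")
# print(doc["title"], missing_keys(doc))
```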

Schema

The schema object captures the entity and relation taxonomy that Knwler discovered (or was given) for this document.

| Field | Type | Description |
| --- | --- | --- |
| entity_types | string[] | List of entity type labels used across the graph. |
| relation_types | string[] | List of relation type labels used across the graph. |
| reasoning | string | Explanation of why this particular schema was chosen. |

"schema": {
  "entity_types": ["human_right", "freedom", "legal_document", "article", ...],
  "relation_types": ["grants_right_to", "protects", "limits", ...],
  "reasoning": "The UDHR text centers on human rights and freedoms..."
}
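
Since the schema lists the type labels used across the graph, one possible sanity check is to compare it against the types that actually occur on entities and relations. The sketch below assumes `doc` is the parsed top-level object; the function name `undeclared_types` is hypothetical, and whether Knwler guarantees this consistency is not stated here.

```python
def undeclared_types(doc):
    """Return (entity_types, relation_types) used in the graph but not listed in the schema."""
    schema = doc["schema"]
    graph = doc["graph"]
    # Collect the type labels that actually occur in the consolidated graph.
    used_entity_types = {e["type"] for e in graph["entities"]}
    used_relation_types = {r["type"] for r in graph["relations"]}
    return (
        sorted(used_entity_types - set(schema["entity_types"])),
        sorted(used_relation_types - set(schema["relation_types"])),
    )
```

Two empty lists mean every type in the graph is declared in the schema.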

Graph

The graph object contains three arrays that together represent the consolidated knowledge graph.

"graph": {
  "entities": [ ... ],
  "relations": [ ... ],
  "communities": [ ... ]
}

Entities

Each item in graph.entities represents a named concept extracted from the document.

| Field | Type | Description |
| --- | --- | --- |
| id | string (UUID) | Unique identifier for this entity. |
| name | string | The entity’s canonical name. |
| type | string | Entity type from the schema (e.g. legal_document, person). |
| description | string | LLM-generated description of the entity. |
| chunk_ids | string[] | IDs of the chunks where this entity appears. |
| community_id | integer | ID of the community this entity belongs to. |

{
  "id": "5cca91e1-f3ae-48fd-b6d5-e2e807b45df3",
  "name": "Universal Declaration of Human Rights",
  "type": "legal_document",
  "description": "A proclamation by the General Assembly establishing a common standard...",
  "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"],
  "community_id": 0
}
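
A common first step with the entity list is to group it by type. The following sketch assumes `entities` is the list found at graph.entities; the function name `entities_by_type` is illustrative only.

```python
from collections import defaultdict


def entities_by_type(entities):
    """Map each entity type to the names of the entities of that type, in input order."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["type"]].append(entity["name"])
    return dict(groups)
```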

Relations

Each item in graph.relations represents a directed relationship between two entities.

| Field | Type | Description |
| --- | --- | --- |
| id | string (UUID) | Unique identifier for this relation. |
| source | string | Name of the source entity. |
| source_type | string | Type of the source entity. |
| target | string | Name of the target entity. |
| target_type | string | Type of the target entity. |
| type | string | Relation type from the schema (e.g. protects, is_part_of). |
| description | string | LLM-generated description of this relationship. |
| strength | number | Confidence/salience score, typically 1–10. |
| chunk_ids | string[] | IDs of the chunks where this relation was found. |

{
  "id": "cd593bcf-8b92-45eb-a5b2-4680440786e2",
  "source": "Universal Declaration of Human Rights",
  "source_type": "legal_document",
  "target": "human rights",
  "target_type": "human_right",
  "type": "protects",
  "description": "The Declaration protects and establishes human rights as a common standard.",
  "strength": 9.0,
  "chunk_ids": ["9454ce7f-55db-478b-982f-ccf7defce449"]
}
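
Relations refer to their endpoints by name and type rather than by entity id, so resolving a relation to full entity records needs a lookup on that pair. The sketch below assumes the (name, type) pair is unique within graph.entities, which this format description does not state explicitly; `resolve_relations` is a hypothetical helper.

```python
def resolve_relations(entities, relations):
    """Yield (source_entity, relation, target_entity) triples.

    An endpoint is None when no entity matches its (name, type) pair.
    """
    # Assumption: (name, type) identifies at most one entity.
    by_key = {(e["name"], e["type"]): e for e in entities}
    for rel in relations:
        source = by_key.get((rel["source"], rel["source_type"]))
        target = by_key.get((rel["target"], rel["target_type"]))
        yield source, rel, target
```

Yielding None for an unmatched endpoint, instead of raising, lets the caller decide whether dangling references should be skipped or reported.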

Communities

Communities are clusters of closely related entities discovered via graph community detection.

| Field | Type | Description |
| --- | --- | --- |
| id | integer | Community identifier (matches community_id on entities). |
| topics | string[] | Short topic labels summarising the community’s theme. |
| description | string | LLM-generated description of the community. |
| members | string[] | Member entity references in "name::type" format. |

{
  "id": 0,
  "topics": ["Universal Human Rights", "UN Framework", "Fundamental Freedoms"],
  "description": "The foundational community establishing the UDHR...",
  "members": [
    "General Assembly::organization",
    "human rights::human_right",
    "rule of law::legal_principle"
  ]
}
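
Member references in the "name::type" format can be split back into their two parts. The sketch below splits on the last "::", on the assumption that type labels never contain "::" themselves while names might; both function names are illustrative.

```python
def parse_member(ref):
    """Split a "name::type" member reference into a (name, type) tuple."""
    # rpartition splits on the LAST "::", so a name containing "::" stays intact.
    name, sep, type_label = ref.rpartition("::")
    if not sep:
        raise ValueError(f"not a name::type reference: {ref!r}")
    return name, type_label


def community_members(community):
    """Return the members of one community object as (name, type) tuples."""
    return [parse_member(ref) for ref in community["members"]]
```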

Chunks

The chunks array contains one entry per text segment that was processed during extraction. Each chunk preserves both the original text and the entities/relations found locally within it.

| Field | Type | Description |
| --- | --- | --- |
| id | string (UUID) | Unique identifier for this chunk (referenced by chunk_ids elsewhere). |
| chunk_idx | integer | Zero-based sequential index of the chunk. |
| text | string | The original source text segment. |
| rephrase | string | A plain-language rephrase of the text, generated by the LLM. |
| entities | array | Entities extracted from this chunk (same fields as graph entities, without id and community_id). |
| relations | array | Relations extracted from this chunk (same fields as graph relations, without id). |

{
  "id": "9454ce7f-55db-478b-982f-ccf7defce449",
  "chunk_idx": 0,
  "text": "Universal Declaration of Human Rights\nPreamble\n...",
  "rephrase": "Why This Declaration Matters: All people have equal and basic human rights...",
  "entities": [ ... ],
  "relations": [ ... ]
}
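
Because entities and relations carry chunk_ids, the source passages behind any graph item can be recovered from the chunks array. A minimal sketch, with `chunk_texts_for` as a hypothetical helper that silently skips ids it cannot find:

```python
def chunk_texts_for(item, chunks):
    """Return the source texts for an entity or relation, ordered by chunk_idx."""
    by_id = {chunk["id"]: chunk for chunk in chunks}
    # Keep only the ids that resolve to a chunk; unknown ids are ignored.
    found = [by_id[cid] for cid in item["chunk_ids"] if cid in by_id]
    return [chunk["text"] for chunk in sorted(found, key=lambda c: c["chunk_idx"])]
```

Sorting by chunk_idx returns the passages in document order, independent of the order in which the ids are stored on the item.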

Stats

The stats array holds one object per extraction run, recording timing, token usage, and configuration. Multiple entries accumulate if the document is re-processed.

| Field | Type | Description |
| --- | --- | --- |
| num_chunks | integer | Number of text chunks processed. |
| schema_discovery_time | number | Seconds spent on schema discovery. |
| extraction_wall_time | number | Wall-clock seconds for the extraction phase. |
| consolidation_time | number | Seconds spent merging per-chunk results. |
| total_cpu_time | number | Total CPU seconds consumed. |
| total_time | number | End-to-end elapsed seconds. |
| parallelism | number | Effective parallelism ratio during extraction. |
| throughput_tps | number | Tokens per second throughput. |
| time | object | Per-chunk timing stats: min, max, avg, median. |
| tokens | object | Per-chunk token stats: min, max, avg, total. |
| entities | object | Entity counts: total, avg per chunk. |
| relations | object | Relation counts: total, avg per chunk. |
| community_detection_time | number | Seconds spent on community detection. |
| communities | object | Community stats: count, singletons, largest, size (min/max/avg/median). |
| run | object | Execution context; see below. |

The nested run object captures the configuration used:

| Field | Type | Description |
| --- | --- | --- |
| timestamp | string | ISO-style datetime of the run. |
| file | string | Path to the input file. |
| backend | string | LLM backend used (e.g. anthropic, openai, ollama). |
| extraction_model | string | Model used for chunk-level extraction. |
| discovery_model | string | Model used for schema discovery. |
| max_tokens | integer | Maximum tokens per chunk. |
| overlap_tokens | integer | Token overlap between consecutive chunks. |
| max_concurrent | integer | Maximum concurrent LLM requests. |
| use_cache | boolean | Whether the LLM response cache was enabled. |
| no_discovery | boolean | Whether schema discovery was skipped. |
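
To inspect the most recent run, one option is to take the last element of the stats array and pull a few fields from it and its nested run object. This sketch assumes runs are appended in chronological order, which is an assumption rather than something this description guarantees; `latest_run_summary` is a hypothetical helper.

```python
def latest_run_summary(stats):
    """Return a few headline fields of the last stats entry, or None if there are no runs."""
    if not stats:
        return None
    latest = stats[-1]  # assumption: entries are appended in chronological order
    run = latest.get("run", {})
    return {
        "timestamp": run.get("timestamp"),
        "backend": run.get("backend"),
        "extraction_model": run.get("extraction_model"),
        "num_chunks": latest.get("num_chunks"),
        "total_time": latest.get("total_time"),
    }
```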