tl;dr
We’re introducing a first-of-its-kind AI chunking mode to the semchunk semantic chunking algorithm, leveraging our recently released enrichment and hierarchical segmentation model, Kanon 2 Enricher.
On Legal RAG QA, semchunk’s AI chunking mode delivers a 6% increase in RAG correctness over its non-AI chunking mode, 8% over LangChain’s recursive chunking algorithm, 12% over naïve fixed-size chunking, and 15% over chonkie’s recursive and embedding-powered chunking modes, demonstrating the significant impact the choice of chunking algorithm can have on downstream RAG performance.
To get started integrating our new AI chunking mode into your own applications, you can install the latest version of semchunk by following the instructions in our README.
Why chunking matters
Retrieval-augmented generation (RAG) is the technique of feeding an LLM new information from outside sources in order to ground its responses.
Because LLMs have a finite number of tokens that they can process in a single call, RAG systems often need to break large inputs up into smaller parts and only feed the most relevant parts to LLMs, thereby avoiding overloading them with information.
For example, if you ask a legal RAG system a question like, ‘What is the punishment for drink driving in California?’, an LLM is unlikely to be fed the entire California Vehicle Code at once. Instead, a backend retrieval engine will end up giving the LLM the specific section of the statute governing drink driving. This not only saves time by sparing the LLM from having to process a massive amount of information but also improves accuracy by allowing it to focus on the specific information at issue.
The process of breaking large inputs up into smaller parts is called chunking.
Chunking can be done simply by breaking up an input at a fixed interval, say, at every 100 words. This approach is extremely fast and, in some cases, works fairly well. A key downside, however, is that you often end up breaking right in the middle of important context.
The sentence, ‘Today, you may have won; I’ll win tomorrow’, when broken up into the 4-word chunks, ‘Today, you may have’ and ‘won; I’ll win tomorrow’, yields chunks that individually don’t convey as much information as they could. Assuming a 4-word maximum chunk size, a better split would be ‘Today’, ‘you may have won’, and ‘I’ll win tomorrow’. Although technically less space-efficient (as now there are three instead of two chunks), those three chunks end up individually conveying an extra piece of information, that ‘you may have won’ as well as that ‘I’ll win tomorrow’.
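The difference is easy to see in code. Below is a toy illustration (not any library’s actual implementation): a naive fixed-size splitter alongside a boundary-aware splitter that breaks at commas and semicolons first, reproducing the two splits described above.

```python
import re

def fixed_chunks(text: str, max_words: int) -> list[str]:
    """Naive fixed-size chunking: break after every `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def clause_chunks(text: str, max_words: int) -> list[str]:
    """Boundary-aware chunking: split at clause boundaries (commas and
    semicolons) first, falling back to fixed-size splitting only for
    clauses that are still too long."""
    chunks = []
    for clause in (c.strip() for c in re.split(r"[,;]", text)):
        if clause:
            chunks.extend(fixed_chunks(clause, max_words))
    return chunks

sentence = "Today, you may have won; I'll win tomorrow"
print(fixed_chunks(sentence, 4))   # ['Today, you may have', "won; I'll win tomorrow"]
print(clause_chunks(sentence, 4))  # ['Today', 'you may have won', "I'll win tomorrow"]
```

The boundary-aware version pays a small space cost (three chunks instead of two) in exchange for chunks that each carry a complete thought.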
When LLMs are context-constrained and exposed to complex user queries demanding highly precise search results, unexpected interruptions in the middle of important context can mean the difference between locating the right answer and hallucinating a false answer. Indeed, as we show later, in extreme cases, optimal correctness can be reduced by double digits when adopting fixed-size chunking.
Consequently, a new class of chunking approaches has emerged known as ‘semantic chunking’. These approaches, although varied in sophistication, share the same goal: to maximize the amount of semantic information contained within any given chunk while minimizing interruptions in the middle of important context. As we show later in this blog post, many semantic chunking algorithms are able to deliver material improvements to RAG correctness simply by improving the semantic meaningfulness of chunks.
How it works
semchunk
semchunk is a semantic chunking algorithm that exploits common typographical patterns in predominantly Latin-script documents in order to preserve syntactic and semantic divisions within chunks.
semchunk works by recursively splitting texts at increasingly structurally granular splitter sequences until all chunks are less than or equal to a given chunk size. More specifically, as of version 4.0.0, semchunk:
- splits text into chunks using the least structurally granular splitter sequence possible;
- recursively splits chunks exceeding the chunk size until a set of chunks less than or equal to the given chunk size is reached;
- merges chunks under the chunk size into a single chunk until that chunk equals the chunk size or adding a subsequent chunk would cause it to exceed the chunk size;
- reattaches any non-whitespace splitter sequences to the ends of the chunks they followed (barring the final chunk) where doing so does not push a chunk over the chunk size, otherwise adding those splitters as their own chunks; and
- filters out chunks consisting entirely of whitespace.
semchunk uses the following splitters, in order of precedence from least to most structurally granular:
- the largest sequence of newlines and/or carriage returns;
- the largest sequence of tabs;
- the largest sequence of whitespace characters or, if that sequence is only a single character long and there exist whitespace characters preceded by any of the non-whitespace characters listed below, only those specific whitespace characters, taken in the same order of precedence as their preceding non-whitespace characters are listed;
- sentence terminators (in order of precedence: .?!);
- clause separators (in order of precedence: ;, ()[]“”‘’'”`*);
- sentence interrupters (in order of precedence: :—…);
- word joiners (in order of precedence: /\–&-); and
- all other characters.
By prioritizing whitespace ahead of punctuation marks used to terminate sentences such as periods and question marks, which in turn are prioritized ahead of punctuation typically used to separate dependent clauses such as colons and asterisks, semchunk often ends up producing chunks that keep whole paragraphs and sentences intact, all while remaining relatively computationally inexpensive.
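To make the split–recurse–merge loop concrete, here is a simplified sketch in the spirit of semchunk’s algorithm. It is not semchunk’s actual implementation: it uses a much smaller splitter taxonomy, measures chunk size in characters rather than tokens, and approximates splitter reattachment by rejoining merged pieces with the splitter.

```python
# Least to most structurally granular (a tiny stand-in for semchunk's taxonomy).
SPLITTERS = ["\n\n", "\n", ". ", "; ", " "]

def chunk(text: str, chunk_size: int) -> list[str]:
    """Recursively split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []  # filter whitespace-only chunks
    # Pick the least granular splitter actually present in the text.
    splitter = next((s for s in SPLITTERS if s in text), None)
    if splitter is None:
        # No splitter available: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Split, then recurse on any piece still exceeding the chunk size.
    pieces = [p for part in text.split(splitter) for p in chunk(part, chunk_size)]
    # Greedily merge undersized neighbors back together, rejoining with the
    # splitter (a rough approximation of semchunk's reattachment step).
    merged: list[str] = []
    buffer = ""
    for piece in pieces:
        candidate = f"{buffer}{splitter}{piece}" if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate
        else:
            merged.append(buffer)
            buffer = piece
    if buffer:
        merged.append(buffer)
    return merged

print(chunk("Today, you may have won; I'll win tomorrow", 20))
# ['Today, you may have', 'won', "I'll win tomorrow"]
```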
Where semchunk struggles is in respecting transtextual structures such as sections within a statute, chapters within a book, and even simple headings. A workaround could be to extend semchunk’s taxonomy of structurally important splitters to detect transtextual structures; however, modeling all such structures with heuristics alone is non-trivial.
This is where AI chunking and Kanon 2 Enricher come in.
Kanon 2 Enricher
Kanon 2 Enricher is an enrichment and hierarchical segmentation model. It transforms unstructured documents into rich, highly structured, hierarchical knowledge graphs represented in the Isaacus Legal Graph Schema (ILGS). Given an arbitrary document, Kanon 2 Enricher will extract references to entities within it (e.g., people, organizations, locations, dates, citations, defined terms, etc.), annotate it (e.g., headings, tables, signatures, cross-references, etc.), and segment it into its full hierarchical structure (e.g., chapters, divisions, sections, clauses).
Unlike a generative model, Kanon 2 Enricher doesn’t generate annotations token by token but instead directly annotates all of the tokens in a document into ILGS in a single shot. This makes it architecturally incapable of hallucinating new content or otherwise producing schematically invalid graphs, all while remaining orders of magnitude faster than your typical LLM.
Uniquely, all of the textual annotations produced by Kanon 2 Enricher are globally well-nested and laminar. No two spans in an ILGS graph, whether they correspond to a person’s name or an entire chapter, can partially overlap. They can only contain or be contained within one another, be disjoint, or be directly adjacent.
This constraint means that, implicitly, after collapsing duplicate spans, all of the spans in an ILGS graph form an ordered, rooted containment tree roughly corresponding to the textual, transtextual, and semantic hierarchical structure of a document.
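To illustrate, a set of laminar, deduplicated spans can be assembled into its containment tree in a single pass over the spans sorted by start position (ascending) and end position (descending), using a stack of open ancestors. The `(start, end, label)` tuples and the synthetic `root` node below are stand-ins for illustration, not the actual ILGS representation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    start: int
    end: int  # exclusive
    label: str
    children: list["Node"] = field(default_factory=list)

def build_tree(spans: list[tuple[int, int, str]]) -> Node:
    """Build a containment tree from laminar (well-nested) spans.

    Because no two spans partially overlap, sorting by (start asc, end desc)
    guarantees every span appears immediately after its parent's subtree has
    been opened, so a stack of open ancestors suffices."""
    root = Node(0, max(end for _, end, _ in spans), "root")
    stack = [root]
    for start, end, label in sorted(spans, key=lambda s: (s[0], -s[1])):
        node = Node(start, end, label)
        # Pop ancestors that closed before this span starts.
        while not (stack[-1].start <= start and end <= stack[-1].end):
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

spans = [(0, 100, "chapter"), (0, 40, "section"), (40, 100, "section"), (5, 10, "citation")]
tree = build_tree(spans)
print([child.label for child in tree.children[0].children])  # ['section', 'section']
```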
AI chunking
semchunk’s AI chunking mode uses Kanon 2 Enricher to derive spans corresponding to structural elements in a document that are then iteratively merged and recursively decomposed, producing chunks not exceeding a specified chunk size.
Concretely, semchunk’s AI chunking mode:
- splits text into pre-chunks up to 1,000,000 characters in length using semchunk’s non-AI chunking mode in order to avoid sending excessively long inputs to an enrichment model;
- enriches the resulting pre-chunks with the enrichment model, accumulating a deduplicated set of annotated spans from enriched pre-chunks;
- for each level of depth in the implicit tree formed by the spans, creates new spans where necessary to ensure that all content at that level of depth is covered by a span;
- constructs an explicit global tree of spans based on their containment;
- iterates through the span tree:
- merges spans into chunks until the chunk size is reached or would be exceeded by adding a new span;
- filters out whitespace-only chunks;
- removes leading and trailing whitespace from chunks;
- recursively enters into the children of spans where a span exceeds the chunk size, iterating through these steps in order to arrive at a set of chunks not exceeding the chunk size; and
- falls back to semchunk’s non-AI chunking mode where a span has no children.
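In simplified form, the merge-and-recurse walk over the span tree might look like the following sketch. This is an approximation for illustration only, not semchunk’s actual code: it measures size in characters as a proxy for tokens, assumes a span’s children contiguously cover it (the covering-span step above guarantees this in the real algorithm), and hard-splits leaf spans rather than falling back to semchunk’s non-AI mode.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    start: int
    end: int  # exclusive
    children: list["Span"] = field(default_factory=list)

def chunk_tree(text: str, span: Span, chunk_size: int) -> list[str]:
    """Merge sibling spans into chunks of at most `chunk_size` characters,
    recursing into any span that is itself too large."""
    segment = text[span.start:span.end]
    if len(segment) <= chunk_size:
        return [segment.strip()] if segment.strip() else []
    if not span.children:
        # No finer structure: hard-split (the real algorithm falls back to
        # semchunk's non-AI chunking mode here instead).
        return [segment[i:i + chunk_size] for i in range(0, len(segment), chunk_size)]
    chunks: list[str] = []
    buffer = ""
    for child in span.children:
        if child.end - child.start > chunk_size:
            # Oversized child: flush the buffer and recurse into it.
            if buffer.strip():
                chunks.append(buffer.strip())
            buffer = ""
            chunks.extend(chunk_tree(text, child, chunk_size))
        else:
            piece = text[child.start:child.end]
            if len(buffer) + len(piece) <= chunk_size:
                buffer += piece
            else:
                if buffer.strip():
                    chunks.append(buffer.strip())
                buffer = piece
    if buffer.strip():
        chunks.append(buffer.strip())
    return chunks

# Hypothetical document with four sibling spans under one root span.
root = Span(0, 19, [Span(0, 5), Span(5, 10), Span(10, 15), Span(15, 19)])
print(chunk_tree("AAAA BBBB CCCC DDDD", root, 10))  # ['AAAA BBBB', 'CCCC DDDD']
```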
This algorithm closely mirrors semchunk’s standard chunking mode, with the key difference being that spans extracted by Kanon 2 Enricher are used to produce candidate splits rather than character-based heuristics.
How we evaluate chunking
Although chunking is technically a special case of segmentation, it differs from most other types of segmentation in that an artificial, asemantic context constraint is imposed on the lengths of splits. What makes a split a good segment is, therefore, not necessarily the same as what makes a split a good chunk. Yet past evaluations of chunking algorithms have tended to treat chunking as just another ordinary type of segmentation by bootstrapping existing segmentation benchmarks like Wiki-727K.
Given that chunking is ultimately a means to an end rather than an end in and of itself, we have taken a different approach, opting to evaluate the final impact of chunking algorithms on the downstream correctness of a RAG system.
Concretely, we assess the correctness of semchunk’s AI and non-AI chunking modes, LangChain’s recursive chunking algorithm, chonkie’s recursive and embedding-powered chunking modes, and fixed-size chunking on Legal RAG QA.
Legal RAG QA is a reasoning-intensive benchmark for evaluating the end-to-end performance of RAG systems. It consists of 190 passages and external materials and 138 question–answer–relevant-passages triplets sourced from LibreTexts’ Introduction to Criminal Law textbook. Legal RAG QA originates in an early version of our recently released Legal RAG Bench benchmark that was eventually shelved in favor of a different, larger dataset. Both benchmarks are equally high quality, however.
We select a legal-domain RAG benchmark because legal RAG applications are uniquely sensitive to data and retrieval quality. Unlike with scientific applications, the background knowledge and reasoning faculties of an LLM, however wide and advanced they may be, often cannot compensate for a fundamental inability to access up-to-date legal information. Statute and case law change every day, and such changes cannot be known through reasoning alone.
We select Legal RAG QA, in particular, because, unlike Legal RAG Bench, which is already chunked, passages in Legal RAG QA range widely in length, from 12 to 67,839 tokens.
To minimize confounding variables, we evaluate all chunking algorithms on Legal RAG QA using the same barebones LangChain-based RAG pipeline without modifying any hyperparameters from their default values.
We select two frontier production models as our retrieval and generative engines, Kanon 2 Embedder and Gemini 3 Pro, respectively. We use Gemini 3 Pro to both generate answers and separately score their ‘correctness’. A response is correct if it entails the right answer and is incorrect if it does not.
We use a chunk size of 256 tokens as determined by the Kanon 2 tokenizer in order to simulate a context-scarce environment where chunking is likely to matter most. Simultaneously, we supply our generative model with the single most relevant passage retrieved by our embedding model in order to minimize any possibility of our embedding model compensating for poor chunking by effectively restitching interrupted context together.
How semchunk performs
Our evaluation reveals that the choice of chunking algorithm can have a significant impact on RAG correctness. The best-performing chunking algorithm, semchunk’s AI chunking mode, achieves 15.6% higher correctness than the worst-performing, chonkie’s recursive and embedding-powered chunking modes.
Notably, chonkie’s recursive- and embedding-based chunking modes perform exactly the same, resulting in 2.1% lower correctness than fixed-size chunking.
Conversely, LangChain’s recursive chunking algorithm achieves 4.5% better correctness than fixed-size chunking, with vanilla semchunk in turn outperforming LangChain by 2%, and semchunk’s AI chunking mode outperforming vanilla semchunk by 6.2%.
Below, we present an example of how a chunking failure can lead to a RAG failure. Given a question about the outcome of a particular case, our retrieval engine retrieves two different relevant chunks. The first, generated by LangChain, is missing a critical line stating the court’s finding, while the second, generated by semchunk’s AI chunking mode, preserves that line intact along with its preceding context, enabling Gemini 3 Pro to correctly answer the question.
Question
Read Commonwealth v. Casanova, 429 Mass. 293 (1999). In Casanova, the defendant shot the victim in 1991, paralyzing him. The defendant was convicted of assault with intent to murder and two firearms offenses. In 1996, the victim died. The defendant was thereafter indicted for his murder. Massachusetts had abolished the year and a day rule in 1980. Did the Massachusetts Supreme Judicial Court uphold the indictment, or did the court establish a new death timeline rule?
Retrieved context
Pushing the frontier
semchunk’s new AI chunking mode represents a major leap forward for semantic chunking and RAG in general. On Legal RAG QA, it delivers an increase of 6% in RAG correctness over the next best approach, vanilla semchunk, and 12% over fixed-size chunking. These findings highlight not only the potential of AI-powered chunking for improving RAG performance but also the detrimental effect poor chunking can have on RAG systems, particularly in context-constrained environments.
Moving forward, we will continue to improve semchunk and our enrichment and hierarchical segmentation capabilities, thereby extending the frontier in RAG and context engineering for all. The best way to stay updated is to sign up for our platform and then follow us on LinkedIn, Twitter/X, or Reddit.