TextSplitter

class hana_ml.text.text_splitter.TextSplitter(chunk_size=None, overlap=None, strip_whitespace=None, keep_separator=None, thread_ratio=None, split_type=None, doc_type=None, language=None, separator=None)

A long text may need to be transformed to better suit an application. The text chunking procedure provides methods to split a long text into smaller chunks that can fit into a specific model's context window.

At a high level, text splitters work as follows:

  1. Split the text into small, semantically meaningful chunks (often sentences).

  2. Combine these small chunks into a larger chunk until you reach a certain size (as measured by some function).

  3. When it reaches that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
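The three steps above can be sketched in plain Python. This is a simplified illustration of the combining loop, not the PAL implementation, and the helper name combine_chunks is hypothetical:

```python
def combine_chunks(pieces, chunk_size, overlap):
    """Greedily merge small pieces into chunks of at most chunk_size
    characters, carrying the last `overlap` characters of a finished
    chunk into the next one to preserve context between chunks."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            # seed the next chunk with the tail of the previous one
            current = current[-overlap:] if overlap else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

With chunk_size=5 and overlap=2, the pieces ["aaa", "bbb", "ccc"] yield ["aaa", "aabbb", "bbccc"]: each new chunk starts with the last two characters of its predecessor.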

The splitting methods are as follows:

  • Character splitter: splits the text into fixed-size chunks of characters, even if that splits a whole word across two chunks.

  • Recursive: Recursive chunking based on a list of separators.

  • Document: Various chunking methods for different document types (PlainText, HTML) and different languages (English, Chinese, Japanese, German, French, Spanish, Portuguese).

Character Splitting

Character splitting is the most basic form of splitting up the text. It is the process of simply dividing the text into N-character sized chunks regardless of their content or form.
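As an illustration, fixed-size character chunking with optional overlap can be sketched as follows. The char_split helper is hypothetical and not part of hana_ml; the actual split runs inside SAP HANA:

```python
def char_split(text, chunk_size, overlap=0):
    """Split text into fixed-size character chunks, stepping by
    chunk_size - overlap so consecutive chunks share `overlap`
    characters, with no regard for word or sentence boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

For example, char_split("abcdefghij", 4) produces ["abcd", "efgh", "ij"], cutting wherever the character count dictates.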

Recursive Character Text Splitting

The problem with the Character splitter is that it does not take into account the structure of our document at all. It simply splits by a fixed number of characters.

The Recursive Character Text Splitter helps with this. It specifies a series of separators that are used to split the text. The default separators are as follows, and they can be customized:

  • "\n\n" - Double new line, or most commonly paragraph breaks

  • "\n" - New lines

  • " " - Spaces

  • "" - Characters
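The recursive strategy can be sketched as follows. This is a simplified pure-Python illustration; recursive_split is a hypothetical helper, and the real splitting runs inside SAP HANA:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try each separator in order: split on the first one present in
    the text, recurse into pieces that are still too large, and fall
    back to plain character splitting when no separator matches."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # last resort: fixed-size character chunks
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, separators))
    return chunks
```

Because paragraph breaks are tried before newlines and spaces, chunks tend to respect the document's structure instead of cutting mid-word.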

Document Specific Splitting

Document-specific splitting is about making the chunking strategy fit different data formats and languages.

The PlainText and HTML splitters behave like the recursive character splitter, but with separators tailored to the document type and language.

PlainText with English

  • "\n\n",

  • "\n",

  • " ".

PlainText with Chinese

  • "\n\n",

  • "\n",

  • "。",

  • " ".
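To illustrate how such per-language lists are applied, the following sketch selects the first listed separator that occurs in a text. The SEPARATORS table and pick_separator function are hypothetical names; the actual tables live inside the PAL procedure:

```python
# Separator lists keyed by (doc_type, language); the values mirror the
# lists above and are illustrative only.
SEPARATORS = {
    ("plain", "en"): ["\n\n", "\n", " "],
    ("plain", "zh"): ["\n\n", "\n", "\u3002", " "],  # \u3002 is "。"
}

def pick_separator(text, doc_type, language):
    """Return the first listed separator found in the text, or None."""
    for sep in SEPARATORS[(doc_type, language)]:
        if sep in text:
            return sep
    return None
```

A Chinese sentence without newlines would thus be split on the ideographic full stop "。", while English text falls through to spaces.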

Parameters:
chunk_size : int, optional

Maximum size of chunks to return.

Defaults to 30.

overlap : int, optional

Overlap in characters between chunks.

Defaults to 0.

strip_whitespace : bool, optional

Whether to strip whitespace from the start and end of every chunk.

Defaults to False.

keep_separator : bool, optional

Whether to keep the separator and where to place it in each corresponding chunk.

Defaults to False.

thread_ratio : float, optional

The ratio of available threads for multi-thread task:

  • 0: single thread.

  • 0–1: uses the specified fraction of available threads; 1.0 means PAL uses all available threads.

Defaults to 1.0.

split_type : str, optional

Configuration for the splitting type of all elements:

  • 'char': character splitting.

  • 'recursive': recursive splitting.

  • 'document': document splitting.

Defaults to 'recursive'.

doc_type : str, optional

Configuration for document type of all elements:

  • 'plain': plain text.

  • 'html': html text.

Only valid when split_type is 'document'.

Defaults to 'plain'.

language : str, optional

Configuration for language of all elements:

  • 'auto': auto detect.

  • 'en': English.

  • 'zh': Chinese.

  • 'ja': Japanese.

  • 'de': German.

  • 'fr': French.

  • 'es': Spanish.

  • 'pt': Portuguese.

Only valid when split_type is 'document' and doc_type is 'plain'.

Defaults to 'auto'.

separator : str, optional

Configuration for splitting separators of all elements.

No default value.

Examples

>>> textsplitter = TextSplitter(chunk_size=300)
>>> res = textsplitter.split_text(data)
>>> print(res.collect())
>>> print(textsplitter.statistics_.collect())
Attributes:
statistics_ : DataFrame

Statistics.

Methods

split_text(data[, order_status, ...])

Split the text into smaller chunks and return the result.

split_text(data, order_status=False, specific_split_type=None, specific_doc_type=None, specific_language=None, specific_separator=None)

Split the text into smaller chunks and return the result.

Parameters:
data : DataFrame

The input data, structured as follows:

  • ID: type VARCHAR, NVARCHAR, INTEGER, the text id.

  • TEXT: type VARCHAR, NVARCHAR, NCLOB, the text content.

order_status : bool, optional

Specifies whether or not to order the text chunks generated by the splitter.

Defaults to False.

specific_split_type : dict, optional

Specifies the split type (different from the global split type) for specific text elements in a dict, where keys are document IDs and values should be valid split types.

Defaults to None.

specific_doc_type : dict, optional

Specifies the doc type (different from the global doc type) for specific text elements in a dict, where keys are document IDs and values should be valid doc types.

Defaults to None.

specific_language : dict, optional

Specifies the language (different from the global language) for specific text elements in a dict, where keys are document IDs and values should be valid language abbreviations supported by the algorithm.

Defaults to None.

specific_separator : dict, optional

Specifies the separators (different from the global separator) for specific text elements in a dict, where keys are document IDs and values should be valid separators.

Defaults to None.

Returns:
DataFrame

The result of the text split.

Inherited Methods from PALBase

Besides the methods mentioned above, the TextSplitter class also inherits methods from the PALBase class; please refer to PAL Base for more details.