TextSplitter
- class hana_ml.text.text_splitter.TextSplitter(chunk_size=None, overlap=None, strip_whitespace=None, keep_separator=None, thread_ratio=None, split_type=None, doc_type=None, language=None, separator=None)
For a long text, it may be necessary to transform it so that it better suits your application. The text chunking procedure provides methods to split a long text into smaller chunks that can fit into a specific model's context window.
At a high level, text splitters work as follows:
1. Split the text into small, semantically meaningful chunks (often sentences).
2. Combine these small chunks into a larger chunk until a certain size is reached (as measured by some length function).
3. Once a chunk reaches that size, make it its own piece of text and start a new chunk with some overlap, so that context is kept between chunks (a minimal sketch of this idea follows).
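The lines below are an illustrative sketch of the character-based variant of this idea in plain Python; they are not the PAL implementation behind TextSplitter, and chunk_size and overlap play the same roles as the parameters described below.
>>> def chunk_text(text, chunk_size=30, overlap=5):
...     # each chunk starts (chunk_size - overlap) characters after the previous one,
...     # so consecutive chunks share `overlap` characters of context
...     step = chunk_size - overlap
...     return [text[i:i + chunk_size] for i in range(0, len(text), step)]
>>> chunk_text("The quick brown fox jumps over the lazy dog.", chunk_size=20, overlap=5)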
The splitting methods are as follows:
Character splitter: Split the text based on a character, even if it splits a whole word into two chunks.
Recursive: Recursive chunking based on a list of separators.
Document: Various chunking methods for different document types (PlainText, HTML) and different languages (English, Chinese, Japanese, German, French, Spanish, Portuguese).
Character Splitting
Character splitting is the most basic form of splitting up the text. It is the process of simply dividing the text into N-character sized chunks regardless of their content or form.
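For example, the character splitter is selected through the split_type parameter. The sketch below assumes data is a HANA DataFrame with ID and TEXT columns, as described under split_text further down:
>>> splitter = TextSplitter(split_type='char', chunk_size=100, overlap=10)
>>> res = splitter.split_text(data)   # data: DataFrame with ID and TEXT columns
>>> print(res.collect())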
Recursive Character Text Splitting
The problem with the Character splitter is that it does not take into account the structure of our document at all. It simply splits by a fixed number of characters.
The Recursive Character Text Splitter helps with this. It uses a series of separators to split the text; the default separators are listed below (a usage sketch follows the list), and you can customize them via the separator parameter:
"\n\n" - Double new line, or most commonly paragraph breaks
"\n" - New lines
" " - Spaces
"" - Characters
Document Specific Splitting
Document-specific splitting adapts the chunking strategy to different data formats and languages.
The PlainText and HTML splitters are similar to the Recursive Character splitter, but use different separators for different document types and languages, as shown below (a usage sketch follows the separator lists).
PlainText with English
"\n\n",
"\n",
" ".
PlainText with Chinese
"\n\n",
"\n",
"。",
" ".
- Parameters:
- chunk_size : int, optional
Maximum size of chunks to return.
Defaults to 30.
- overlap : int, optional
Overlap in characters between chunks.
Defaults to 0.
- strip_whitespace : bool, optional
Whether to strip whitespace from the start and end of every chunk.
Defaults to False.
- keep_separator : bool, optional
Whether to keep the separator and where to place it in each corresponding chunk.
Defaults to False.
- thread_ratio : float, optional
The ratio of available threads for multi-thread task:
0: single thread.
0–1: uses the specified percentage of available threads. PAL uses all available threads if the number is 1.0.
Defaults to 1.0.
- split_type : str, optional
Configuration for the splitting type of all elements:
'char': character splitting.
'recursive': recursive splitting.
'document': document splitting.
Defaults to 'recursive'.
- doc_type : str, optional
Configuration for document type of all elements:
'plain': plain text.
'html': html text.
Only valid when split_type is 'document'.
Defaults to 'plain'.
- language : str, optional
Configuration for language of all elements:
'auto': auto detect.
'en': English.
'zh': Chinese.
'ja': Japanese.
'de': German.
'fr': French.
'es': Spanish.
'pt': Portuguese.
Only valid when split_type is 'document' and doc_type is 'plain'.
Defaults to 'auto'.
- separator : str, optional
Configuration for splitting separators of all elements.
No default value.
Examples
>>> textsplitter = TextSplitter(chunk_size=300)
>>> res = textsplitter.split_text(data)
>>> print(res.collect())
>>> print(textsplitter.statistics_.collect())
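The document splitter can also be applied to HTML input. The following sketch assumes html_data is a HANA DataFrame with ID and TEXT columns holding HTML content:
>>> html_splitter = TextSplitter(split_type='document', doc_type='html', chunk_size=300)
>>> res = html_splitter.split_text(html_data)
>>> print(res.collect())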
- Attributes:
- statistics_ : DataFrame
Statistics.
Methods
split_text(data[, order_status, ...])
Split the text into smaller chunks and return the result.
- split_text(data, order_status=False, specific_split_type=None, specific_doc_type=None, specific_language=None, specific_separator=None)
Split the text into smaller chunks and return the result.
- Parameters:
- data : DataFrame
The input data, structured as follows:
ID: type VARCHAR, NVARCHAR, or INTEGER, the text ID.
TEXT: type VARCHAR, NVARCHAR, or NCLOB, the text content.
- order_status : bool, optional
Specifies whether or not to order the text chunks generated by the splitter.
Defaults to False.
- specific_split_type : dict, optional
Specifies the split type (different from the global split type) for specific text elements, given as a dict where keys are document IDs and values are valid split types.
Defaults to None.
- specific_doc_type : dict, optional
Specifies the doc type (different from the global doc type) for specific text elements, given as a dict where keys are document IDs and values are valid doc types.
Defaults to None.
- specific_language : dict, optional
Specifies the language (different from the global language) for specific text elements, given as a dict where keys are document IDs and values are valid language abbreviations supported by the algorithm.
Defaults to None.
- specific_separator : dict, optional
Specifies the separators (different from the global separator) for specific text elements, given as a dict where keys are document IDs and values are valid separators.
Defaults to None.
- Returns:
- DataFrame
The result of the text split.
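For example, the specific_* parameters allow per-document overrides, passed as dicts keyed by document ID. The sketch below assumes the document whose ID is 1 should be split by character while all other documents use the splitter's global settings:
>>> res = textsplitter.split_text(data, order_status=True,
...                               specific_split_type={1: 'char'})
>>> print(res.collect())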
Inherited Methods from PALBase
Besides the methods mentioned above, the TextSplitter class also inherits methods from the PALBase class; please refer to PAL Base for more details.