PALEmbeddings
- class hana_ml.text.pal_embeddings.PALEmbeddings(model_version=None, max_token_num=None, pca_dim_num=None)
Embeds input documents into vectors.
- Parameters
- model_version: {'SAP_NEB.20240715', 'SAP_GXY.20250407'}, optional
Model version to use.
Options:
'SAP_NEB.20240715'
'SAP_GXY.20250407'
Defaults to None, which selects 'SAP_NEB.20240715'.
- max_token_num: int, optional
Maximum number of tokens per document. The upper limit depends on the embedding model.
- pca_dim_num: int, optional
If set, applies PCA to reduce the dimensionality of the embeddings to the specified number.
- Attributes
- result_DataFrame
The embedding result.
- stat_DataFrame
The statistics.
Methods
- fit_transform(data, key, target[, ...])
Embed input documents into vectors.
Examples
Suppose you have a HANA DataFrame df with columns 'ID' and 'TEXT'. To embed the documents into vectors, create a PALEmbeddings instance and call fit_transform:
>>> from hana_ml.text.pal_embeddings import PALEmbeddings
>>> embedder = PALEmbeddings(model_version='SAP_GXY.20250407')
>>> result = embedder.fit_transform(data=df, key='ID', target='TEXT')
>>> # The result is a DataFrame with the original data and embedding columns
>>> print(result.collect())
If the DataFrame has more than one text column, you can embed several of them in a single call:
>>> embedder = PALEmbeddings(model_version='SAP_GXY.20250407')
>>> result = embedder.fit_transform(data=df, key='ID', target=['TEXT1', 'TEXT2'])
>>> print(result.collect())
- fit_transform(data, key, target, thread_number=None, batch_size=None, is_query=None, max_token_num=None)
Embed input documents into vectors.
- Parameters
- data: DataFrame
Input data containing the documents to embed.
- key: str
Name of the key column.
- target: str or list of str
Name(s) of the text column(s) to embed.
- thread_number: int, optional
Number of HTTP connections to the backend embedding service (1-10).
Defaults to 6.
- batch_size: int, optional
Number of documents batched per request (1-50).
Defaults to 10.
- is_query: bool, optional
If True, use query embedding for Asymmetric Semantic Search.
Defaults to False.
- max_token_num: int, optional
Maximum number of tokens per document. The upper limit depends on the embedding model:
'SAP_NEB.20240715': 1024 (default is 256 if not set)
'SAP_GXY.20250407': 1024 (default is 512 if not set)
Defaults to None, in which case the default value of the selected model version is used.
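The ranges and defaults listed for thread_number, batch_size, and max_token_num can be checked on the client before calling fit_transform. The following is a minimal sketch, not part of hana_ml: resolve_options and request_count are hypothetical helper names, the lower bound of 1 for max_token_num is an assumption, and the idea that each request carries at most batch_size documents is inferred from the parameter description rather than confirmed behavior.

```python
import math

# Limits and defaults copied from the parameter descriptions above.
# These helpers are hypothetical and are not part of hana_ml.
MODEL_TOKEN_LIMITS = {
    "SAP_NEB.20240715": {"default": 256, "max": 1024},
    "SAP_GXY.20250407": {"default": 512, "max": 1024},
}
DEFAULT_MODEL = "SAP_NEB.20240715"


def resolve_options(model_version=None, thread_number=None,
                    batch_size=None, max_token_num=None):
    """Return the effective settings, raising ValueError when out of range."""
    model = DEFAULT_MODEL if model_version is None else model_version
    if model not in MODEL_TOKEN_LIMITS:
        raise ValueError(f"unknown model_version: {model!r}")
    limits = MODEL_TOKEN_LIMITS[model]

    threads = 6 if thread_number is None else thread_number
    if not 1 <= threads <= 10:
        raise ValueError("thread_number must be between 1 and 10")

    batch = 10 if batch_size is None else batch_size
    if not 1 <= batch <= 50:
        raise ValueError("batch_size must be between 1 and 50")

    # Assumption: at least one token must be allowed per document.
    tokens = limits["default"] if max_token_num is None else max_token_num
    if not 1 <= tokens <= limits["max"]:
        raise ValueError(
            f"max_token_num must be between 1 and {limits['max']}")

    return {"model_version": model, "thread_number": threads,
            "batch_size": batch, "max_token_num": tokens}


def request_count(n_documents, batch_size=10):
    """Batches needed if each request carries up to batch_size documents."""
    return math.ceil(n_documents / batch_size)
```

For example, resolve_options(model_version='SAP_GXY.20250407') yields max_token_num 512, and request_count(95, 10) gives 10 batches under the stated assumption.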
- Returns
- DataFrame
DataFrame containing the original data and embedding columns.
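The is_query flag implies that documents and search queries are embedded in separate calls and compared afterwards. The sketch below illustrates one way such a comparison could be done on the client, assuming the embedding vectors have already been fetched as plain Python lists. Cosine similarity is a common choice for ranking embeddings but is not prescribed by this documentation; the function names and the toy vectors are made up for illustration and are not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, doc_vectors):
    """Return document ids sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine_similarity(query_vector, vec)
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings have many more dimensions.
doc_vectors = {
    1: [1.0, 0.0, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.7, 0.7, 0.0],
}
query_vector = [0.9, 0.1, 0.0]

ranking = rank_documents(query_vector, doc_vectors)
print(ranking)  # [1, 3, 2]
```

In this setup the document vectors would come from a call with is_query left at its default, and the query vector from a call with is_query=True.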