PALEmbeddings

class hana_ml.text.pal_embeddings.PALEmbeddings(model_version=None, max_token_num=None, pca_dim_num=None)

Embeds input documents into vectors.

Parameters

model_version: {'SAP_NEB.20240715', 'SAP_GXY.20250407'}, optional

Embedding model version to use.

Options:

'SAP_NEB.20240715'
'SAP_GXY.20250407'

Defaults to None, in which case 'SAP_NEB.20240715' is used.

max_token_num: int, optional

Maximum number of tokens per document. The upper limit depends on the embedding model.

pca_dim_num: int, optional

If set, applies PCA to reduce the dimensionality of the embeddings to the specified number.

Attributes

result_: DataFrame

The embedding result.

stat_: DataFrame

The statistics.

Methods

fit_transform(data, key, target[, ...])

Embed input documents into vectors.

Examples

Suppose you have a HANA DataFrame df with columns 'ID' and 'TEXT'. To embed the documents into vectors, create a PALEmbeddings instance and call fit_transform:

>>> from hana_ml.text.pal_embeddings import PALEmbeddings
>>> embedder = PALEmbeddings(model_version='SAP_GXY.20250407')
>>> result = embedder.fit_transform(data=df, key='ID', target='TEXT')
>>> # The result is a DataFrame with the original data and embedding columns
>>> print(result.collect())

To embed several text columns in one call, pass a list of column names as target (here assuming df has text columns 'TEXT1' and 'TEXT2'):

>>> embedder = PALEmbeddings(model_version='SAP_GXY.20250407')
>>> result = embedder.fit_transform(data=df, key='ID', target=['TEXT1', 'TEXT2'])
>>> print(result.collect())
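
The pca_dim_num parameter asks for the embeddings to be reduced to a given number of dimensions with PCA. The reduction itself runs inside PAL, and its implementation is not described here. The pure-Python sketch below only illustrates the effect on the shape of the vectors, using a simple power-iteration PCA. The function name pca_reduce and the toy 3-dimensional vectors are hypothetical stand-ins, not part of hana_ml.

```python
import math


def pca_reduce(vectors, k, iters=200):
    """Project row vectors onto their top-k principal components.

    Illustrative only: a plain power-iteration PCA, not PAL's implementation.
    """
    n, d = len(vectors), len(vectors[0])
    # Center every dimension on its mean.
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    components = []
    for _ in range(k):
        # A fixed start vector keeps the run deterministic.
        w = [1.0 / math.sqrt(d)] * d
        for _step in range(iters):
            nxt = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break  # no variance left along any remaining direction
            w = [x / norm for x in nxt]
        # The Rayleigh quotient estimates the eigenvalue for direction w.
        eigval = sum(w[a] * cov[a][b] * w[b] for a in range(d) for b in range(d))
        # Deflate so the next pass finds the next-largest component.
        cov = [[cov[a][b] - eigval * w[a] * w[b] for b in range(d)]
               for a in range(d)]
        components.append(w)

    # Coordinates of each centered row along the k components.
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


# Toy vectors standing in for embeddings (assumption: real ones are much wider).
embeddings = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-2.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
reduced = pca_reduce(embeddings, k=2)
print(len(reduced), len(reduced[0]))
```

Each input row keeps its position in the output, but now has k values instead of the original dimensionality, which is the trade-off pca_dim_num makes: smaller vectors at the cost of discarding low-variance directions.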

fit_transform(data, key, target, thread_number=None, batch_size=None, is_query=None, max_token_num=None)

Embed input documents into vectors.

Parameters

data: DataFrame

Input data containing the documents to embed.

key: str

Name of the key column.

target: str or list of str

Name(s) of the text column(s) to embed.

thread_number: int, optional

Number of HTTP connections to the backend embedding service (1-10).

Defaults to 6.

batch_size: int, optional

Number of documents batched per request (1-50).

Defaults to 10.

is_query: bool, optional

If True, use query embedding for Asymmetric Semantic Search.

Defaults to False.

max_token_num: int, optional

Maximum number of tokens per document. The upper limit and the default depend on the embedding model:

'SAP_NEB.20240715': maximum 1024, default 256
'SAP_GXY.20250407': maximum 1024, default 512

Defaults to None, in which case the default value of the selected embedding model is used.
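
As a rough illustration of how the defaults and ranges listed above combine, the following sketch fills in unset values and checks them before a call. The helper resolve_settings is hypothetical and not part of hana_ml. Rejecting out-of-range values with ValueError is this sketch's own choice, since the source does not say how PAL handles them. The request count is only an estimate for a single text column, derived from batch_size.

```python
import math

# Limits and defaults as listed in the parameter descriptions above.
MODEL_TOKEN_LIMITS = {
    "SAP_NEB.20240715": {"max": 1024, "default": 256},
    "SAP_GXY.20250407": {"max": 1024, "default": 512},
}
DEFAULT_MODEL = "SAP_NEB.20240715"


def resolve_settings(n_docs, model_version=None, thread_number=None,
                     batch_size=None, max_token_num=None):
    """Hypothetical helper: apply documented defaults and check ranges."""
    model = DEFAULT_MODEL if model_version is None else model_version
    if model not in MODEL_TOKEN_LIMITS:
        raise ValueError(f"unknown model_version: {model!r}")
    limits = MODEL_TOKEN_LIMITS[model]

    threads = 6 if thread_number is None else thread_number
    batch = 10 if batch_size is None else batch_size
    tokens = limits["default"] if max_token_num is None else max_token_num

    if not 1 <= threads <= 10:
        raise ValueError("thread_number must be between 1 and 10")
    if not 1 <= batch <= 50:
        raise ValueError("batch_size must be between 1 and 50")
    # Assumption: values above the model maximum are rejected in this sketch.
    if not 1 <= tokens <= limits["max"]:
        raise ValueError(f"max_token_num must be between 1 and {limits['max']}")

    return {
        "model_version": model,
        "thread_number": threads,
        "batch_size": batch,
        "max_token_num": tokens,
        # Estimated number of requests for one text column.
        "requests": math.ceil(n_docs / batch),
    }


settings = resolve_settings(n_docs=95, model_version="SAP_GXY.20250407")
print(settings)
```

The resolved values could then be passed to fit_transform as thread_number, batch_size, and max_token_num.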

Returns

DataFrame: DataFrame containing the original data and embedding columns.
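
The is_query flag is meant for asymmetric semantic search, where documents and queries are embedded differently. One plausible workflow is to embed the documents with is_query=False, embed the search text with is_query=True, and then rank the documents by similarity to the query. The sketch below shows only the ranking step with cosine similarity. The toy 3-dimensional vectors are hypothetical stand-ins for real embeddings, and how the vectors are read out of the returned DataFrame is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Assumption: toy vectors keyed by document ID, as if produced by a
# fit_transform call with is_query=False.
doc_vectors = {
    1: [0.9, 0.1, 0.0],
    2: [0.1, 0.8, 0.1],
    3: [0.7, 0.3, 0.1],
}
# Assumption: toy vector for the search text, as if produced with is_query=True.
query_vector = [1.0, 0.2, 0.0]

# Rank document IDs from most to least similar to the query.
ranked = sorted(doc_vectors,
                key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
                reverse=True)
print(ranked)  # [1, 3, 2]
```

Cosine similarity is used here because it ignores vector length and compares direction only; other similarity measures could be substituted.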

Inherited Methods from PALBase

In addition to the methods described above, the PALEmbeddings class inherits methods from the PALBase class. Refer to PAL Base for details.