hana_ml.dataframe
This module represents a database query as a dataframe. Most operations are designed to not bring data back from the database unless explicitly requested.
The following classes and functions are available:
- hana_ml.dataframe.quotename(name)
Escapes a schema, table, or column name for use in SQL. hana_ml functions and methods that take schema, table, or column names already escape their input by default, but those that take SQL don't (and can't) perform escaping automatically.
- Parameters:
- namestr
The schema, table, or column name.
- Returns:
- str
The escaped name. The string is surrounded in quotation marks, and existing quotation marks are escaped by doubling them.
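Examples
A minimal sketch of composing SQL with quotename (the table name is illustrative):
>>> query = 'SELECT * FROM {}'.format(quotename('My Table'))
>>> query
'SELECT * FROM "My Table"'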
- class hana_ml.dataframe.ConnectionContext(address='', port=0, user='', password=None, autocommit=True, packetsize=None, userkey=None, spatialtypes=1, encrypt=None, sslValidateCertificate=None, pyodbc_connection=None, abap_sql=False, sslKeyStore=None, **properties)
Bases:
object
Represents a connection to a SAP HANA database instance.
ConnectionContext includes methods for creating DataFrames from data in SAP HANA. DataFrames are tied to a ConnectionContext, and are unusable once their ConnectionContext is closed.
- Parameters:
- Same as hdbcli.dbapi.connect.
Please see the online docs for hdbcli.dbapi.connect for more details.
Examples
Querying data from SAP HANA into a Pandas DataFrame:
>>> with ConnectionContext('address', port, 'user', 'password') as cc:
...     df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
...             .filter('COL3 > 5')
...             .select('COL1', 'COL2'))
...     pandas_df = df.collect()
The underlying hdbcli.dbapi.connect can be accessed if necessary:
>>> with ConnectionContext('127.0.0.1', 30215, 'MLGUY', 'manager') as cc:
...     cc.connection.setclientinfo('SOMEKEY', 'somevalue')
...     df = cc.sql('some sql that needs that session variable')
...     ...
- Attributes:
- connectionhdbcli.dbapi.connect
The underlying dbapi connection. Use this connection to run SQL directly, or to access connection methods like getclientinfo/setclientinfo.
- enable_abap_sql()
Enables ABAP SQL.
- disable_abap_sql()
Disables ABAP SQL.
- close()
Closes the existing connection to a SAP HANA database instance.
- Parameters:
- None
- Returns:
- None
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.close()
- add_primary_key(table, columns, schema=None)
Adds a primary key to an existing table.
- Parameters:
- tablestr
Table name.
- columnsstr or list of str
The column(s) that form the primary key.
- schemastr, optional
Schema name. If None, use the current schema.
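Examples
cc is a connection to a SAP HANA database instance. A minimal sketch; the table and column names are illustrative:
>>> cc.add_primary_key('TEST_TBL', columns='ID')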
- copy()
Returns a new connection context.
- create_schema(schema)
Creates a SAP HANA schema.
- Parameters:
- schemastr
Schema name.
- create_table(table, table_structure, schema=None, table_type='COLUMN', prop='', data_lake=False, data_lake_container='SYSRDL#CG')
Creates a SAP HANA table.
- Parameters:
- tablestr
Table name.
- table_structuredict
SAP HANA table structure. {Column name: Column type, ...}
- schemastr, optional
Schema name. If None, use the current schema.
Defaults to None.
- table_typestr, optional
Specify the table type.
'COLUMN', by default.
'ROW'.
'TEMPORARY'.
- data_lakebool, optional
If True, create the data lake table by using SYSRDL#CG.REMOTE_EXECUTE().
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.create_table(table='TEST_TBL', table_structure={'test_name': 'VARCHAR(50)'})
- create_virtual_table(table, data_lake_table=None, schema=None, data_lake_container='SYSRDL#CG', remote_source=None, remote_database='NULL', remote_schema=None, remote_table=None)
Creates a SAP HANA virtual table at a remote or data lake source.
- Parameters:
- tablestr
HANA virtual table name.
- data_lake_tablestr, optional
HANA data lake table name.
Mandatory if remote_table is None.
- schemastr, optional
Schema name. If None, use the current schema.
Defaults to None.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- remote_sourcestr, optional
Remote source where the data of the target virtual table resides.
Mandatory and valid only if data_lake_table is None.
- remote_databasestr, optional
Database of the remote source where the data of the target virtual table resides.
Defaults to 'NULL', i.e. the default database.
Valid only if data_lake_table is None.
- remote_schemastr, optional
The schema under which the corresponding remote_table of the target SAP HANA virtual table resides.
Required if data_lake_table is None.
Defaults to None.
- remote_tablestr, optional
The table name in the remote source where the data of the target virtual table resides.
Mandatory and valid only when data_lake_table is None.
Defaults to None.
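Examples
cc is a connection to a SAP HANA database instance. A hedged sketch of creating a virtual table over a remote source; the remote source, schema, and table names are illustrative:
>>> cc.create_virtual_table(table='MY_VIRTUAL_TBL',
...                         remote_source='MY_REMOTE_SOURCE',
...                         remote_schema='REMOTE_SCHEMA',
...                         remote_table='REMOTE_TBL')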
- drop_procedure(proc, schema=None, drop_option=None)
Drops the specified procedure.
- Parameters:
- procstr
Procedure name.
- schemastr, optional
Schema name. If None, use the current schema.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_procedure(proc='TEST_PROC')
- drop_view(view, schema=None, drop_option=None)
Drops the specified view.
- Parameters:
- viewstr
View name.
- schemastr, optional
Schema name. If None, use the current schema.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_view(view='TEST_VIEW')
- drop_table(table, schema=None, data_lake=False, data_lake_container='SYSRDL#CG', drop_option=None)
Drops the specified table.
- Parameters:
- tablestr
Table name.
- schemastr, optional
Schema name. If None, use the current schema.
- data_lakebool, optional
If True, drop the data lake table.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_table(table='TEST_TBL')
- copy_to_data_lake(data, virtual_table, data_lake_table, schema=None, append=False, data_lake_container='SYSRDL#CG')
Copies HANA data to a data lake table.
- Parameters:
- dataDataFrame
HANA DataFrame.
- virtual_tablestr
HANA virtual table name.
- data_lake_tablestr
HANA data lake table name.
- schemastr, optional
Schema name. If None, use the current schema.
- appendbool, optional
Append data to the existing data lake table.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
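Examples
A hedged sketch (df is a SAP HANA DataFrame; the table names are illustrative):
>>> cc.copy_to_data_lake(data=df, virtual_table='MY_VIRTUAL_TBL', data_lake_table='MY_DL_TBL')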
- explain_plan_statement(statement_name, subquery, force=True)
Evaluates the execution plan that the database follows when executing an SQL statement and returns the result.
- Parameters:
- statement_namestr
Specifies the name of a specific execution plan in the output table for a given SQL.
- subquerystr
Specifies the subquery to explain the plan for.
- forcebool, optional
If True, deletes any existing result stored under the same statement_name.
Defaults to True.
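Examples
A hedged sketch (the statement name and subquery are illustrative):
>>> cc.explain_plan_statement('my_plan', 'SELECT * FROM MY_TABLE')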
- has_schema(schema)
Returns a boolean value indicating whether the schema exists.
- Parameters:
- schemastr
Schema name.
- Returns:
- bool
Schema existence.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.has_schema(schema='MY_SCHEMA')
True
- has_table(table, schema=None)
Returns a boolean value indicating whether the table exists.
- Parameters:
- tablestr
Table name.
- schemastr, optional
Schema name. If None, use the current schema.
- Returns:
- bool
Table existence.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.has_table(table='TEST_TBL')
True
- hana_version()
Returns the version of a SAP HANA database instance.
- Parameters:
- None
- Returns:
- str
The SAP HANA version.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.hana_version()
'4.50.000.00.1581545459 (master)'
- get_current_schema()
Returns the current schema name.
- Parameters:
- None
- Returns:
- str
The current schema name.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.get_current_schema()
'TEST'
- get_tables(schema=None)
Returns the list of tables in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- get_procedures(schema=None)
Returns the list of procedures in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- get_temporary_tables(schema=None, connection_id=None, list_other_connections=False)
Returns the list of temporary tables in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- connection_idint, optional
If None, it returns the temporary tables from the current connection.
- list_other_connectionsbool, optional
If True, it also outputs the temporary tables from other connections.
- get_connection_id()
Returns the connection id.
- cancel_session_process(connection_id=None)
Cancels the current process in the given session. If the connection_id is not provided, it will use the current connection.
- Parameters:
- connection_idint, optional
Connection id.
Defaults to the current connection.
- restart_session(connection_id=None)
Terminates the current session, drops all the temporary tables and starts a new one.
- Parameters:
- connection_idint, optional
Connection id.
Defaults to the current connection.
- clean_up_temporary_tables()
Drops all the temporary tables under the current schema.
- hana_major_version()
Returns the major version number of the SAP HANA instance.
- Parameters:
- None
- Returns:
- str
The major version number of the SAP HANA instance.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.hana_major_version()
'4'
- is_cloud_version()
Checks whether the SAP HANA database instance is a cloud version or on-premise.
- Parameters:
- None
- Returns:
- bool
True if the SAP HANA instance is a cloud version.
- sql(sql)
Returns a SAP HANA DataFrame representing a query.
- Parameters:
- sqlstr
SQL query. The last statement must be a SELECT statement.
- Returns:
- DataFrame
The DataFrame for the query.
Examples
cc is a connection to a SAP HANA database instance.
>>> df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
- execute_sql(sql)
Multiline SQL execution.
- Parameters:
- sqlstr or SQL file
SQL statements to execute.
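Examples
A hedged sketch of executing several statements at once (the statements are illustrative):
>>> cc.execute_sql("""
... CREATE TABLE T1 (ID INTEGER, C2 VARCHAR(10));
... INSERT INTO T1 VALUES (1, 'A');
... """)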
- table(table, schema=None, save_source=True, view_params=None)
Returns a DataFrame that represents the specified table.
- Parameters:
- tablestr
The table name.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
- save_sourcebool, optional
If True, save the name of source table. Defaults to True.
- view_paramslist or tuple, optional
Parameters for view.
- Returns:
- DataFrame
The DataFrame that is selecting data from the specified table.
Examples
>>> df1 = cc.table('MY_TABLE')
>>> df2 = cc.table('MY_OTHER_TABLE', schema='MY_SCHEMA')
- upsert_streams_data(table_name, key, data, schema=None)
Upserts data into a SAP HANA table through SQL UPSERT, using HANA LOB streams when the provided data contains bytes.
- Parameters:
- table_name: str
Name of the HANA table into which the data is upserted.
- key: str
The key column name.
- data: dict
The keys of data are column names while the values are the data to upsert. If data contains bytes, the method will use HANA LOB streams method.
- schema: str, optional
The schema name.
Defaults to the current schema.
Examples
>>> with open('image.png', 'rb') as f:
...     img = f.read()
>>> with open('image2.png', 'rb') as f:
...     img2 = f.read()
>>> conn.upsert_streams_data(table_name="LOB_STREAMING_TEST", key="id", data={"id": 1, "img": img, "other": img2})
- update_streams_data(table_name, key, data, schema=None)
Updates data in a SAP HANA table through SQL UPDATE, using HANA LOB streams when the provided data contains bytes.
- Parameters:
- table_name: str
Name of the HANA table in which the data is updated.
- key: str
The key column name.
- data: dict
The keys of data are column names while the values are the data to update. If data contains bytes, the method will use HANA LOB streams method.
- schema: str, optional
The schema name.
Defaults to the current schema.
Examples
>>> with open('image.png', 'rb') as f:
...     img = f.read()
>>> with open('image2.png', 'rb') as f:
...     img2 = f.read()
>>> conn.update_streams_data(table_name="LOB_STREAMING_TEST", key="id", data={"id": 1, "img": img, "other": img2})
- to_sqlalchemy()
Returns a SQLAlchemy engine.
- class hana_ml.dataframe.DataFrame(connection_context, select_statement, _name=None)
Bases:
object
Represents a frame that is backed by a database SQL statement.
- Parameters:
- connection_contextConnectionContext
The connection to the SAP HANA database instance.
- select_statementstr
The SQL query backing the dataframe.
Note
Parameters beyond connection_context and select_statement are intended for internal use. Do not rely on them; they may change without notice.
- property columns
Lists the current DataFrame's column names. Computed lazily and cached. Each access to this property creates a new copy; mutating the list does not alter or corrupt the DataFrame.
- Parameters:
- None
- Returns:
- list of str
A list of column names.
Examples
df is a SAP HANA DataFrame.
>>> df.columns
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target']
- property shape
Computes the shape of the SAP HANA DataFrame.
- Parameters:
- None
- Returns:
- tuple
(The number of rows, the number of columns) in the SAP HANA DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.shape
(1, 3)
- property name
Returns the name of the DataFrame. This value does not correspond to a SAP HANA table name. This value is useful for joining predicates when the joining DataFrames have columns with the same name.
- Parameters:
- None
- Returns:
- str
A str of DataFrame name.
Examples
df is a SAP HANA DataFrame.
>>> df.name
'DT_1'
- property quoted_name
Specifies the escaped name of the original DataFrame. Default-generated DataFrame names are safe to use in SQL without escaping, but names set with DataFrame.alias may require escaping.
- Parameters:
- None
- Returns:
- str
A str of DataFrame name.
Examples
df is a SAP HANA DataFrame.
>>> df.quoted_name
'"DT_1"'
- property description
Returns cur.description from the select_statement query.
- property description_ext
Returns cur.description_ext() from the select_statement query.
- declare_lttab_usage(usage)
Declares whether this DataFrame makes use of local temporary tables.
Some SAP HANA PAL execution routines can execute more efficiently if they know up front whether a DataFrame's SELECT statement requires access to local temporary tables.
- Parameters:
- usagebool
Specifies whether this DataFrame uses local temporary tables.
- disable_validate_columns()
Disable the column validation.
- enable_validate_columns()
Enable the column validation.
- add_id(id_col=None, ref_col=None, starting_point=1)
Returns a new SAP HANA DataFrame with an added <id_col> column.
- Parameters:
- id_colstr, optional
The name of new ID column.
Defaults to "ID".
- ref_colstr or list of str, optional
The id is generated based on ref_col.
- starting_pointint, optional
The starting point of ID.
Defaults to 1.
- Returns:
- DataFrame
A new SAP HANA DataFrame with an added <id_col> column.
Examples
df is a SAP HANA DataFrame.
>>> df.collect()
    X   Y
0  20  30
1  40   5
>>> df.add_id(id_col='ID')
   ID   X   Y
0   1  20  30
1   2  40   5
- add_constant(column_name, value)
Adds a new column with constant value.
- Parameters:
- column_name: str
The name of column to be added.
- value: str
The constant value to be added.
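Examples
A hedged sketch (df is a SAP HANA DataFrame; the column name and value are illustrative):
>>> df.add_constant(column_name='FLAG', value='A').collect()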
- alias(alias)
Returns a new SAP HANA DataFrame with an alias set.
- Parameters:
- aliasstr
The name of the DataFrame.
- Returns:
- DataFrame
A SAP HANA DataFrame with an alias set.
See also
DataFrame.rename_columns
For renaming individual columns.
- count()
Computes the number of rows in the SAP HANA DataFrame.
- Parameters:
- None
- Returns:
- int
The number of rows in the SAP HANA DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.count()
150
- diff(index, periods=1, diff_datetime='days')
Returns a new SAP HANA DataFrame with differenced values.
- Parameters:
- indexint or str
Index of the SAP HANA DataFrame.
- periodsint, optional
Periods to shift for calculating difference, accepts negative values.
Defaults to 1.
- diff_datetime{"years", "months", "days", "seconds", "nano100", "workdays"} or dict, optional
Specifies the difference type. It can also support diff_datetime for particular columns.
Defaults to "days".
- Returns:
- DataFrame
DataFrame with differenced values. No calculation is performed on string columns.
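Examples
A hedged sketch (df is a SAP HANA DataFrame; the 'ID' index column is illustrative):
>>> df.diff(index='ID', periods=1).collect()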
- drop(cols)
Returns a new SAP HANA DataFrame without the specified columns.
- Parameters:
- colslist of str
The list of column names to be dropped.
- Returns:
- DataFrame
A new SAP HANA DataFrame that retains only the columns not listed in cols.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.drop(['B']).collect()
   A
0  1
1  2
- distinct(cols=None)
Returns a new SAP HANA DataFrame with distinct values for the specified columns. If no columns are specified, then the distinct row values from all columns are returned.
- Parameters:
- colsstr or list of str, optional
A column or list of columns to consider when getting distinct values. Defaults to use all columns.
- Returns:
- DataFrame
The DataFrame with distinct values for cols.
Examples
Input:
>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100
Distinct values in a column:
>>> df.distinct("B").collect()
   B
0  A
1  B
Distinct values of a subset of columns:
>>> df.distinct(["A", "B"]).collect()
   A  B
0  1  B
1  2  A
2  1  A
Distinct values of the entire data set:
>>> df.distinct().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
- drop_duplicates(subset=None)
Returns a new SAP HANA DataFrame with duplicate rows removed. All columns in the DataFrame are returned. There is no way to keep specific duplicate rows.
Warning
Specifying a non-None value of subset may produce an unstable DataFrame, the contents of which may be different every time you look at it. Specifically, if two rows are duplicates in their subset columns and have different values in other columns, then a different row could be picked every time you look at the result.
- Parameters:
- subsetlist of str, optional
A list of columns to consider when deciding whether rows are duplicates of each other. Defaults to use all columns.
- Returns:
- DataFrame
A DataFrame with only one copy of duplicate rows.
Examples
Input:
>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100
Drop duplicates based on the values of a subset of columns:
>>> df.drop_duplicates(["A", "B"]).collect()
   A  B    C
0  1  A  100
1  1  B  100
2  2  A  100
Distinct values on the entire data set:
>>> df.drop_duplicates().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
- dropna(how=None, thresh=None, subset=None)
Returns a new DataFrame with NULLs removed.
- Parameters:
- how{'any', 'all'}, optional
If provided, 'any' eliminates rows with any NULLs, and 'all' eliminates rows that are entirely NULLs. If neither how nor thresh is provided, how defaults to 'any'.
- threshint, optional
If provided, rows with fewer than thresh non-NULL values are dropped. You cannot specify both how and thresh.
- subsetlist of str, optional
The columns to consider when looking for NULLs. Values in other columns are ignored, whether they are NULL or not. Defaults to all columns.
- Returns:
- DataFrame
A new SAP HANA DataFrame with a SELECT statement that removes NULLs.
Examples
Dropping rows with any NULL:
>>> df.collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
3  NaN  NaN  NaN
>>> df.dropna().collect()
     A    B    C
0  1.0  3.0  5.0
Dropping rows that are entirely nulls:
>>> df.dropna(how='all').collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
Dropping rows with less than 2 non-null values:
>>> df.dropna(thresh=2).collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
- deselect(cols)
Returns a new DataFrame derived from the current DataFrame, excluding the specified columns.
- Parameters:
- colsstr or tuple/list of str
The columns to exclude from the new DataFrame.
- Returns:
- DataFrame
A new DataFrame object excluding the specified columns.
Examples
Input:
>>> df.collect()
   A  B  C
0  1  2  3
Excluding a subset of existing columns:
>>> df.deselect(['A', 'B']).collect()
   C
0  3
- has_constant_columns()
Returns a sequence of constant columns in the DataFrame.
- drop_constant_columns()
Returns a DataFrame without constant columns.
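Examples
A hedged sketch (df is a SAP HANA DataFrame):
>>> df.has_constant_columns()
>>> df.drop_constant_columns().collect()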
- dtypes(subset=None)
Returns a sequence of tuples describing the DataFrame's SQL types.
The tuples list the name, SQL type name, display_size, internal_size, precision and scale corresponding to the DataFrame's columns.
- Parameters:
- subsetlist of str, optional
The columns that the information is generated from. Defaults to all columns.
- Returns:
- dtypeslist of tuples
Each tuple consists of the name, SQL type name, display_size, internal_size, precision and scale for one of the DataFrame's columns. The list is in the order specified by subset, or in the DataFrame's original column order if a subset is not provided.
- empty()
Returns True if this DataFrame has 0 rows.
- Parameters:
- None
- Returns:
- bool
True if the DataFrame is empty.
Notes
If a DataFrame contains only NULLs, it is not considered empty.
Examples
>>> df1.collect()
Empty DataFrame
Columns: [ACTUAL, PREDICT]
Index: []
>>> df1.empty()
True
>>> df2.collect()
  ACTUAL PREDICT
0   None    None
>>> df2.empty()
False
- filter(condition)
Selects rows that match the given condition.
Very little checking is done on the condition string. Use only with trusted inputs.
- Parameters:
- conditionstr or list
A filter condition. Format as SQL <condition>.
- Returns:
- DataFrame
A DataFrame with rows that match the given condition.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition. May have false positives and false negatives.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.filter('B < 4').collect()
   A  B
0  1  3
- has(col)
Returns True if a column is in the DataFrame.
- Parameters:
- colstr
The name of column to search in the projection list of this DataFrame.
- Returns:
- bool
Returns True if the column exists in the DataFrame's projection list.
Examples
>>> df.columns
['A', 'B']
>>> df.has('A')
True
>>> df.has('C')
False
- head(n=1)
Returns a DataFrame of the first n rows in the current DataFrame.
- Parameters:
- nint, optional
The number of rows returned.
Defaults to 1.
- Returns:
- DataFrame
A new DataFrame of the first n rows of the current DataFrame.
- hasna(cols=None)
Returns True if a DataFrame contains NULLs.
- Parameters:
- colsstr or list of str, optional
A column or list of columns to be checked for NULL values. Defaults to all columns.
- Returns:
- bool
True if this DataFrame contains NULLs.
Examples
>>> df1.collect()
   ACTUAL PREDICT
0     1.0    None
>>> df1.hasna()
True
- fillna(value, subset=None)
Returns a DataFrame with NULLs replaced with a specified value.
- Parameters:
- valueint or float
The value that replaces NULL. value should have a type that is appropriate for the selected columns.
- subsetlist of str, optional
A list of columns whose NULL values will be replaced. Defaults to all columns.
- Returns:
- DataFrame
A new DataFrame with the NULL values replaced.
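Examples
A hedged sketch (df is a SAP HANA DataFrame; the column names are illustrative):
>>> df.fillna(0, subset=['A', 'B']).collect()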
- get_table_structure()
Returns the table structure as a dict of the form {column name: column type}.
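Examples
A hedged sketch (df is a SAP HANA DataFrame; the column names and types are illustrative of the dict format):
>>> df.get_table_structure()
{'A': 'INT', 'B': 'DOUBLE'}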
- join(other, condition=None, how='inner', select=None)
Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.
- Parameters:
- otherDataFrame or list of DataFrame
The DataFrame to join with.
- conditionstr, optional
The join predicate. If an index has been set, the index is used as the join key.
Defaults to None.
- how{'inner', 'left', 'right', 'outer', 'cross'}, optional
The type of join.
Defaults to 'inner'.
- selectlist, optional
If provided, each element specifies a column in the result. A string in the select list should be the name of a column in one of the input DataFrames. An (expression, name) tuple creates a new column with the given name, computed from the given expression.
If this value is not provided, defaults to selecting all columns from both DataFrames, with the left DataFrame's columns first.
- Returns:
- DataFrame
A new DataFrame object made from the join() of the current DataFrame with another DataFrame.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition or in a column expression. May have false positives and false negatives.
Examples
Use the expression selection functionality to disambiguate duplicate column names in a join():
>>> df1.collect()
   A  B    C
0  1  2  3.5
1  2  4  5.6
2  3  3  1.1
>>> df2.collect()
   A  B     D
0  2  1  14.0
1  3  4   5.6
2  4  3   0.0
Old method:
>>> df1.alias('L').join(df2.alias('R'), 'L.A = R.A', select=[
...     ('L.A', 'A'),
...     ('L.B', 'B_LEFT'),
...     ('R.B', 'B_RIGHT'),
...     'C',
...     'D']).collect()
   A  B_LEFT  B_RIGHT    C     D
0  2       4        1  5.6  14.0
1  3       3        4  1.1   5.6
New method:
>>> df1.set_index("A").join(df2.rename_columns({"B": "B2"}).set_index("A")).collect()
   A  B  B2    C    D
0  2  4   1  5.6   14
1  3  3   4  1.1  5.6
- set_name(name)
Sets the name of the DataFrame.
- Parameters:
- namestr
The name of the DataFrame.
- set_index(keys)
Sets the index of the DataFrame.
- Parameters:
- keysstr or list of str
This parameter can be either a single column key or a list of column keys.
- save(where, table_type=None, force=False, save_source=True, append=False, data_lake=False, data_lake_container='SYSRDL#CG', view_structure=None)
Creates a table or view holding the current DataFrame's data.
- Parameters:
- wherestr or (str, str) tuple
The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.
- table_typestr, optional
The type of table to create. The value is case insensitive.
Permanent table options:
"ROW"
"COLUMN"
"HISTORY COLUMN"
Temporary table options:
"GLOBAL TEMPORARY"
"GLOBAL TEMPORARY COLUMN"
"LOCAL TEMPORARY"
"LOCAL TEMPORARY COLUMN"
Not a table:
"VIEW"
Defaults to 'LOCAL TEMPORARY COLUMN' if where starts with '#'. Otherwise, the default is 'COLUMN'.
- forcebool, optional
If force is True, it will replace the existing table.
Defaults to False.
- save_sourcebool, optional
If True, it will save the name of source table.
Defaults to True.
- appendbool, optional
If True, it will use the existing table and append data to it.
Defaults to False.
- data_lakebool, optional
If True, it will save the table to HANA data lake.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- view_structuredict, optional
Define the parameters in the view. Only valid when table_type="VIEW".
Defaults to None.
- Returns:
- DataFrame
A DataFrame that represents the new table or view.
Note
For this operation to succeed, the table name must not be in use, the schema must exist, and the user must have permission to create tables (or views) in the target schema.
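Examples
A hedged sketch (df is a SAP HANA DataFrame; the table names are illustrative). A name starting with '#' creates a local temporary column table:
>>> df.save('#MY_TEMP_TBL')
>>> df.save(('MY_SCHEMA', 'MY_TBL'), table_type='COLUMN', force=True)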
- save_nativedisktable(where, force=False, save_source=True)
Materializes the DataFrame to SAP HANA native disk.
- Parameters:
- wherestr or (str, str) tuple
The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.
- forcebool, optional
If force is True, it will replace the existing table.
- save_sourcebool, optional
If True, it will save the name of source table. Defaults to True.
- Returns:
- DataFrame
A DataFrame that represents the new table.
- split_column(column, separator, new_column_names)
Returns a new DataFrame with split columns.
- Parameters:
- columnstr
The column to be split.
- separatorstr
The separator.
- new_column_nameslist of str
The names of the split columns.
- Returns:
- DataFrame
New DataFrame object with split columns as specified.
Examples
>>> df.collect()
   ID    COL
0   1  1,2,3
1   2  3,4,4
>>> df.split_column(column="COL", separator=",", new_column_names=["COL1", "COL2", "COL3"]).collect()
   ID    COL COL1 COL2 COL3
0   1  1,2,3    1    2    3
1   2  3,4,4    3    4    4
- concat_columns(columns, separator)
Returns a new DataFrame with a concatenated column.
- Parameters:
- columnslist of str
A list of columns to be concatenated.
- separatorstr
The separator.
- Returns:
- DataFrame
New DataFrame object with the concatenated column as specified.
Examples
>>> df.collect()
   ID  COL1  COL2  COL3
0   1     1     2     3
1   2     3     4     4
>>> df.concat_columns(columns=["COL1", "COL2", "COL3"], separator=",").collect()
   ID  COL1  COL2  COL3 COL1,COL2,COL3
0   1     1     2     3          1,2,3
1   2     3     4     4          3,4,4
- nullif(value)
Replaces the specified value with NULL.
- Parameters:
- value: scalar or dict
The value to be replaced by NULL. If the type is dict, its keys indicate the column names.
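Examples
A hedged sketch (df is a SAP HANA DataFrame; the sentinel value and column name are illustrative):
>>> df.nullif(999)
>>> df.nullif({'PDAYS': 999})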
- replace(to_replace=None, value=None, regex=False)
Returns a new DataFrame with replaced value.
- Parameters:
- to_replacenumeric, str or dict, optional
The value/pattern to be replaced. If regex is True, the regex will be used instead of value.
numeric or str: the value equal to to_replace will be replaced by value.
dict:
If value is None, the value equal to a key of to_replace will be replaced by the corresponding dict value. If it is nested JSON, the to-be-replaced columns are restricted by the first-level keys.
If value is numeric or str, the value equal to a value of to_replace will be replaced by value. The to-be-replaced columns are restricted by the keys of to_replace.
If value is dict, the value equal to a value of to_replace will be replaced by the value of value under the same key. The to-be-replaced columns and the replacement values are restricted by the keys of to_replace and value.
None: regex will be used.
- valuenumeric, str or dict, optional
Value to replace with.
numeric or str: the value to replace to_replace, or values matching the pattern if regex is given.
dict: the replacement takes place under the columns equal to the keys.
- regexbool or str, optional
Whether to use regex.
bool: use regex if True.
str: works the same as to_replace if to_replace is None.
- Returns:
- DataFrame
New DataFrame object with replaced values.
Examples
>>> df.collect()
   Aa  Bb Cb
0   0   5  a
1  10   0  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace=0, value=5).collect()
   Aa  Bb Cb
0   5   5  a
1  10   5  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace={0: 10, 1: 100}).collect()
   Aa  Bb Cb
0  10   5  a
1  10  10  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace={'Aa': 0, 'Bb': 5}, value=100).collect()
    Aa   Bb Cb
0  100  100  a
1   10    0  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace(to_replace={'Aa': 0, 'Bb': 5}, value={'Aa': 100, 'Bb': 50}).collect()
    Aa  Bb Cb
0  100  50  a
1   10   0  b
2    2   7  c
3    3   8  d
4    4   9  e
>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
    Aa  Bb Cb
0  100   5  a
1   10   0  b
2    2   7  c
3    3   8  d
4  400   9  e
>>> df2.collect()
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
>>> df2.replace(to_replace=r'^ba.$', value='new', regex=True).collect()
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df2.replace(to_replace={'A': r'^ba.$'}, value={'A': 'new'}, regex=True).collect()
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df2.replace(regex=r'^ba.$', value='new').collect()
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df2.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'}).collect()
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
- sort(cols, desc=False)
Returns a new DataFrame sorted by the specified columns.
- Parameters:
- colsstr or list of str
A column or list of columns to sort by. If a list is specified, then the sort order in parameter desc is used for all columns.
- descbool, optional
Set to True to sort in descending order. Defaults to False, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
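Examples
A hedged sketch (df is a SAP HANA DataFrame; the column name is illustrative):
>>> df.sort(['A'], desc=True).collect()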
- sort_values(by, ascending=True)
Returns a new DataFrame sorted by the specified columns.
- Parameters:
- bystr or list of str
A column or list of columns to sort by. If a list is specified, then the sort order in parameter ascending is used for all columns.
- ascendingbool, optional
Set to False to sort in descending order.
Defaults to True, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
- sort_index(ascending=True)
Returns a new DataFrame sorted by the index.
- Parameters:
- ascendingbool, optional
Set to False to sort in descending order.
Defaults to True, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
- select(*cols)
Returns a new DataFrame with columns derived from the current DataFrame.
Warning
There is no check that inputs interpreted as SQL expressions are actually valid expressions; an "expression" like "A FROM TAB; DROP TABLE IMPORTANT_THINGS; SELECT A" can cause a lot of damage.
- Parameters:
- colsstr or (str, str) tuple.
The columns in the new DataFrame. A string is treated as the name of a column to select; a (str, str) tuple is treated as (SQL expression, alias). As a special case, '*' is expanded to all columns of the original DataFrame.
- Returns:
- DataFrame
A new DataFrame object with the specified columns.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in a column expression. May have false positives and false negatives.
Examples
Input:
>>> df.collect()
   A  B  C
0  1  2  3
Selecting a subset of existing columns:
>>> df.select('A', 'B').collect()
   A  B
0  1  2
Computing a new column based on existing columns:
>>> df.select('*', ('B * 4', 'D')).collect()
   A  B  C  D
0  1  2  3  8
- set_operations(other, all=True, op='UNION')
Combines this DataFrame's rows and another DataFrame's rows into one DataFrame. This operation is equivalent to a SQL UNION/INTERSECT/EXCEPT ALL.
- Parameters:
- otherDataFrame, list of DataFrame
The right side of the operation.
- allbool, optional
If True, keep duplicate rows; equivalent to UNION/INTERSECT/EXCEPT ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the operation); equivalent to a UNION/INTERSECT/EXCEPT or a UNION/INTERSECT/EXCEPT ALL followed by DISTINCT in SQL. Defaults to True.
- Returns:
- DataFrame
The combined data from self and other.
Examples
We have two DataFrames we want to union, with some duplicate rows:
>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4
>>> df1.set_operations(df2, op='UNION').collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4
If we want to use EXCEPT instead of UNION, we can use set_operations:
>>> df1.set_operations(df2, all=False, op='EXCEPT').collect()
   A  B
0  1  2
1  1  2
- union(other, all=True)
Combines this DataFrame's rows and another DataFrame's rows into one DataFrame. This operation is equivalent to a SQL UNION ALL.
- Parameters:
- otherDataFrame, list of DataFrame
The right side of the union.
- allbool, optional
If True, keep duplicate rows; equivalent to UNION ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the union); equivalent to a UNION or a UNION ALL followed by DISTINCT in SQL. Defaults to True.
- Returns:
- DataFrame
The combined data from self and other.
Examples
We have two DataFrames we want to union, with some duplicate rows:
>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4
union() produces a DataFrame that contains all rows of both df1 and df2, like a UNION ALL:
>>> df1.union(df2).collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4
To get the deduplication effect of a UNION DISTINCT, pass all=False or call distinct() after union():
>>> df1.union(df2, all=False).collect()
   A  B
0  1  2
1  2  3
2  3  4
>>> df1.union(df2).distinct().collect()
   A  B
0  1  2
1  2  3
2  3  4
- collect(fetch_size=32000, geometries=True, convert_geo_to_shapely=True)
Copies the current DataFrame to a new Pandas DataFrame.
- Parameters:
- fetch_sizeint, optional
Fetch size in hdbcli.
- geometriesbool, optional
With this flag set to True (default), the geometries are converted to Well-Known-Text representations in the resulting Pandas dataframe. Even if they are converted to Shapely objects (see convert_geo_to_shapely), when you print the dataframe, the geometry columns are represented as Well-Known-Text.
If you need the raw binary values, set this flag to False.
- convert_geo_to_shapelybool, optional
If set to True (default), all geometry columns will be converted to a Shapely object, so that the dataframe can be directly used in visualization libraries for example. If your processing does not support Shapely objects, you can switch this conversion off. In this case the columns remain of type String.
Note
Before the conversion, each column is checked for NULL values. If any are found, that column is not converted to Shapely objects, because NULL WKT strings are not supported by Shapely.
- Returns:
- pandas.DataFrame
A Pandas DataFrame that contains the current DataFrame's data.
Examples
Viewing a hana_ml DataFrame doesn't execute the underlying SQL or fetch the data:
>>> df = cc.table('T')
>>> df
<hana_ml.dataframe.DataFrame object at 0x7f2b7f24ddd8>
Using collect() executes the SQL and fetches the results into a Pandas DataFrame:
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> type(df.collect())
<class 'pandas.core.frame.DataFrame'>
- property geometries: list
Returns the geometries of a data frame. The list is empty if there are none.
- Returns:
- list
List with geometry columns
- property srids: dict
Returns the srid for each geometry column in the dataframe. If none is found, the dictionary will be {}.
For dataframes based on HANA catalog objects, the information is read from the catalog. For dataframes which do not have a catalog object (e.g. those based on SQL statements or temporary tables), the SRID is derived by selecting the first row in the table and reading it directly from the EWKB. For columns with multiple SRSes (SRID NULL), this means you might get back an SRS that differs from other entries in the same column.
Known Limitation: For dataframes which don't have catalog objects and do not contain data, no SRID can be provided.
- Returns:
- dict
Dictionary with the SRID per column: {<column_name>: <srid>} Returns {} when none are found.
- rename_columns(names)
Returns a DataFrame with renamed columns.
- Parameters:
- nameslist or dict
If a list, specifies new names for every column in this DataFrame. If a dict, each dict entry maps an existing name to a new name, and not all columns need to be renamed.
- Returns:
- DataFrame
The same data as the original DataFrame with new column names.
See also
DataFrame.alias
For renaming the DataFrame itself.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.rename_columns(['C', 'D']).collect()
   C  D
0  1  3
1  2  4
>>> df.rename_columns({'B': 'D'}).collect()
   A  D
0  1  3
1  2  4
- auto_cast(type_convert)
Returns a DataFrame with converted column types.
- Parameters:
- type_convertdict
Specifies the original type and the new type as a dict, e.g. {"INT": "DOUBLE"}.
- Returns:
- DataFrame
The same data as this DataFrame, but with columns cast to the specified type.
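Examples
A hedged sketch that casts every INT column to DOUBLE (df is a SAP HANA DataFrame):
>>> df.auto_cast({'INT': 'DOUBLE'})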
- cast(cols, new_type=None)
Returns a DataFrame with columns cast to a new type.
- The name of the column in the returned DataFrame is the same as the original column.
Warning
Type compatibility between the existing column type and the new type is not checked. An incompatibility results in an error.
- Parameters:
- colsstr, list of str or dict
The column(s) to be cast to a different type.
- new_typestr
The database datatype to cast the column(s) to. No checks are performed to see if the new type is valid. An invalid type can lead to SQL errors or even SQL injection vulnerabilities.
- Returns:
- DataFrame
The same data as this DataFrame, but with columns cast to the specified type.
Examples
Input:
>>> df1 = cc.sql('SELECT "AGE", "PDAYS", "HOUSING" FROM DBM_TRAINING_TBL')
>>> df1.dtypes()
[('AGE', 'INT', 10, 10, 10, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
Casting a column to NVARCHAR(20):
>>> df2 = df1.cast('AGE', 'NVARCHAR(20)')
>>> df2.dtypes()
[('AGE', 'NVARCHAR', 20, 20, 20, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
Casting a list of columns to NVARCHAR(50):
>>> df3 = df1.cast(['AGE', 'PDAYS'], 'NVARCHAR(50)')
>>> df3.dtypes()
[('AGE', 'NVARCHAR', 50, 50, 50, 0), ('PDAYS', 'NVARCHAR', 50, 50, 50, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
>>> df4 = df1.cast({'AGE': 'VARCHAR(50)', 'PDAYS': 'INT'})
>>> df4.dtypes()
[('AGE', 'VARCHAR', 50, 50, 50, 0), ('PDAYS', 'INT', 50, 50, 50, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
- tail(n=1, ref_col=None)
Returns a DataFrame of the last n rows in the current DataFrame.
- Parameters:
- nint, optional
The number of rows returned.
Defaults to 1.
- ref_colstr or list of str, optional
The column(s) to sort the DataFrame by before taking the last n rows.
Defaults to None.
- Returns:
- DataFrame
A new DataFrame of the last n rows of the current DataFrame.
- to_head(col)
Returns a DataFrame with specified column as the first item in the projection.
- Parameters:
- colstr
The column to move to the first position.
- Returns:
- DataFrame
The same data as this DataFrame but with the specified column in the first position.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED', 'LABEL']
Moving the column 'LABEL' to head:
>>> df2 = df1.to_head('LABEL')
>>> pprint.pprint(df2.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
- to_tail(col)
Returns a DataFrame with specified column as the last item in the projection.
- Parameters:
- colstr
The column to move to the last position.
- Returns:
- DataFrame
The same data as this DataFrame but with the specified column in the last position.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
Moving the column 'LABEL' to the tail:
>>> df2 = df1.to_tail('LABEL')
>>> pprint.pprint(df2.columns)
['ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED', 'LABEL']
- summary(cols=None, select_stat=None, pivoted=True)
Returns a DataFrame that contains various statistics for the requested column(s).
- Parameters:
- colsstr or list, optional
The column(s) to be described. Defaults to all columns.
- select_statstr or list, optional
The statistics to be selected. "nulls", "unique" and "valid observations" are always computed.
"min": the minimum value
"lower_quartile": the lower quartile
"median": the median value
"upper_quartile": the upper quartile
"max": the max value
"mean": the average
"lower_mean_ci" : the lower bound of the mean value
"upper_mean_ci": the upper bound of the mean value
"trimmed_mean": the average of data after taking out 5% of the largest elements and the same amount of the smallest elements
"variance": the variance
"sd": the standard deviation
"skewness": the measure of the asymmetry of a distribution
"kurtosis": the measure of the tailedness of a distribution
- pivotedbool, optional
Whether the output format is pivoted.
Defaults to True.
- Returns:
- DataFrame
A DataFrame that contains (selected) statistics for the specified column(s).
- property stats
The statistics of the dataframe.
- describe(cols=None, version='v1')
Returns a DataFrame that contains various statistics for the requested column(s).
- Parameters:
- colsstr or list, optional
The column(s) to be described. Defaults to all columns.
- version{'v1', 'v2'}, optional
Version v2 will use PAL instead of dynamic SQL.
Defaults to 'v1'.
- Returns:
- DataFrame
A DataFrame that contains statistics for the specified column(s) in the current DataFrame.
The statistics included are:
the count of rows ("count"),
the number of distinct values ("unique"),
the number of nulls ("nulls"),
the average ("mean"),
the standard deviation ("std"),
the median ("median"),
the minimum value ("min"),
the maximum value ("max"),
the 25% percentile when treated as continuous variable ("25_percent_cont"),
the 25% percentile when treated as discrete variable ("25_percent_disc"),
the 50% percentile when treated as continuous variable ("50_percent_cont"),
the 50% percentile when treated as discrete variable ("50_percent_disc"),
the 75% percentile when treated as continuous variable ("75_percent_cont"),
the 75% percentile when treated as discrete variable ("75_percent_disc").
For columns that are strings, statistics such as average ("mean"), standard deviation ("std"), median ("median"), and the various percentiles are NULLs.
If the list of columns contain both string and numeric data types, minimum and maximum values become NULLs.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df2.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
Describe a few numeric columns and collect them to return a Pandas DataFrame:
>>> df1.describe(['AGE', 'PDAYS']).collect()
  column  count  unique  nulls        mean         std  min  max  median
0    AGE  16895      78      0   40.051376   10.716907   17   98      38
1  PDAYS  16895      24      0  944.406688  226.331944    0  999     999
   25_percent_cont  25_percent_disc  50_percent_cont  50_percent_disc
0             32.0               32             38.0               38
1            999.0              999            999.0              999
   75_percent_cont  75_percent_disc
0             47.0               47
1            999.0              999
Describe some non-numeric columns and collect them to return a Pandas DataFrame:
>>> df1.describe(['JOB', 'MARITAL']).collect()
    column  count  unique  nulls  mean   std       min      max median
0      JOB  16895      12      0  None  None    admin.  unknown   None
1  MARITAL  16895       4      0  None  None  divorced  unknown   None
  25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc
0            None            None            None            None
1            None            None            None            None
  75_percent_cont 75_percent_disc
0            None            None
1            None            None
Describe all columns in a DataFrame:
>>> df1.describe().collect()
            column  count  unique  nulls          mean           std
0               ID  16895   16895      0  21282.286652  12209.759725
1              AGE  16895      78      0     40.051376     10.716907
2         DURATION  16895    1267      0    263.965670    264.331384
3         CAMPAIGN  16895      35      0      2.344658      2.428449
4            PDAYS  16895      24      0    944.406688    226.331944
5         PREVIOUS  16895       7      0      0.209529      0.539450
6     EMP_VAR_RATE  16895      10      0     -0.038798      1.621945
7   CONS_PRICE_IDX  16895      26      0     93.538844      0.579189
8    CONS_CONF_IDX  16895      26      0    -40.334123      4.865720
9        EURIBOR3M  16895     283      0      3.499297      1.777986
10      NREMPLOYED  16895      11      0   5160.371885     75.320580
11             JOB  16895      12      0           NaN           NaN
12         MARITAL  16895       4      0           NaN           NaN
13       EDUCATION  16895       8      0           NaN           NaN
14     DBM_DEFAULT  16895       2      0           NaN           NaN
15         HOUSING  16895       3      0           NaN           NaN
16            LOAN  16895       3      0           NaN           NaN
17         CONTACT  16895       2      0           NaN           NaN
18       DBM_MONTH  16895      10      0           NaN           NaN
19     DAY_OF_WEEK  16895       5      0           NaN           NaN
20        POUTCOME  16895       3      0           NaN           NaN
21           LABEL  16895       2      0           NaN           NaN
         min        max     median  25_percent_cont  25_percent_disc
0      5.000  41187.000  21786.000        10583.500        10583.000
1     17.000     98.000     38.000           32.000           32.000
2      0.000   4918.000    184.000          107.000          107.000
3      1.000     43.000      2.000            1.000            1.000
4      0.000    999.000    999.000          999.000          999.000
5      0.000      6.000      0.000            0.000            0.000
6     -3.400      1.400      1.100           -1.800           -1.800
7     92.201     94.767     93.444           93.075           93.075
8    -50.800    -26.900    -41.800          -42.700          -42.700
9      0.634      5.045      4.856            1.313            1.313
10  4963.000   5228.000   5191.000         5099.000         5099.000
11       NaN        NaN        NaN              NaN              NaN
12       NaN        NaN        NaN              NaN              NaN
13       NaN        NaN        NaN              NaN              NaN
14       NaN        NaN        NaN              NaN              NaN
15       NaN        NaN        NaN              NaN              NaN
16       NaN        NaN        NaN              NaN              NaN
17       NaN        NaN        NaN              NaN              NaN
18       NaN        NaN        NaN              NaN              NaN
19       NaN        NaN        NaN              NaN              NaN
20       NaN        NaN        NaN              NaN              NaN
21       NaN        NaN        NaN              NaN              NaN
    50_percent_cont  50_percent_disc  75_percent_cont  75_percent_disc
0         21786.000        21786.000        32067.500        32068.000
1            38.000           38.000           47.000           47.000
2           184.000          184.000          324.000          324.000
3             2.000            2.000            3.000            3.000
4           999.000          999.000          999.000          999.000
5             0.000            0.000            0.000            0.000
6             1.100            1.100            1.400            1.400
7            93.444           93.444           93.994           93.994
8           -41.800          -41.800          -36.400          -36.400
9             4.856            4.856            4.961            4.961
10         5191.000         5191.000         5228.000         5228.000
11              NaN              NaN              NaN              NaN
12              NaN              NaN              NaN              NaN
13              NaN              NaN              NaN              NaN
14              NaN              NaN              NaN              NaN
15              NaN              NaN              NaN              NaN
16              NaN              NaN              NaN              NaN
17              NaN              NaN              NaN              NaN
18              NaN              NaN              NaN              NaN
19              NaN              NaN              NaN              NaN
20              NaN              NaN              NaN              NaN
21              NaN              NaN              NaN              NaN
- bin(col, strategy='uniform_number', bins=None, bin_width=None, bin_column='BIN_NUMBER')
Returns a DataFrame with the original columns as well as bin assignments.
The names of the columns in the returned DataFrame are the same as in the original. A column "BIN_NUMBER", or the value specified in bin_column, is added and corresponds to the assigned bin.
- Parameters:
- colstr
The column on which binning is performed. The column must be numeric.
- strategy{'uniform_number', 'uniform_size'}, optional
Binning methods:
'uniform_number': Equal widths based on the number of bins.
'uniform_size': Equal widths based on the bin size.
Default value is 'uniform_number'.
- binsint, optional
The number of equal-width bins. Only valid when strategy is 'uniform_number'.
Defaults to 10.
- bin_widthint, optional
The interval width of each bin. Only valid when strategy is 'uniform_size'.
- bin_columnstr, optional
The name of the output column that contains the bin number.
- Returns:
- DataFrame
A binned dataset with the same data as this DataFrame, plus an additional column "BIN_NUMBER" (or the value specified in bin_column). This additional column contains the assigned bin for each row.
Examples
Input:
>>> df.collect()
   C1   C2    C3       C4
0   1  1.2   2.0      1.0
1   2  1.4   4.0      3.0
2   3  1.6   6.0      9.0
3   4  1.8   8.0     27.0
4   5  2.0  10.0     81.0
5   6  2.2  12.0    243.0
6   7  2.4  14.0    729.0
7   8  2.6  16.0   2187.0
8   9  2.8  18.0   6561.0
9  10  3.0  20.0  19683.0
Create five bins of equal widths on C1:
>>> df.bin('C1', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5
Create five bins of equal widths on C3:
>>> df.bin('C3', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5
Create five bins of equal widths on a column that varies significantly:
>>> df.bin('C4', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           1
3   4  1.8   8.0     27.0           1
4   5  2.0  10.0     81.0           1
5   6  2.2  12.0    243.0           1
6   7  2.4  14.0    729.0           1
7   8  2.6  16.0   2187.0           1
8   9  2.8  18.0   6561.0           2
9  10  3.0  20.0  19683.0           5
Create bins of equal width:
>>> df.bin('C1', strategy='uniform_size', bin_width=3).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           2
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           3
7   8  2.6  16.0   2187.0           3
8   9  2.8  18.0   6561.0           4
9  10  3.0  20.0  19683.0           4
- agg(agg_list, group_by=None)
Returns a SAP HANA DataFrame with the group_by column along with the aggregates. This method supports all aggregation functions in the SAP HANA database instance, such as 'max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var'. The name of the column in the returned DataFrame is the same as the original column.
Aggregation functions can be referred to SAP HANA aggregate functions.
- Parameters:
- agg_listA list of tuples
A list of tuples. Each tuple is a triplet. The triplet consists of (aggregate_operator, expression, name) where:
aggregate_operator is one of ['max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var', ...]. The operator name is identical to SAP HANA SQL naming, and all aggregation functions in the SAP HANA database instance are supported.
expression is a str that is a column or column expression
name is the name of this aggregate in the project list.
- group_bystr or list of str, optional
The group by column. Only a column is allowed although expressions are allowed in SQL. To group by an expression, create a DataFrame by providing the entire SQL. So, if you have a table T with columns C1, C2, and C3 that are all integers, to calculate the max(C1) grouped by (C2+C3) a DataFrame would need to be created as below:
>>> cc.sql('SELECT "C2"+"C3", max("C1") FROM "T" GROUP BY "C2"+"C3"')
- Returns:
- DataFrame
A DataFrame containing the group_by column (if it exists), as well as the aggregate expressions that are aliased with the specified names.
Examples
Input:
>>> df.collect()
    ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM          SPECIES
0    1            5.1           3.5            1.4           0.2      Iris-setosa
1    2            4.9           3.0            1.4           0.2      Iris-setosa
2    3            4.7           3.2            1.3           0.2      Iris-setosa
3   51            7.0           3.2            4.7           1.4  Iris-versicolor
4   52            6.4           3.2            4.5           1.5  Iris-versicolor
5  101            6.3           3.3            6.0           2.5   Iris-virginica
6  102            5.8           2.7            5.1           1.9   Iris-virginica
7  103            7.1           3.0            5.9           2.1   Iris-virginica
8  104            6.3           2.9            5.6           1.8   Iris-virginica
Another way to do a count:
>>> df.agg([('count', 'SPECIES', 'COUNT')]).collect()
   COUNT
0      9
Get counts by SPECIES:
>>> df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES').collect()
           SPECIES  COUNT
0  Iris-versicolor      2
1   Iris-virginica      4
2      Iris-setosa      3
Get max values of SEPALLENGTHCM by SPECIES:
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH')], group_by='SPECIES').collect()
           SPECIES  MAX_SEPAL_LENGTH
0  Iris-versicolor               7.0
1   Iris-virginica               7.1
2      Iris-setosa               5.1
Get max and min values of SEPALLENGTHCM by SPECIES:
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH'),
...         ('min', 'SEPALLENGTHCM', 'MIN_SEPAL_LENGTH')], group_by=['SPECIES']).collect()
           SPECIES  MAX_SEPAL_LENGTH  MIN_SEPAL_LENGTH
0  Iris-versicolor               7.0               6.4
1   Iris-virginica               7.1               5.8
2      Iris-setosa               5.1               4.7
Get aggregate grouping by multiple columns:
>>> df.agg([('count', 'SEPALLENGTHCM', 'COUNT_SEPAL_LENGTH')], group_by=['SPECIES', 'PETALLENGTHCM']).collect()
           SPECIES  PETALLENGTHCM  COUNT_SEPAL_LENGTH
0   Iris-virginica            6.0                   1
1      Iris-setosa            1.3                   1
2   Iris-virginica            5.9                   1
3   Iris-virginica            5.6                   1
4      Iris-setosa            1.4                   2
5  Iris-versicolor            4.7                   1
6  Iris-versicolor            4.5                   1
7   Iris-virginica            5.1                   1
- is_numeric(cols=None)
Returns True if the column(s) in the DataFrame are numeric.
- Parameters:
- colsstr or list, optional
The column(s) to be tested for being numeric.
Defaults to all columns.
- Returns:
- bool
True if all the columns are numeric.
Examples
Input:
>>> df.head(5).collect()
   ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM      SPECIES
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
>>> pprint.pprint(df.dtypes())
[('ID', 'INT', 10, 10, 10, 0),
 ('SEPALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SEPALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SPECIES', 'NVARCHAR', 15, 15, 15, 0)]
Test a single column:
>>> df.is_numeric('ID')
True
>>> df.is_numeric('SEPALLENGTHCM')
True
>>> df.is_numeric(['SPECIES'])
False
Test a list of columns:
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
True
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'SPECIES'])
False
- corr(first_col, second_col)
Returns a DataFrame that gives the correlation coefficient between two numeric columns.
All rows with NULL values for first_col or second_col are removed prior to calculating the correlation coefficient.
Let col1 be the values of first_col and col2 be the values of second_col; then the correlation coefficient is:
1/(n-1) * sum((col1 - avg(col1)) * (col2 - avg(col2))) / (stddev(col1) * stddev(col2))
- Parameters:
- first_colstr
The first column for calculating the correlation coefficient.
- second_colstr
The second column for calculating the correlation coefficient.
- Returns:
- DataFrame
A DataFrame with one value that contains the correlation coefficient. The name of the column is CORR_COEFF.
Examples
Input:
>>> df.columns
['C1', 'C2', 'C3', 'C4']
>>> df.collect()
   C1   C2      C3       C4
0   1  1.2     2.0      1.0
1   2  1.4     4.0      3.0
2   3  1.6     8.0      9.0
3   4  1.8    16.0     27.0
4   5  2.0    32.0     81.0
5   6  2.2    64.0    243.0
6   7  2.4   128.0    729.0
7   8  2.6   256.0   2187.0
8   9  2.8   512.0   6561.0
9  10  3.0  1024.0  19683.0
Correlation with columns that are well correlated:
>>> df.corr('C1', 'C2').collect()
   CORR_COEFF
0         1.0
>>> df.corr('C1', 'C3').collect()
   CORR_COEFF
0         1.0
Correlation with a column whose value is three times its previous value:
>>> df.corr('C1', 'C4').collect()
   CORR_COEFF
0    0.696325
- min()
Gets the minimum value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- max()
Gets the maximum value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- sum()
Gets the sum of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- median()
Gets the median value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- mean()
Gets the mean value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- stddev()
Gets the standard deviation of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
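Examples
A minimal usage sketch for these convenience aggregates (assuming df is the Iris DataFrame used above; a single selected column yields a scalar, several columns yield a Series):
>>> df.select('SEPALLENGTHCM').max()
>>> df.select('SEPALLENGTHCM', 'SEPALWIDTHCM').mean()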
- value_counts(subset=None)
Gets the value counts of the columns. It simplifies the use of the agg function.
- Parameters:
- subsetlist, optional
Columns to use when counting unique combinations.
- Returns:
- DataFrame
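Examples
A minimal sketch (assuming df is the Iris DataFrame used above):
>>> df.value_counts(subset=['SPECIES']).collect()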
- pivot_table(values, index, columns, aggfunc='avg')
Returns a DataFrame that gives the pivoted table. aggfunc is identical to SAP HANA aggregate functions.
- Parameters:
- valuesstr or list of str
The targeted values for pivoting.
- indexstr or list of str
The index of the DataFrame.
- columnsstr or list of str
The pivoting columns.
- aggfunc{'avg', 'max', 'min', ...}, optional
aggfunc is identical to SAP HANA aggregate functions.
Defaults to 'avg'.
- Returns:
- DataFrame
A pivoted DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.pivot_table(values='C2', index='C1', columns='C3', aggfunc='max')
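In this sketch, the result should contain one row per distinct C1 value and one column per distinct C3 value, each cell holding the maximum C2 within that group (the exact output depends on the data).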
- generate_table_type()
Generates a SAP HANA table type based on the dtypes function of the DataFrame. This is a convenience method for SQL tracing.
- Parameters:
- None
- Returns:
- str
The table type in string form.
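Examples
A minimal sketch (assuming df is the Iris DataFrame used above; the exact string depends on the DataFrame's dtypes):
>>> df.generate_table_type()
The result is a table-type string of a form such as 'TABLE ("ID" INT, "SEPALLENGTHCM" DOUBLE, ...)'.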
- rearrange(key=None, features=None, label=None, type_ts=False, for_predict=False)
Utility function to generate a new dataframe with columns ordered as [key, features, label] for a non-time-series dataset and as [key, label, features] for a time-series dataset.
- Parameters:
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column if type_ts is False, and to the name of the first non-ID column if type_ts is True.
- type_tsbool, optional
Specifies whether or not the input DataFrame is time-series data.
Defaults to False.
- for_predictbool, optional
Specifies whether data is for prediction, in which case label should not be provided.
Defaults to False.
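Examples
A minimal sketch with hypothetical column names (assuming df has columns ID, F1, F2 and LABEL):
>>> df.rearrange(key='ID', label='LABEL')
>>> df.rearrange(key='ID', label='LABEL', type_ts=True)
Per the rules above, the first call should order the columns as [ID, F1, F2, LABEL], the second as [ID, LABEL, F1, F2].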
- set_source_table(table, schema=None)
Specifies the source table for the current dataframe.
- Parameters:
- tablestr
The table name.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
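Examples
A minimal sketch (hypothetical table and schema names):
>>> df.set_source_table('MY_TABLE', schema='MY_SCHEMA')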
- to_pickle(path, compression='infer', protocol=4)
Pickle object to file.
- Parameters:
- pathstr
File path where the pickled object will be stored.
- compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}
A string representing the compression to use in the output file. By default, the compression is inferred from the file extension in the specified path.
Defaults to 'infer'.
- protocolint
Indicates which protocol the pickler should use. Defaults to 4.
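Examples
A minimal sketch (hypothetical file path; gzip compression is inferred from the '.gz' suffix):
>>> df.to_pickle('iris_df.pkl.gz')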
- to_datetime(cols)
Converts target columns to the specified date format. Default format is "YYYY-MM-DD HH24:MI:SS".
- Parameters:
- colsstr, list of str or dict
If cols is str or list of str, the default format will be used. Otherwise, the specified format should be provided in dict. E.g. cols={"DATETIME": "MM/DD/YYYY HH24:MI:SS"}
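Examples
A minimal sketch (assuming a hypothetical string column PURCHASE_DATE whose values match the given format):
>>> df.to_datetime({'PURCHASE_DATE': 'MM/DD/YYYY HH24:MI:SS'})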
- generate_feature(targets, group_by=None, agg_func=None, trans_func=None, order_by=None, trans_param=None, rolling_window=None, second_targets=None)
Adds additional features to the existing dataframe using agg_func and trans_func.
- Parameters:
- targetsstr or list of str
The column(s) in data to be feature engineered.
- group_bystr, optional
The column in data for group by when performing agg_func.
- agg_funcstr, optional
HANA aggregation operations, e.g. SUM, COUNT, MIN, MAX, ...
- trans_funcstr, optional
HANA transformation operations. MONTH, YEAR, LAG, ...
A special transformation is GEOHASH_HIERARCHY. This creates features based on a GeoHash. The default length of 20 for the hash can be influenced by the respective trans parameters. Providing, for example, range(3, 11), the operation adds 8 features with GeoHash lengths between 3 and 10.
- order_bystr, optional
The LEAD and LAG functions require an OVER (ORDER BY ...) window specification, supplied via order_by.
- trans_paramlist, optional
Parameters for transformation operations corresponding to targets.
- rolling_windowint, optional
Window size for rolling function. If negative, it will use the points before CURRENT ROW.
- second_targetsstr or list of str, optional
The second column(s) in data to be feature engineered together with targets by binary aggregation functions such as CORR.
- Returns:
- DataFrame
SAP HANA DataFrame with new features.
Examples
>>> df.head(5).collect() TIME TEMPERATURE HUMIDITY OXYGEN CO2 0 2021-01-01 12:00:00 19.972199 29.271170 23.154523 504.806395 1 2021-01-01 12:00:10 19.910014 27.931855 23.009835 507.515937 2 2021-01-01 12:00:20 19.834676 26.051309 22.756407 510.111974 3 2021-01-01 12:00:30 19.952517 26.007655 22.737376 516.993696 4 2021-01-01 12:00:40 20.163497 26.056979 22.469276 528.337481 >>> df.generate_feature(targets=["TEMPERATURE", "HUMIDITY", "OXYGEN", "CO2"], trans_func="LAG", order_by="TIME", trans_param=[range(1, 7), range(1, 5), range(1, 5), range(1,7)]).dropna().deselect("TIME").head(2).collect() TEMPERATURE HUMIDITY OXYGEN CO2 LAG(TEMPERATURE, 1) ... LAG(CO2, 4) 0 20.978001 26.187823 21.982030 522.731895 20.701740 ... 510.111974 1 21.234148 25.703989 21.804864 528.066402 20.978001 ... 516.993696
- mutate(**kwargs)
Adds new columns or mutates existing columns in the dataframe using column expressions.
Examples
>>> data.mutate(new_col=data.ID+2, data=data.data*3)
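In this example, a new column new_col is created as ID + 2, while the existing column data is overwritten with data * 3 (column names taken from the hypothetical data DataFrame).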
- hana_ml.dataframe.read_pickle(connection_context, path, table_name, compression='infer', schema=None, force=False, replace=True, object_type_as_bin=False, table_structure=None, **kwargs)
Loads a pickled DataFrame object from file.
- Parameters:
- connection_contextConnectionContext
A connection to a SAP HANA database instance.
- pathstr
File path where the pickled object will be loaded.
- table_namestr
The table name in the SAP HANA database.
- compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, optional
For on-the-fly decompression of on-disk data.
If 'infer', then use gzip, bz2, xz or zip if path ends in '.gz', '.bz2', '.xz', or '.zip' respectively, and no decompression otherwise.
Set to None for no decompression.
Defaults to 'infer'.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is dropped.
Defaults to False.
- replacebool, optional
If replace is True, missing values are handled when the data is written to the SAP HANA table.
Defaults to True.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict, optional
Manually define column types based on SAP HANA DB table structure.
Defaults to None.
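Examples
A minimal round-trip sketch complementing DataFrame.to_pickle above (hypothetical path and table name; cc is a ConnectionContext):
>>> df = read_pickle(cc, 'iris_df.pkl.gz', 'IRIS_PICKLE_TBL', force=True)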
- hana_ml.dataframe.create_dataframe_from_pandas(connection_context, pandas_df, table_name, schema=None, force=False, replace=False, object_type_as_bin=False, table_structure=None, drop_exist_tab=True, allow_bigint=False, geo_cols: list = None, srid: int = 4326, primary_key: str = None, not_nulls: list = None, chunk_size=50000, disable_progressbar=False, upsert=False, append=False)
Uploads data from a Pandas DataFrame to a SAP HANA database and returns an SAP HANA DataFrame.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- pandas_dfpandas.DataFrame
A Pandas DataFrame for uploading to the SAP HANA database. This can also be a GeoPandas dataframe, which will automatically be converted to a Pandas DataFrame.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is truncated or dropped.
Defaults to False.
- replacebool, optional
If replace is True, NULL will be replaced by 0.
Defaults to False.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict
Manually define column types based on SAP HANA DB table structure.
- drop_exist_tabbool, optional
If force is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.
Defaults to True.
- allow_bigintbool, optional
allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.
Defaults to False.
- geo_colslist, optional but required for spatial functions
Specifies the columns of the dataframe that are treated as geometries. List elements can be either strings or tuples.
Strings represent columns which contain geometries in (E)WKT format. If the provided DataFrame is a GeoPandas DataFrame, you do not need to add the geometry column to the geo_cols. It will be detected and added automatically.
The column name in the HANA Table will be <column_name>_GEO
Tuples must consist of two strings: (<longitude column>, <latitude column>)
longitude column: Dataframe column, that contains the longitude values
latitude column: Dataframe column, that contains the latitude values
They are combined into a POINT(<longitude> <latitude>) geometry.
The column name in the HANA Table will be <longitude>_<latitude>_GEO
Defaults to None.
- sridint, optional but required for spatial functions
Spatial reference system id. If the SRS is not created yet, the system tries to create it automatically.
Defaults to 4326.
- primary_keystr, optional but required for Graph functions
Name of the column in a node table which contains the unique identification of the node and corresponds with the edge table.
Defaults to None.
- not_nullslist, optional but required for Graph functions
Contains all column names which should get a not null constraint. This is primarily for creating node and edge tables for a graph workspace.
Defaults to None.
- chunk_sizeint, optional
Specify the chunk size for upload.
Defaults to 50000.
- disable_progressbarbool, optional
Disable the progress bar.
Defaults to False.
- upsertbool, optional
Enable upsert with primary key if True.
Defaults to False.
- appendbool, optional
Enable append if True.
Defaults to False.
- Returns:
- DataFrame
A SAP HANA DataFrame that contains the data in the pandas_df.
Examples
>>> create_dataframe_from_pandas(connection_context,p_df,'test',force=False,replace=True,drop_exist_tab=False) <hana_ml.dataframe.DataFrame at 0x7efbcb26fbe0>
>>> create_dataframe_from_pandas( connection_context, p_df, "geo_table", force=False, replace=True, drop_exist_tab=False, geo_cols=["geometry", ("long", "lat")], ) <hana_ml.dataframe.DataFrame at 0x5eabcb27fbe0>
- hana_ml.dataframe.create_dataframe_from_spark(connection_context, spark_df, table_name, schema=None, force=False, object_type_as_bin=False, table_structure=None, drop_exist_tab=True, allow_bigint=False, primary_key: str = None, not_nulls=None, chunk_size=50000, disable_progressbar=False, upsert=False, append=False)
Uploads data from a Spark DataFrame to a SAP HANA database and returns an SAP HANA DataFrame.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- spark_dfpyspark.sql.DataFrame
A Spark DataFrame for uploading to the SAP HANA database.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is truncated or dropped.
Defaults to False.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict
Manually define column types based on SAP HANA DB table structure.
- drop_exist_tabbool, optional
If force is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.
Defaults to True.
- allow_bigintbool, optional
allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.
Defaults to False.
- primary_keystr, optional
Name of the column in a node table which contains the unique identification of the node and corresponds with the edge table.
Defaults to None.
- not_nullslist, optional
Contains all column names which should get a not null constraint.
Defaults to None.
- chunk_sizeint, optional
Specify the chunk size for upload.
Defaults to 50000.
- disable_progressbarbool, optional
Disable the progress bar.
Defaults to False.
- upsertbool, optional
Enable upsert with primary key if True.
Defaults to False.
- appendbool, optional
Enable append if True.
Defaults to False.
- Returns:
- DataFrame
A SAP HANA DataFrame that contains the data in the Spark DataFrame.
Examples
>>> create_dataframe_from_spark(connection_context, spark_df, 'test', force=False, drop_exist_tab=False)
- hana_ml.dataframe.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name=None)
Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.
- Parameters:
- frameDataFrame
A SAP HANA DataFrame.
- id_varsstr, tuple or list, optional
Column(s) to use as identifier variables.
Defaults to None.
- value_varstuple or list, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_namescalar, optional
Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.
- value_namescalar, optional
Name to use for the 'value' column. Defaults to 'value'.
- Returns:
- DataFrame
Unpivoted DataFrame.
Examples
>>> data.collect() A B C 0 a 1 2 1 b 3 4 2 c 5 6
>>> melt(data, id_vars=['A'], value_vars=['B', 'C']).collect() A variable value 0 a B 1 1 a C 2 2 b B 3 3 b C 4 4 c B 5 5 c C 6
- hana_ml.dataframe.create_dataframe_from_shapefile(connection_context: ConnectionContext, shp_file: str, srid: int, table_name: str, schema: str = None) DataFrame
Converts a shapefile into a DataFrame so that it is backed in SAP HANA. Expects that the shapefile is a zip and/or has both .shp and .dbf parts with the same base name, from which the single table is created. The table is first created temporarily and locally; the shapefile is then imported into the target table through direct insertion, and the resulting SAP HANA DataFrame, which can be visualized, is returned.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- shp_filestr
Path to a zipfile, shapefile or dbf-file. The filename suffix is dropped, and both the .dbf and .shp parts are loaded by that base name. Allowed suffixes are: .zip, .shp, .shx, and .dbf.
- sridint
The spatial reference system id that applies to the geometry columns. If the SRS is not created yet, the system tries to create it automatically.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- Returns:
- DataFrame
A SAP HANA DataFrame with geometric columns that contains the data from the shp_file.
Examples
>>> cc = connection_context >>> shapefile_path = os.path.join(os.getcwd(), 'myshape.shp') >>> hana_df = create_dataframe_from_shapefile( ... connection_context=cc, ... shp_file=shapefile_path, ... srid=4326, ... table_name="myshape_tbl")
- hana_ml.dataframe.import_csv_from(connection_context, directory, table, schema=None, threads=1, record_delimiter=None, field_delimiter=None, escape_character=None, column_list_first_row=None, credential=None)
Imports a csv file into the SAP HANA system. More details are given in the <storage_path> section of the IMPORT FROM Statement (Data Import Export) chapter of the SAP HANA Cloud, SAP HANA Database SQL Reference Guide.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- directorystr
Specifies the cloud storage location for the import. The locations SAP HANA Cloud supports are Azure, Amazon (AWS), Google Cloud, and SAP HANA Cloud, Data Lake Files (HDLFS).
- tablestr
Specifies the name of the target table.
- schemastr, optional
Specifies the schema name of the target table.
Defaults to None.
- threadsint, optional
Specifies the number of threads that can be used for concurrent import. threads and batch provide high loading performance by enabling parallel loading and by committing many records at once. In general, for column tables, a good setting is 10 parallel loading threads with a commit frequency of 10,000 records or greater.
Defaults to 1; the maximum allowed value is 256.
- record_delimiterstr, optional
Specifies the record delimiters used in the CSV file.
Defaults to None.
- field_delimiterstr, optional
Specifies the field delimiters used in the CSV file.
Defaults to None.
- escape_characterstr, optional
Specifies the escape character used in the import data.
Defaults to None.
- column_list_first_rowbool, optional
Indicates that the column list is stored in the first row.
Defaults to None.
- credentialstr, optional
Specifies the name of the credential defined in the CREATE CREDENTIAL statement. Since the credentials are defined within the credential object, they no longer appear as plain text as part of import statements. The WITH CREDENTIAL clause cannot be specified when directory contains credentials. The WITH CREDENTIAL clause is required for imports to SAP HANA Cloud, Data Lake Files, but is optional for all other cloud platforms.
Defaults to None.
Examples
Assume we have a connection to a SAP HANA instance called conn and we want to import test.csv into a table called Test:
>>> import_csv_from(connection_context=conn, directory='hdlfs://XXXXXXXX.com/test.csv', table="Test", threads=10, column_list_first_row=True, credential='XXXCredential')