hana_ml.dataframe

This module represents a database query as a dataframe. Most operations are designed not to bring data back from the database unless explicitly requested.

The following classes and functions are available:

hana_ml.dataframe.quotename(name)

Escapes a schema, table, or column name for use in SQL. hana_ml functions and methods that take schema, table, or column names already escape their input by default, but those that take SQL don't (and can't) perform escaping automatically.

Parameters
namestr

The schema, table, or column name.

Returns
str

The escaped name. The string is surrounded in quotation marks, and existing quotation marks are escaped by doubling them.
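The escaping rule described above can be sketched in plain Python. This is a simplified stand-in for the real function (the name quotename_sketch is ours), shown only to make the quote-doubling rule concrete:

```python
def quotename_sketch(name):
    # Double any embedded quotation marks, then wrap the whole name
    # in quotation marks -- exactly the rule described above.
    return '"' + name.replace('"', '""') + '"'

print(quotename_sketch('MY_TABLE'))      # "MY_TABLE"
print(quotename_sketch('my "odd" tab'))  # "my ""odd"" tab"
```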

class hana_ml.dataframe.ConnectionContext(address='', port=0, user='', password=None, autocommit=True, packetsize=None, userkey=None, spatialtypes=1, encrypt=None, sslValidateCertificate=None, pyodbc_connection=None, **properties)

Bases: object

Represents a connection to the SAP HANA system.

ConnectionContext includes methods for creating DataFrames from data in SAP HANA. DataFrames are tied to a ConnectionContext, and are unusable once their ConnectionContext is closed.

Parameters
Same as hdbcli.dbapi.connect.

Please see the online docs for hdbcli.dbapi.connect for more details.

Examples

Querying data from SAP HANA into a Pandas DataFrame:

>>> with ConnectionContext('address', port, 'user', 'password') as cc:
...     df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
...             .filter('COL3 > 5')
...             .select('COL1', 'COL2'))
...     pandas_df = df.collect()

The underlying hdbcli.dbapi.connect can be accessed if necessary:

>>> with ConnectionContext('127.0.0.1', 30215, 'MLGUY', 'manager') as cc:
...     cc.connection.setclientinfo('SOMEKEY', 'somevalue')
...     df = cc.sql('some sql that needs that session variable')
...     ...
Attributes
connectionhdbcli.dbapi.connect

The underlying dbapi connection. Use this connection to run SQL directly, or to access connection methods like getclientinfo/setclientinfo.

close()

Closes the existing connection to the SAP HANA system.

Parameters
None
Returns
None

Examples

cc is a connection to a SAP HANA system.

>>> cc.close()
add_primary_key(table, columns, schema=None)

Adds a primary key to an existing table.

Parameters
tablestr

Table name.

columnsstr or list of str

The column(s) that constitute the primary key.

schemastr, optional

Schema name. If None, use the current schema.
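A rough sketch of the ALTER TABLE statement such a call presumably issues (the helper name add_primary_key_sql is ours, and the exact SQL hana_ml generates may differ; identifier quoting follows the quotename rule):

```python
def _quote(name):
    # Same quoting rule as hana_ml.dataframe.quotename.
    return '"' + name.replace('"', '""') + '"'

def add_primary_key_sql(table, columns, schema=None):
    # Accept a single column name or a list, as the method above does.
    if isinstance(columns, str):
        columns = [columns]
    target = (_quote(schema) + '.' if schema else '') + _quote(table)
    cols = ', '.join(_quote(c) for c in columns)
    return 'ALTER TABLE {} ADD PRIMARY KEY ({})'.format(target, cols)

print(add_primary_key_sql('TEST_TBL', ['ID'], schema='TEST'))
# ALTER TABLE "TEST"."TEST_TBL" ADD PRIMARY KEY ("ID")
```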

create_table(table, table_structure, schema=None, table_type='COLUMN', prop='', data_lake=False, data_lake_container='SYSRDL#CG')

Create a SAP HANA table.

Parameters
tablestr

Table name.

table_structuredict

SAP HANA table structure. {Column name: Column type, ...}

schemastr, optional

Schema name. If None, use the current schema.

table_typestr, optional

Specify the table type.

  • 'COLUMN', by default.

  • 'ROW'.

  • 'TEMPORARY'.

data_lakebool, optional

If True, creates the table in the data lake by using SYSRDL#CG.REMOTE_EXECUTE().

Examples

cc is a connection to a SAP HANA system.

>>> cc.create_table(table='TEST_TBL', table_structure={'test_name': 'VARCHAR(50)'})
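The mapping from table_structure to DDL can be sketched as follows (a simplification; the helper name create_table_sql is ours, and the real method also handles schemas, data lake tables, and the prop argument):

```python
def create_table_sql(table, table_structure, table_type='COLUMN'):
    # Quote identifiers the way quotename does.
    def q(name):
        return '"' + name.replace('"', '""') + '"'
    cols = ', '.join('{} {}'.format(q(name), sqltype)
                     for name, sqltype in table_structure.items())
    return 'CREATE {} TABLE {} ({})'.format(table_type, q(table), cols)

print(create_table_sql('TEST_TBL', {'test_name': 'VARCHAR(50)'}))
# CREATE COLUMN TABLE "TEST_TBL" ("test_name" VARCHAR(50))
```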
create_virtual_table(table, data_lake_table, schema=None, data_lake_container='SYSRDL#CG')

Creates a SAP HANA virtual table that points to a data lake table.

Parameters
tablestr

HANA virtual table name.

data_lake_tablestr

HANA data lake table name.

schemastr, optional

Schema name. If None, use the current schema.

drop_table(table, schema=None, data_lake=False, data_lake_container='SYSRDL#CG')

Drop the specified table.

Parameters
tablestr

Table name.

schemastr, optional

Schema name. If None, use the current schema.

data_lakebool, optional

If True, drop the data lake table.

Examples

cc is a connection to a SAP HANA system.

>>> cc.drop_table(table='TEST_TBL')
copy_to_data_lake(data, virtual_table, data_lake_table, schema=None, append=False, data_lake_container='SYSRDL#CG')

Copy HANA data to data lake table.

Parameters
dataDataFrame

HANA DataFrame.

virtual_tablestr

HANA virtual table name.

data_lake_tablestr

HANA data lake table name.

schemastr, optional

Schema name. If None, use the current schema.

appendbool, optional

Append data to the existing data lake table.

has_table(table, schema=None)

Checks whether the specified table exists.

Parameters
tablestr

Table name.

schemastr, optional

Schema name. If None, use the current schema.

Returns
bool

True if the table exists.

Examples

cc is a connection to a SAP HANA system.

>>> cc.has_table(table='TEST_TBL')
True
hana_version()

Returns the version of the SAP HANA system.

Parameters
None
Returns
str

The SAP HANA version.

Examples

cc is a connection to a SAP HANA system.

>>> cc.hana_version()
'4.50.000.00.1581545459 (master)'
get_current_schema()

Returns the current schema name.

Parameters
None
Returns
str

The current schema name.

Examples

cc is a connection to a SAP HANA system.

>>> cc.get_current_schema()
'TEST'
get_tables(schema=None)

Returns the list of tables in the given schema.

Parameters
schemastr, optional

The schema name. If not specified, use the current schema.

hana_major_version()

Returns the major version number of the SAP HANA system.

Parameters
None
Returns
str

The major version number of the SAP HANA system.

Examples

cc is a connection to a SAP HANA system.

>>> cc.hana_major_version()
'4'
is_cloud_version()

Checks whether the SAP HANA instance is a cloud version or on-premise.

Parameters
None
Returns
bool

True if the SAP HANA instance is a cloud version.

read_onnx(onnx_file, table, schema=None)

Reads the content of an ONNX file into a HANA table and returns the corresponding HANA DataFrame.

Parameters
onnx_filestr

ONNX file location.

tablestr

The HANA table in which to persist the ONNX model.

schemastr, optional

Specifies the schema.

sql(sql)

Returns a SAP HANA DataFrame representing a query.

Parameters
sqlstr

SQL query. The last statement must be a SELECT statement.

Returns
DataFrame

The DataFrame for the query.

Examples

cc is a connection to a SAP HANA system.

>>> df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
table(table, schema=None, save_source=True)

Returns a DataFrame that represents the specified table.

Parameters
tablestr

The table name.

schemastr, optional, keyword-only

The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

save_sourcebool, optional

If True, save the name of source table. Defaults to True.

Returns
DataFrame

A DataFrame that selects data from the specified table.

Examples

>>> df1 = cc.table('MY_TABLE')
>>> df2 = cc.table('MY_OTHER_TABLE', schema='MY_SCHEMA')
class hana_ml.dataframe.DataFrame(connection_context, select_statement, _name=None)

Bases: object

Represents a frame that is backed by a database SQL statement.

Parameters
connection_contextConnectionContext

The connection to the SAP HANA system.

select_statementstr

The SQL query backing the dataframe.

Note

Parameters beyond connection_context and select_statement are intended for internal use. Do not rely on them; they may change without notice.

property columns

Lists the current DataFrame's column names. Computed lazily and cached. Each access to this property creates a new copy; mutating the list does not alter or corrupt the DataFrame.

Parameters
None
Returns
list of str

A list of column names.

Examples

df is a SAP HANA DataFrame.

>>> df.columns
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)',
 'target']
property shape

Computes the shape of the SAP HANA DataFrame.

Parameters
None
Returns
tuple

(The number of rows, the number of columns) in the SAP HANA DataFrame.

Examples

df is a SAP HANA DataFrame.

>>> df.shape
(1, 3)
property name

Returns the name of the DataFrame. This value does not correspond to a SAP HANA table name. This value is useful for joining predicates when the joining DataFrames have columns with the same name.

Parameters
None
Returns
str

The name of the DataFrame.

Examples

df is a SAP HANA DataFrame.

>>> df.name
'DT_1'
property quoted_name

Specifies the escaped name of the original DataFrame. Default-generated DataFrame names are safe to use in SQL without escaping, but names set with DataFrame.alias may require escaping.

Parameters
None
Returns
str

The escaped name of the DataFrame.

Examples

df is a SAP HANA DataFrame.

>>> df.quoted_name
'"DT_1"'
declare_lttab_usage(usage)

Declares whether this DataFrame makes use of local temporary tables.

Some SAP HANA PAL execution routines can execute more efficiently if they know up front whether a DataFrame's SELECT statement requires access to local temporary tables.

Parameters
usagebool

Specifies whether this DataFrame uses local temporary tables.

add_id(id_col=None, ref_col=None)

Returns a new SAP HANA DataFrame with an added ID column.

Parameters
id_colstr, optional

The name of the new ID column.

Defaults to "ID".

ref_colstr, optional

The ID values are generated based on ref_col.

Returns
DataFrame

A new SAP HANA DataFrame with an added ID column.

Examples

df is a SAP HANA DataFrame.

>>> df.collect()
X    Y
20   30
40   5
>>> df.add_id(id_col='ID').collect()
ID    X    Y
1     20   30
2     40   5
alias(alias)

Returns a new SAP HANA DataFrame with an alias set.

Parameters
aliasstr

The name of the DataFrame.

Returns
DataFrame

A SAP HANA DataFrame with an alias set.

See also

DataFrame.rename_columns

For renaming individual columns.

count()

Computes the number of rows in the SAP HANA DataFrame.

Parameters
None
Returns
int

The number of rows in the SAP HANA DataFrame.

Examples

df is a SAP HANA DataFrame.

>>> df.count()
150
diff(index, periods=1)

Returns a new SAP HANA DataFrame with differenced values.

Parameters
indexint or str

Index of the SAP HANA DataFrame.

periodsint, optional

Periods to shift for calculating difference, accepts negative values.

Defaults to 1.

Returns
DataFrame

A DataFrame with differenced values. No difference is calculated for string columns.
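The differencing rule is the familiar one: each value minus the value periods rows earlier, with NULL where the shifted position falls outside the column. A plain-Python sketch (diff_sketch is our illustrative name, not part of the API):

```python
def diff_sketch(values, periods=1):
    # value[i] - value[i - periods]; None where the shifted
    # position falls outside the column.
    out = []
    for i, v in enumerate(values):
        j = i - periods
        out.append(v - values[j] if 0 <= j < len(values) else None)
    return out

print(diff_sketch([20, 40, 45]))             # [None, 20, 5]
print(diff_sketch([20, 40, 45], periods=-1)) # [-20, -5, None]
```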

drop(cols)

Returns a new SAP HANA DataFrame without the specified columns.

Parameters
colslist of str

The list of column names to be dropped.

Returns
DataFrame

A new SAP HANA DataFrame that retains only the columns not listed in cols.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.drop(['B']).collect()
   A
0  1
1  2
distinct(cols=None)

Returns a new SAP HANA DataFrame with distinct values for the specified columns. If no columns are specified, then the distinct row values from all columns are returned.

Parameters
colsstr or list of str, optional

A column or list of columns to consider when getting distinct values. Defaults to use all columns.

Returns
DataFrame

The DataFrame with distinct values for cols.

Examples

Input:

>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100

Distinct values in a column:

>>> df.distinct("B").collect()
   B
0  A
1  B

Distinct values of a subset of columns:

>>> df.distinct(["A", "B"]).collect()
   A  B
0  1  B
1  2  A
2  1  A

Distinct values of the entire data set:

>>> df.distinct().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
drop_duplicates(subset=None)

Returns a new SAP HANA DataFrame with duplicate rows removed. All columns in the DataFrame are returned. There is no way to keep specific duplicate rows.

Warning

Specifying a non-None value of subset may produce an unstable DataFrame, the contents of which may be different every time you look at it. Specifically, if two rows are duplicates in their subset columns but have different values in other columns, then a different one of those rows may be picked each time the result is evaluated.

Parameters
subsetlist of str, optional

A list of columns to consider when deciding whether rows are duplicates of each other. Defaults to use all columns.

Returns
DataFrame

A DataFrame with only one copy of duplicate rows.

Examples

Input:

>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100

Drop duplicates based on the values of a subset of columns:

>>> df.drop_duplicates(["A", "B"]).collect()
   A  B    C
0  1  A  100
1  1  B  100
2  2  A  100

Distinct values on the entire data set:

>>> df.drop_duplicates().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
dropna(how=None, thresh=None, subset=None)

Returns a new DataFrame with NULLs removed.

Parameters
how{'any', 'all'}, optional

If provided, 'any' eliminates rows with any NULLs, and 'all' eliminates rows that are entirely NULLs. If neither how nor thresh is provided, how defaults to 'any'.

threshint, optional

If provided, rows with fewer than thresh non-NULL values are dropped. You cannot specify both how and thresh.

subsetlist of str, optional

The columns to consider when looking for NULLs. Values in other columns are ignored, whether they are NULL or not. Defaults to all columns.

Returns
DataFrame

A new SAP HANA DataFrame with a SELECT statement that removes NULLs.

Examples

Dropping rows with any NULL:

>>> df.collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
3  NaN  NaN  NaN
>>> df.dropna().collect()
     A    B    C
0  1.0  3.0  5.0

Dropping rows that are entirely nulls:

>>> df.dropna(how='all').collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN

Dropping rows with less than 2 non-null values:

>>> df.dropna(thresh=2).collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
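The thresh rule amounts to counting non-NULL values per row and keeping only rows that reach the threshold; in plain Python terms (dropna_thresh is our illustrative name):

```python
def dropna_thresh(rows, thresh):
    # Keep a row only if it has at least `thresh` non-NULL values.
    return [r for r in rows
            if sum(v is not None for v in r) >= thresh]

rows = [(1.0, 3.0, 5.0), (2.0, 4.0, None),
        (3.0, None, None), (None, None, None)]
print(dropna_thresh(rows, 2))  # [(1.0, 3.0, 5.0), (2.0, 4.0, None)]
```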
deselect(cols)

Returns a new DataFrame derived from the current DataFrame, excluding the specified columns.

Parameters
colsstr or tuple/list of str

The columns to be excluded from the new DataFrame.

Returns
DataFrame

A new DataFrame object excluding the specified columns.

Examples

Input:

>>> df.collect()
   A  B  C
0  1  2  3

Excluding a subset of existing columns:

>>> df.deselect(['A', 'B']).collect()
   C
0  3
dtypes(subset=None)

Returns a sequence of tuples describing the DataFrame's SQL types.

The tuples list the name, SQL type name, display_size, internal_size, precision and scale corresponding to the DataFrame's columns.

Parameters
subsetlist of str, optional

The columns that the information is generated from. Defaults to all columns.

Returns
dtypeslist of tuples

Each tuple consists of the name, SQL type name, display_size, internal_size, precision and scale for one of the DataFrame's columns. The list is in the order specified by the subset, or in the DataFrame's original column order if a subset is not provided.
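The subset selection rule can be illustrated in plain Python: one tuple per requested column, in subset order (the tuples and the helper name dtypes_subset here are hypothetical, shown only to make the ordering rule concrete):

```python
def dtypes_subset(all_dtypes, subset=None):
    # Return one tuple per requested column, in subset order;
    # with no subset, return everything in original column order.
    if subset is None:
        return list(all_dtypes)
    by_name = {t[0]: t for t in all_dtypes}
    return [by_name[name] for name in subset]

all_dtypes = [('AGE', 'INT', 10, 10, 10, 0),
              ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
print(dtypes_subset(all_dtypes, ['HOUSING']))
# [('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
```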

empty()

Returns True if this DataFrame has 0 rows.

Parameters
None
Returns
bool

True if the DataFrame is empty.

Notes

If a DataFrame contains only NULLs, it is not considered empty.

Examples

>>> df1.collect()
Empty DataFrame
Columns: [ACTUAL, PREDICT]
Index: []
>>> df1.empty()
True
>>> df2.collect()
  ACTUAL PREDICT
0   None    None
>>> df2.empty()
False
filter(condition)

Selects rows that match the given condition.

Very little checking is done on the condition string. Use only with trusted inputs.

Parameters
conditionstr

A filter condition. Format as SQL <condition>.

Returns
DataFrame

A DataFrame with rows that match the given condition.

Raises
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in condition. May have false positives and false negatives.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.filter('B < 4').collect()
   A  B
0  1  3
has(col)

Returns True if a column is in the DataFrame.

Parameters
colstr

The name of column to search in the projection list of this DataFrame.

Returns
bool

Returns True if the column exists in the DataFrame's projection list.

Examples

>>> df.columns
['A', 'B']
>>> df.has('A')
True
>>> df.has('C')
False
head(n=1)

Returns a DataFrame of the first n rows in the current DataFrame.

Parameters
nint, optional

The number of rows returned.

Defaults to 1.

Returns
DataFrame

A new DataFrame of the first n rows of the current DataFrame.
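Conceptually, head(n) wraps the current SELECT statement in a TOP query; a rough sketch of the SQL involved (head_sql is our illustrative name, and the statement hana_ml actually builds may differ):

```python
def head_sql(select_statement, n=1):
    # Wrap the existing query and keep only the first n rows.
    return 'SELECT TOP {} * FROM ({}) dt'.format(n, select_statement)

print(head_sql('SELECT * FROM "MY_TABLE"', 5))
# SELECT TOP 5 * FROM (SELECT * FROM "MY_TABLE") dt
```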

hasna(cols=None)

Returns True if a DataFrame contains NULLs.

Parameters
colsstr or list of str, optional

A column or list of columns to be checked for NULL values. Defaults to all columns.

Returns
bool

True if this DataFrame contains NULLs.

Examples

>>> df1.collect()
  ACTUAL PREDICT
0   1.0    None
>>> df1.hasna()
True
fillna(value, subset=None)

Returns a DataFrame with NULLs replaced with a specified value.

Only numeric columns can be filled.

Parameters
valueint or float

The value that replaces NULL. value should have a type that is appropriate for the selected columns.

subsetlist of str, optional

A list of columns whose NULL values will be replaced. Defaults to all columns.

Returns
DataFrame

A new DataFrame with the NULL values replaced.
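The replacement rule is per-column: NULLs in the selected columns become value, and everything else is untouched. In plain Python terms (fillna_sketch is our illustrative name):

```python
def fillna_sketch(rows, columns, value, subset=None):
    # Replace None with `value`, but only in the subset columns.
    targets = set(subset if subset is not None else columns)
    filled = []
    for row in rows:
        filled.append(tuple(
            value if v is None and c in targets else v
            for c, v in zip(columns, row)))
    return filled

rows = [(1.0, None), (None, 4.0)]
print(fillna_sketch(rows, ['A', 'B'], 0.0, subset=['B']))
# [(1.0, 0.0), (None, 4.0)]
```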

get_table_structure()

Returns the table structure as a dict in the form {column name: column type, ...}.

join(other, condition=None, how='inner', select=None)

Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.

Parameters
otherDataFrame or list of DataFrame

The DataFrame to join with.

conditionstr, optional

The join predicate. If an index has been set, the index is used as the join key.

Defaults to None.

how{'inner', 'left', 'right', 'outer', 'cross'}, optional

The type of join.

Defaults to 'inner'.

selectlist, optional

If provided, each element specifies a column in the result. A string in the select list should be the name of a column in one of the input DataFrames. An (expression, name) tuple creates a new column with the given name, computed from the given expression.

If this value is not provided, defaults to selecting all columns from both DataFrames, with the left DataFrame's columns first.

Returns
DataFrame

A new DataFrame object made from the join() of the current DataFrame with another DataFrame.

Raises
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in condition or in a column expression. May have false positives and false negatives.

Examples

Use the expression selection functionality to disambiguate duplicate column names in a join():

>>> df1.collect()
   A  B    C
0  1  2  3.5
1  2  4  5.6
2  3  3  1.1
>>> df2.collect()
   A  B     D
0  2  1  14.0
1  3  4   5.6
2  4  3   0.0

Old method:

>>> df1.alias('L').join(df2.alias('R'), 'L.A = R.A', select=[
...     ('L.A', 'A'),
...     ('L.B', 'B_LEFT'),
...     ('R.B', 'B_RIGHT'),
...     'C',
...     'D']).collect()
   A  B_LEFT  B_RIGHT    C     D
0  2       4        1  5.6  14.0
1  3       3        4  1.1   5.6

New method:

>>> df1.set_index("A").join(df2.rename_columns({"B":"B2"}).set_index("A")).collect()
   A  B  B2    C     D
0  2  4   1  5.6  14.0
1  3  3   4  1.1   5.6
set_name(name)

Sets the name of the DataFrame.

Parameters
namestr

The name of the DataFrame.

set_index(keys)

Sets an index for the DataFrame using the existing columns.

Parameters
keysstr or list of str

This parameter can be either a single column key or a list of column keys.

save(where, table_type=None, force=False, save_source=True, append=False, data_lake=False, data_lake_container='SYSRDL#CG')

Creates a table or view holding the current DataFrame's data.

Parameters
wherestr or (str, str) tuple

The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.

table_typestr, optional

The type of table to create. The value is case insensitive.

Permanent table options:
  • "ROW"

  • "COLUMN"

  • "HISTORY COLUMN"

Temporary table options:
  • "GLOBAL TEMPORARY"

  • "GLOBAL TEMPORARY COLUMN"

  • "LOCAL TEMPORARY"

  • "LOCAL TEMPORARY COLUMN"

Not a table:
  • "VIEW"

Defaults to 'LOCAL TEMPORARY COLUMN' if where starts with '#'. Otherwise, the default is 'COLUMN'.

forcebool, optional

If True, replaces an existing table of the same name.

save_sourcebool, optional

If True, it will save the name of source table.

Defaults to True.

appendbool, optional

If True, it will use the existing table and append data to it.

Defaults to False.

data_lakebool, optional

If True, it will save the table to HANA data lake.

Defaults to False.

Returns
DataFrame

A DataFrame that represents the new table or view.

Notes

For this operation to succeed, the table name must not be in use, the schema must exist, and the user must have permission to create tables (or views) in the target schema.
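The table_type default described above depends only on the target name; a sketch of that rule (default_table_type is our illustrative name):

```python
def default_table_type(where):
    # `where` may be a table name or a (schema, table) tuple.
    table = where[1] if isinstance(where, tuple) else where
    # Names starting with '#' denote local temporary tables in HANA.
    return 'LOCAL TEMPORARY COLUMN' if table.startswith('#') else 'COLUMN'

print(default_table_type('#TMP_RESULT'))        # LOCAL TEMPORARY COLUMN
print(default_table_type(('MY_SCHEMA', 'T1')))  # COLUMN
```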

save_nativedisktable(where, force=False, save_source=True)

Materializes the DataFrame to a SAP HANA native disk table.

Parameters
wherestr or (str, str) tuple

The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.

forcebool, optional

If True, replaces an existing table of the same name.

save_sourcebool, optional

If True, it will save the name of source table. Defaults to True.

Returns
DataFrame

A DataFrame that represents the new table.

split_column(column, separator, new_column_names)

Returns a new DataFrame with the specified column split into multiple columns.

Parameters
columnstr

The column to be split.

separatorstr

The separator.

new_column_nameslist of str

The names of the columns resulting from the split.

Returns
DataFrame

A new DataFrame object with the column split as specified.
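The split behaves like a per-row string split into a fixed set of new columns; in plain Python terms (the helper name and the YEAR/MONTH/DAY column names are ours, chosen for illustration):

```python
def split_column_sketch(values, separator, new_column_names):
    # Each input value is split into one part per new column name.
    return [dict(zip(new_column_names, v.split(separator)))
            for v in values]

print(split_column_sketch(['2021-01-31'], '-',
                          ['YEAR', 'MONTH', 'DAY']))
# [{'YEAR': '2021', 'MONTH': '01', 'DAY': '31'}]
```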

concat_columns(columns, separator)

Returns a new DataFrame with the specified columns concatenated into one column.

Parameters
columnslist of str

A list of columns to be concatenated.

separatorstr

The separator.

Returns
DataFrame

A new DataFrame object with the concatenated column as specified.
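Concatenation is the inverse of the split above: the listed columns are joined per row with the separator; in plain Python terms (concat_columns_sketch is our illustrative name):

```python
def concat_columns_sketch(rows, separator):
    # Each row is a tuple of string values from the chosen columns.
    return [separator.join(row) for row in rows]

print(concat_columns_sketch([('2021', '01', '31')], '-'))
# ['2021-01-31']
```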

sort(cols, desc=False)

Returns a new DataFrame sorted by the specified columns.

Parameters
colsstr or list of str

A column or list of columns to sort by. If a list is specified, then the sort order in parameter desc is used for all columns.

descbool, optional

Set to True to sort in descending order. Defaults to False, for ascending order.

Returns
DataFrame

New DataFrame object with rows sorted as specified.
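Conceptually, sort() appends an ORDER BY clause with the same direction for every listed column; a rough sketch of the SQL involved (sort_sql is our illustrative name, and the statement hana_ml actually builds may differ):

```python
def sort_sql(select_statement, cols, desc=False):
    # cols may be a single name or a list; one direction for all.
    if isinstance(cols, str):
        cols = [cols]
    direction = 'DESC' if desc else 'ASC'
    order = ', '.join('"{}" {}'.format(c.replace('"', '""'), direction)
                      for c in cols)
    return 'SELECT * FROM ({}) dt ORDER BY {}'.format(select_statement, order)

print(sort_sql('SELECT * FROM "T"', ['A', 'B'], desc=True))
# SELECT * FROM (SELECT * FROM "T") dt ORDER BY "A" DESC, "B" DESC
```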

sort_values(by, ascending=True)

Returns a new DataFrame sorted by the specified columns.

Parameters
bystr or list of str

A column or list of columns to sort by. If a list is specified, then the sort order in parameter ascending is used for all columns.

ascendingbool, optional

Set to False to sort in descending order.

Defaults to True, for ascending order.

Returns
DataFrame

New DataFrame object with rows sorted as specified.

sort_index(ascending=True)

Returns a new DataFrame sorted by the index.

Parameters
ascendingbool, optional

Set to False to sort in descending order.

Defaults to True, for ascending order.

Returns
DataFrame

New DataFrame object with rows sorted as specified.

select(*cols)

Returns a new DataFrame with columns derived from the current DataFrame.

Warning

There is no check that inputs interpreted as SQL expressions are actually valid expressions; an "expression" like "A FROM TAB; DROP TABLE IMPORTANT_THINGS; SELECT A" can cause a lot of damage.

Parameters
colsstr or (str, str) tuple.

The columns in the new DataFrame. A string is treated as the name of a column to select; a (str, str) tuple is treated as (SQL expression, alias). As a special case, '*' is expanded to all columns of the original DataFrame.

Returns
DataFrame

A new DataFrame object with the specified columns.

Raises
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in a column expression. May have false positives and false negatives.

Examples

Input:

>>> df.collect()
   A  B  C
0  1  2  3

Selecting a subset of existing columns:

>>> df.select('A', 'B').collect()
   A  B
0  1  2

Computing a new column based on existing columns:

>>> df.select('*', ('B * 4', 'D')).collect()
   A  B  C  D
0  1  2  3  8
union(other, all=True)

Combines this DataFrame's rows and another DataFrame's rows into one DataFrame. This operation is equivalent to a SQL UNION ALL.

Parameters
otherDataFrame, list of DataFrame

The right side of the union.

allbool, optional

If True, keep duplicate rows; equivalent to UNION ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the union); equivalent to a UNION or a UNION ALL followed by DISTINCT in SQL. Defaults to True.

Returns
DataFrame

The combined data from self and other.

Examples

We have two DataFrames we want to union, with some duplicate rows:

>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4

union() produces a DataFrame that contains all rows of both df1 and df2, like a UNION ALL:

>>> df1.union(df2).collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4

To get the deduplication effect of a UNION DISTINCT, pass all=False or call distinct() after union():

>>> df1.union(df2, all=False).collect()
   A  B
0  1  2
1  2  3
2  3  4
>>> df1.union(df2).distinct().collect()
   A  B
0  1  2
1  2  3
2  3  4
collect(fetch_size=32000, geometries=True, convert_geo_to_shapely=True)

Copies the current DataFrame to a new Pandas DataFrame.

Parameters
fetch_sizeint, optional

Fetch size in hdbcli.

geometriesbool, optional

With this flag set to True (default), the geometries are converted to Well-Known-Text representations in the resulting Pandas dataframe. Even if they are converted to Shapely objects (see convert_geo_to_shapely), when you print the dataframe, the geometry columns are represented as Well-Known-Text.

If you need the raw binary values, set this flag to False.

convert_geo_to_shapelybool, optional

If set to True (default), all geometry columns are converted to Shapely objects, so that the dataframe can be used directly in visualization libraries, for example. If your processing does not support Shapely objects, you can switch this conversion off; in that case the columns remain of type String.

Note: Before the conversion, each column is checked for NULL values. If a column contains any NULL, it is not converted to Shapely objects, because NULL WKT strings are not supported by Shapely.

Returns
pandas.DataFrame

A Pandas DataFrame that contains the current DataFrame's data.

Examples

Viewing a hana_ml DataFrame doesn't execute the underlying SQL or fetch the data:

>>> df = cc.table('T')
>>> df
<hana_ml.dataframe.DataFrame object at 0x7f2b7f24ddd8>

Using collect() executes the SQL and fetches the results into a Pandas DataFrame:

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> type(df.collect())
<class 'pandas.core.frame.DataFrame'>
property geometries: list

Returns the geometry columns of the DataFrame. The list is empty if there are none.

Returns
list

List with geometry columns

property srids: dict

Returns the SRID for each geometry column in the DataFrame. If none is found, the dictionary is {}.

For dataframes based on HANA catalog objects, the information is read from the catalog. For dataframes that have no catalog object (e.g. those based on SQL statements or temporary tables), the SRID is derived by selecting the first row of the table and reading it directly from the EWKB. For columns with multiple SRSes (SRID NULL), this means you might get back an SRS that differs from other entries in the same column.

Known limitation: for dataframes that have no catalog object and contain no data, no SRID can be provided.

Returns
dict

Dictionary with the SRID per column: {<column_name>: <srid>} Returns {} when none are found.

rename_columns(names)

Returns a DataFrame with renamed columns.

Parameters
nameslist or dict

If a list, specifies new names for every column in this DataFrame. If a dict, each dict entry maps an existing name to a new name, and not all columns need to be renamed.

Returns
DataFrame

The same data as the original DataFrame with new column names.

See also

DataFrame.alias

For renaming the DataFrame itself.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.rename_columns(['C', 'D']).collect()
   C  D
0  1  3
1  2  4
>>> df.rename_columns({'B': 'D'}).collect()
   A  D
0  1  3
1  2  4
cast(cols, new_type)

Returns a DataFrame with columns cast to a new type.

The name of the column in the returned DataFrame is the same as the original column.

Warning

Type compatibility between the existing column type and the new type is not checked. An incompatibility results in an error.

Parameters
colsstr or list

The column(s) to be cast to a different type.

new_typestr

The database datatype to cast the column(s) to. No checks are performed to see if the new type is valid. An invalid type can lead to SQL errors or even SQL injection vulnerabilities.

Returns
DataFrame

The same data as this DataFrame, but with columns cast to the specified type.

Examples

Input:

>>> df1 = cc.sql('SELECT "AGE", "PDAYS", "HOUSING" FROM DBM_TRAINING_TBL')
>>> df1.dtypes()
[('AGE', 'INT', 10, 10, 10, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]

Casting a column to NVARCHAR(20):

>>> df2 = df1.cast('AGE', 'NVARCHAR(20)')
>>> df2.dtypes()
[('AGE', 'NVARCHAR', 20, 20, 20, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]

Casting a list of columns to NVARCHAR(50):

>>> df3 = df1.cast(['AGE', 'PDAYS'], 'NVARCHAR(50)')
>>> df3.dtypes()
[('AGE', 'NVARCHAR', 50, 50, 50, 0), ('PDAYS', 'NVARCHAR', 50, 50, 50, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
to_head(col)

Returns a DataFrame with the specified column as the first item in the projection.

Parameters
colstr

The column to move to the first position.

Returns
DataFrame

The same data as this DataFrame but with the specified column in the first position.

Examples

Input:

>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED',
 'LABEL']

Moving the column 'LABEL' to head:

>>> df2 = df1.to_head('LABEL')
>>> pprint.pprint(df2.columns)
['LABEL',
 'ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED']
describe(cols=None)

Returns a DataFrame that contains various statistics for the requested column(s).

Parameters
colsstr or list, optional

The column(s) to be described. Defaults to all columns.

Returns
DataFrame

A DataFrame that contains statistics for the specified column(s) in the current DataFrame.

The statistics included are the count of rows ("count"),

number of distinct values ("unique"),

number of nulls ("nulls"),

average ("mean"),

standard deviation("std")

median ("median"),

minimum value ("min"),

maximum value ("max"),

25% percentile when treated as continuous variable ("25_percent_cont"),

25% percentile when treated as discrete variable ("25_percent_disc"),

50% percentile when treated as continuous variable ("50_percent_cont"),

50% percentile when treated as discrete variable ("50_percent_disc"),

75% percentile when treated as continuous variable ("75_percent_cont"),

75% percentile when treated as discrete variable ("75_percent_disc").

For columns that are strings, statistics such as average ("mean"), standard deviation ("std"), median ("median"), and the various percentiles are NULLs.

If the list of columns contains both string and numeric data types, minimum and maximum values become NULLs.

Examples

Input:

>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED',
 'LABEL']

Describe a few numeric columns and collect them to return a Pandas DataFrame:

>>> df1.describe(['AGE', 'PDAYS']).collect()
  column  count  unique  nulls        mean         std  min  max  median
0    AGE  16895      78      0   40.051376   10.716907   17   98      38
1  PDAYS  16895      24      0  944.406688  226.331944    0  999     999
   25_percent_cont  25_percent_disc  50_percent_cont  50_percent_disc
0             32.0               32             38.0               38
1            999.0              999            999.0              999
   75_percent_cont  75_percent_disc
0             47.0               47
1            999.0              999

Describe some non-numeric columns and collect them to return a Pandas DataFrame:

>>> df1.describe(['JOB', 'MARITAL']).collect()
    column  count  unique  nulls  mean   std       min      max median
0      JOB  16895      12      0  None  None    admin.  unknown   None
1  MARITAL  16895       4      0  None  None  divorced  unknown   None
  25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc
0            None            None            None            None
1            None            None            None            None
  75_percent_cont 75_percent_disc
0            None            None
1            None            None

Describe all columns in a DataFrame:

>>> df1.describe().collect()
            column  count  unique  nulls          mean           std
0               ID  16895   16895      0  21282.286652  12209.759725
1              AGE  16895      78      0     40.051376     10.716907
2         DURATION  16895    1267      0    263.965670    264.331384
3         CAMPAIGN  16895      35      0      2.344658      2.428449
4            PDAYS  16895      24      0    944.406688    226.331944
5         PREVIOUS  16895       7      0      0.209529      0.539450
6     EMP_VAR_RATE  16895      10      0     -0.038798      1.621945
7   CONS_PRICE_IDX  16895      26      0     93.538844      0.579189
8    CONS_CONF_IDX  16895      26      0    -40.334123      4.865720
9        EURIBOR3M  16895     283      0      3.499297      1.777986
10      NREMPLOYED  16895      11      0   5160.371885     75.320580
11             JOB  16895      12      0           NaN           NaN
12         MARITAL  16895       4      0           NaN           NaN
13       EDUCATION  16895       8      0           NaN           NaN
14     DBM_DEFAULT  16895       2      0           NaN           NaN
15         HOUSING  16895       3      0           NaN           NaN
16            LOAN  16895       3      0           NaN           NaN
17         CONTACT  16895       2      0           NaN           NaN
18       DBM_MONTH  16895      10      0           NaN           NaN
19     DAY_OF_WEEK  16895       5      0           NaN           NaN
20        POUTCOME  16895       3      0           NaN           NaN
21           LABEL  16895       2      0           NaN           NaN
         min        max     median  25_percent_cont  25_percent_disc
0      5.000  41187.000  21786.000        10583.500        10583.000
1     17.000     98.000     38.000           32.000           32.000
2      0.000   4918.000    184.000          107.000          107.000
3      1.000     43.000      2.000            1.000            1.000
4      0.000    999.000    999.000          999.000          999.000
5      0.000      6.000      0.000            0.000            0.000
6     -3.400      1.400      1.100           -1.800           -1.800
7     92.201     94.767     93.444           93.075           93.075
8    -50.800    -26.900    -41.800          -42.700          -42.700
9      0.634      5.045      4.856            1.313            1.313
10  4963.000   5228.000   5191.000         5099.000         5099.000
11       NaN        NaN        NaN              NaN              NaN
12       NaN        NaN        NaN              NaN              NaN
13       NaN        NaN        NaN              NaN              NaN
14       NaN        NaN        NaN              NaN              NaN
15       NaN        NaN        NaN              NaN              NaN
16       NaN        NaN        NaN              NaN              NaN
17       NaN        NaN        NaN              NaN              NaN
18       NaN        NaN        NaN              NaN              NaN
19       NaN        NaN        NaN              NaN              NaN
20       NaN        NaN        NaN              NaN              NaN
21       NaN        NaN        NaN              NaN              NaN
    50_percent_cont  50_percent_disc  75_percent_cont  75_percent_disc
0         21786.000        21786.000        32067.500        32068.000
1            38.000           38.000           47.000           47.000
2           184.000          184.000          324.000          324.000
3             2.000            2.000            3.000            3.000
4           999.000          999.000          999.000          999.000
5             0.000            0.000            0.000            0.000
6             1.100            1.100            1.400            1.400
7            93.444           93.444           93.994           93.994
8           -41.800          -41.800          -36.400          -36.400
9             4.856            4.856            4.961            4.961
10         5191.000         5191.000         5228.000         5228.000
11              NaN              NaN              NaN              NaN
12              NaN              NaN              NaN              NaN
13              NaN              NaN              NaN              NaN
14              NaN              NaN              NaN              NaN
15              NaN              NaN              NaN              NaN
16              NaN              NaN              NaN              NaN
17              NaN              NaN              NaN              NaN
18              NaN              NaN              NaN              NaN
19              NaN              NaN              NaN              NaN
20              NaN              NaN              NaN              NaN
21              NaN              NaN              NaN              NaN
bin(col, strategy='uniform_number', bins=None, bin_width=None, bin_column='BIN_NUMBER')

Returns a DataFrame with the original columns as well as bin assignments.

The names of the columns in the returned DataFrame are the same as in the original DataFrame. Column "BIN_NUMBER" (or the value specified in bin_column) is added and contains the bin assigned to each row.

Parameters
colstr

The column on which binning is performed. The column must be numeric.

strategy{'uniform_number', 'uniform_size'}, optional

Binning methods:

  • 'uniform_number': Equal widths based on the number of bins.

  • 'uniform_size': Equal widths based on the bin size.

Default value is 'uniform_number'.

binsint, optional

The number of equal-width bins. Only valid when strategy is 'uniform_number'.

Defaults to 10.

bin_widthint, optional

The interval width of each bin. Only valid when strategy is 'uniform_size'.

bin_columnstr, optional

The name of the output column that contains the bin number.

Returns
DataFrame

A binned dataset with the same data as this DataFrame, as well as an additional column "BIN_NUMBER" or the value specified in bin_column. This additional column contains the assigned bin for each row.

Examples

Input:

>>> df.collect()
   C1   C2    C3       C4
0   1  1.2   2.0      1.0
1   2  1.4   4.0      3.0
2   3  1.6   6.0      9.0
3   4  1.8   8.0     27.0
4   5  2.0  10.0     81.0
5   6  2.2  12.0    243.0
6   7  2.4  14.0    729.0
7   8  2.6  16.0   2187.0
8   9  2.8  18.0   6561.0
9  10  3.0  20.0  19683.0

Create five bins of equal widths on C1:

>>> df.bin('C1', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5

Create five bins of equal widths on C3:

>>> df.bin('C3', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5

Create five bins of equal widths on a column that varies significantly:

>>> df.bin('C4', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           1
3   4  1.8   8.0     27.0           1
4   5  2.0  10.0     81.0           1
5   6  2.2  12.0    243.0           1
6   7  2.4  14.0    729.0           1
7   8  2.6  16.0   2187.0           1
8   9  2.8  18.0   6561.0           2
9  10  3.0  20.0  19683.0           5

Create bins of equal width:

>>> df.bin('C1', strategy='uniform_size', bin_width=3).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           2
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           3
7   8  2.6  16.0   2187.0           3
8   9  2.8  18.0   6561.0           4
9  10  3.0  20.0  19683.0           4
agg(agg_list, group_by=None)

Returns a SAP HANA DataFrame with the group_by column along with the aggregates. This method supports all aggregation functions in the SAP HANA system, such as 'max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var'. The group_by column keeps its original name; each aggregate is aliased with the name given in agg_list.

For the available aggregation functions, refer to the SAP HANA aggregate functions documentation.

Parameters
agg_listA list of tuples

A list of tuples. Each tuple is a triplet. The triplet consists of (aggregate_operator, expression, name) where:

  • aggregate_operator is one of ['max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var', ...]. The operator name matches the SAP HANA SQL naming, and all aggregation functions in the SAP HANA system are supported.

  • expression is a str that is a column or column expression

  • name is the name of this aggregate in the project list.

group_bystr or list of str, optional

The group by column. Only column names are allowed here, although SQL itself permits arbitrary expressions. To group by an expression, create a DataFrame from the full SQL instead. For example, given a table T with integer columns C1, C2, and C3, to calculate max(C1) grouped by (C2+C3), create the DataFrame as below:

cc.sql('SELECT "C2"+"C3", max("C1") FROM "T" GROUP BY "C2"+"C3"')

Returns
DataFrame

A DataFrame containing the group_by column (if it exists), as well as the aggregate expressions that are aliased with the specified names.

Examples

Input:

>>> df.collect()
    ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM          SPECIES
0    1            5.1           3.5            1.4           0.2      Iris-setosa
1    2            4.9           3.0            1.4           0.2      Iris-setosa
2    3            4.7           3.2            1.3           0.2      Iris-setosa
3   51            7.0           3.2            4.7           1.4  Iris-versicolor
4   52            6.4           3.2            4.5           1.5  Iris-versicolor
5  101            6.3           3.3            6.0           2.5   Iris-virginica
6  102            5.8           2.7            5.1           1.9   Iris-virginica
7  103            7.1           3.0            5.9           2.1   Iris-virginica
8  104            6.3           2.9            5.6           1.8   Iris-virginica

Another way to do a count:

>>> df.agg([('count', 'SPECIES', 'COUNT')]).collect()
    COUNT
0      9

Get counts by SPECIES:

>>> df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES').collect()
           SPECIES  COUNT
0  Iris-versicolor      2
1   Iris-virginica      4
2      Iris-setosa      3

Get max values of SEPALLENGTHCM by SPECIES:

>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH')], group_by='SPECIES').collect()
           SPECIES  MAX_SEPAL_LENGTH
0  Iris-versicolor               7.0
1   Iris-virginica               7.1
2      Iris-setosa               5.1

Get max and min values of SEPALLENGTHCM by SPECIES:

>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH'),
    ('min', 'SEPALLENGTHCM', 'MIN_SEPAL_LENGTH')], group_by=['SPECIES']).collect()
           SPECIES  MAX_SEPAL_LENGTH  MIN_SEPAL_LENGTH
0  Iris-versicolor               7.0               6.4
1   Iris-virginica               7.1               5.8
2      Iris-setosa               5.1               4.7

Get aggregate grouping by multiple columns:

>>> df.agg([('count', 'SEPALLENGTHCM', 'COUNT_SEPAL_LENGTH')],
            group_by=['SPECIES', 'PETALLENGTHCM']).collect()
           SPECIES  PETALLENGTHCM  COUNT_SEPAL_LENGTH
0   Iris-virginica            6.0                   1
1      Iris-setosa            1.3                   1
2   Iris-virginica            5.9                   1
3   Iris-virginica            5.6                   1
4      Iris-setosa            1.4                   2
5  Iris-versicolor            4.7                   1
6  Iris-versicolor            4.5                   1
7   Iris-virginica            5.1                   1
is_numeric(cols=None)

Returns True if the column(s) in the DataFrame are numeric.

Parameters
colsstr or list, optional

The column(s) to be tested for being numeric.

Defaults to all columns.

Returns
bool

True if all the columns are numeric.

Examples

Input:

>>> df.head(5).collect()
ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM      SPECIES
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
>>> pprint.pprint(df.dtypes())
[('ID', 'INT', 10, 10, 10, 0),
 ('SEPALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SEPALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SPECIES', 'NVARCHAR', 15, 15, 15, 0)]

Test a single column:

>>> df.is_numeric('ID')
True
>>> df.is_numeric('SEPALLENGTHCM')
True
>>> df.is_numeric(['SPECIES'])
False

Test a list of columns:

>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
True
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'SPECIES'])
False
corr(first_col, second_col)

Returns a DataFrame that gives the correlation coefficient between two numeric columns.

All rows with NULL values for first_col or second_col are removed prior to calculating the correlation coefficient.

The correlation coefficient is:

1/(n-1) * sum((col1_value - avg(col1)) * (col2_value - avg(col2))) / (stddev(col1) * stddev(col2))
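
This is the sample Pearson correlation coefficient. As a quick local sanity check (a numpy sketch with made-up data, not part of hana_ml), the documented formula can be compared against numpy.corrcoef:

```python
import numpy as np

# Verify that the documented formula (sample covariance divided by the
# product of the sample standard deviations) equals the Pearson
# correlation coefficient.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.4, 1.6, 1.8, 2.0])

n = len(x)
r = (np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)) \
    / (x.std(ddof=1) * y.std(ddof=1))

# Matches numpy's Pearson coefficient; 1.0 here since y is linear in x.
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-12
print(round(r, 6))  # 1.0
```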

Parameters
first_colstr

The first column for calculating the correlation coefficient.

second_colstr

The second column for calculating the correlation coefficient.

Returns
DataFrame

A DataFrame with one value that contains the correlation coefficient. The name of the column is CORR_COEFF.

Examples

Input:

>>> df.columns
['C1', 'C2', 'C3', 'C4']
>>> df.collect()
   C1   C2      C3       C4
0   1  1.2     2.0      1.0
1   2  1.4     4.0      3.0
2   3  1.6     8.0      9.0
3   4  1.8    16.0     27.0
4   5  2.0    32.0     81.0
5   6  2.2    64.0    243.0
6   7  2.4   128.0    729.0
7   8  2.6   256.0   2187.0
8   9  2.8   512.0   6561.0
9  10  3.0  1024.0  19683.0

Correlation with columns that are well correlated:

>>> df.corr('C1', 'C2').collect()
   CORR_COEFF
0         1.0
>>> df.corr('C1', 'C3').collect()
   CORR_COEFF
0         1.0

Correlation with a column whose value is three times its previous value:

>>> df.corr('C1', 'C4').collect()
    CORR_COEFF
0    0.696325
min()

Get the minimum value of the columns. It simplifies the use of the agg function.

Returns
scalar or Series
max()

Get the maximum value of the columns. It simplifies the use of the agg function.

Returns
scalar or Series
sum()

Get the sum of the columns. It simplifies the use of the agg function.

Returns
scalar or Series
median()

Get the median value of the columns. It simplifies the use of the agg function.

Returns
scalar or Series
mean()

Get the mean value of the columns. It simplifies the use of the agg function.

Returns
scalar or Series
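
These convenience methods correspond to running agg with the matching aggregation operator. A local pandas sketch with made-up data (not a HANA connection; with hana_ml, the same aggregates are computed in the database) illustrating the expected values:

```python
import pandas as pd

# Stand-in pandas frame; hypothetical sample data.
df = pd.DataFrame({"C1": [1, 2, 3, 4, 5],
                   "C2": [1.2, 1.4, 1.6, 1.8, 2.0]})

print(df["C1"].min())     # 1
print(df["C1"].max())     # 5
print(df["C1"].sum())     # 15
print(df["C2"].median())  # 1.6
print(df["C2"].mean())    # 1.6
```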
value_counts(subset=None)

Get the value counts of the columns. It simplifies the use of the agg function.

Parameters
subsetlist, optional

Columns to use when counting unique combinations.

Returns
DataFrame
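
value_counts returns the number of rows for each unique combination of the subset columns, i.e. a grouped count. A local pandas sketch of the equivalent aggregation (hypothetical data, not a HANA connection):

```python
import pandas as pd

df = pd.DataFrame({"SPECIES": ["Iris-setosa", "Iris-setosa",
                               "Iris-virginica"]})

# Equivalent grouped count, as value_counts would produce for
# subset=['SPECIES'].
counts = df.groupby("SPECIES").size().to_dict()
print(counts)  # {'Iris-setosa': 2, 'Iris-virginica': 1}
```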
pivot_table(values, index, columns, aggfunc='avg')

Returns a DataFrame that gives the pivoted table.

aggfunc is identical to SAP HANA aggregate functions:

https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/6fff7f0ae9184d1db47a25791545a1b6.html

Parameters
valuesstr or list of str

The targeted values for pivoting.

indexstr or list of str

The index of the DataFrame.

columnsstr or list of str

The pivoting columns.

aggfunc{'avg', 'max', 'min',... }, optional

aggfunc is identical to SAP HANA aggregate functions. Defaults to 'avg'.

Returns
DataFrame

A pivoted DataFrame.

Examples

df is a SAP HANA DataFrame.

>>> df.pivot_table(values='C2', index='C1', columns='C3', aggfunc='max')
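
The semantics mirror a pandas pivot_table with an SQL aggregate name. A local pandas sketch with made-up data, showing the shape of the result:

```python
import pandas as pd

df = pd.DataFrame({"C1": ["a", "a", "b", "b"],
                   "C3": ["x", "y", "x", "y"],
                   "C2": [1.0, 2.0, 3.0, 4.0]})

# Rows indexed by C1, one column per distinct C3 value, each cell
# holding max(C2) for that (C1, C3) combination.
pivoted = df.pivot_table(values="C2", index="C1", columns="C3",
                         aggfunc="max")
print(pivoted)
```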
generate_table_type()

Generates a SAP HANA table type based on the dtypes function of the DataFrame. This is a convenience method for SQL tracing.

Parameters
None
Returns
str

The table type in string form.
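
As a rough illustration, the table type string can be thought of as assembled from dtypes()-style tuples. The helper below is hypothetical; the exact format generate_table_type() emits may differ:

```python
# Hypothetical sketch: build a HANA table type string from
# dtypes()-style tuples (name, sql_type, length, ...). Not the actual
# implementation.
def table_type_from_dtypes(dtypes):
    parts = []
    for name, sql_type, length, *_ in dtypes:
        if sql_type in ("VARCHAR", "NVARCHAR"):
            parts.append('"{}" {}({})'.format(name, sql_type, length))
        else:
            parts.append('"{}" {}'.format(name, sql_type))
    return "TABLE ({})".format(", ".join(parts))

dtypes = [("ID", "INT", 10, 10, 10, 0),
          ("SPECIES", "NVARCHAR", 15, 15, 15, 0)]
print(table_type_from_dtypes(dtypes))
# TABLE ("ID" INT, "SPECIES" NVARCHAR(15))
```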

set_source_table(table, schema=None)

Specify the source table for the current dataframe.

Parameters
tablestr

The table name.

schemastr, optional, keyword-only

The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

to_pickle(path, compression='infer', protocol=4)

Pickle object to file.

Parameters
pathstr

File path where the pickled object will be stored.

compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}

A string representing the compression to use in the output file. By default, the compression is inferred from the file extension of the specified path.

Defaults to 'infer'.

protocolint

The protocol used by the pickler.

Defaults to 4.
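
A local pandas sketch of the pickle round trip (hypothetical data and path, no HANA connection); a file produced this way can be read back with pandas.read_pickle:

```python
import os
import tempfile

import pandas as pd

# Local stand-in frame; with hana_ml, to_pickle is called on the
# hana_ml DataFrame instead.
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

path = os.path.join(tempfile.mkdtemp(), "df.pkl")
df.to_pickle(path)               # compression inferred from extension
restored = pd.read_pickle(path)
assert restored.equals(df)
```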

hana_ml.dataframe.read_pickle(connection_context, path, table_name, compression='infer', schema=None, force=False, replace=True, object_type_as_bin=False, table_structure=None, **kwargs)

Load pickled DataFrame object from file.

Parameters
connection_contextConnectionContext

A connection to the SAP HANA system.

pathstr

File path from which the pickled object will be loaded.

table_namestr

The table name in the SAP HANA database.

compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, optional

For on-the-fly decompression of on-disk data. If 'infer', then use gzip, bz2, xz or zip if path ends in '.gz', '.bz2', '.xz', or '.zip' respectively, and no decompression otherwise. Set to None for no decompression.

Defaults to 'infer'.

schemastr, optional, keyword-only

The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

forcebool, optional

If force is True, then the SAP HANA table with table_name is dropped.

Defaults to False.

replacebool, optional

If replace is True, then the SAP HANA table performs the missing value handling.

Defaults to True.

object_type_as_binbool, optional

If True, the object type will be considered CLOB in SAP HANA.

Defaults to False.

table_structuredict, optional

Manually define column types based on SAP HANA DB table structure.

Defaults to None.

hana_ml.dataframe.create_dataframe_from_pandas(connection_context, pandas_df, table_name, schema=None, force=False, replace=False, object_type_as_bin=False, table_structure=None, drop_exist_tab=True, allow_bigint=False, geo_cols: Optional[list] = None, srid: int = 4326, primary_key: Optional[str] = None, not_nulls: Optional[list] = None, chunk_size=50000, disable_progressbar=False)

Uploads data from a Pandas DataFrame to a SAP HANA database and returns an SAP HANA DataFrame.

Parameters
connection_contextConnectionContext

A connection to the SAP HANA system.

pandas_dfpandas.DataFrame

A Pandas DataFrame for uploading to the SAP HANA database. This can also be a GeoPandas DataFrame, which is automatically converted to a Pandas DataFrame.

table_namestr

The table name in the SAP HANA database.

schemastr, optional, keyword-only

The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

forcebool, optional

If force is True, then the SAP HANA table with table_name is truncated or dropped.

Defaults to False.

replacebool, optional

If replace is True, then the SAP HANA table performs the missing value handling.

Defaults to False.

object_type_as_binbool, optional

If True, the object type will be considered CLOB in SAP HANA.

Defaults to False.

table_structuredict

Manually define column types based on SAP HANA DB table structure.

drop_exist_tabbool, optional

If force is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.

Defaults to True.

allow_bigintbool, optional

allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.

Defaults to False.

geo_colslist, optional but required for spatial functions

Specifies the columns of the dataframe, which are treated as geometries. List elements can be either strings or tuples.

Strings represent columns which contain geometries in (E)WKT format. If the provided DataFrame is a GeoPandas DataFrame, you do not need to add the geometry column to the geo_cols. It will be detected and added automatically.

The column name in the HANA Table will be <column_name>_GEO

Tuples must consist of two strings: (<longitude column>, <latitude column>)

longitude column: Dataframe column, that contains the longitude values

latitude column: Dataframe column, that contains the latitude values

They are combined into a POINT(<longitude> <latitude>) geometry.

The column name in the HANA Table will be <longitude>_<latitude>_GEO

Defaults to None.

sridint, optional but required for spatial functions

Spatial reference system id. If the SRS is not created yet, the system tries to create it automatically.

Defaults to 4326.

primary_keystr, optional but required for Graph functions

Name of the column in a node table which contains the unique identification of the node and corresponds with the edge table.

Defaults to None.

not_nullslist, optional but required for Graph functions

Contains all column names which should get a not-null constraint. This is primarily for creating node and edge tables for a graph workspace.

Defaults to None.

chunk_sizeint, optional

Specify the chunk size for upload.

Defaults to 50000.

disable_progressbarbool, optional

Disable the progressbar.

Defaults to False.

Returns
DataFrame

A SAP HANA DataFrame that contains the data in the pandas_df.

Examples

>>> create_dataframe_from_pandas(connection_context,p_df,'test',force=False,replace=True,drop_exist_tab=False)
<hana_ml.dataframe.DataFrame at 0x7efbcb26fbe0>
>>> create_dataframe_from_pandas(
        connection_context,
        p_df,
        "geo_table",
        force=False,
        replace=True,
        drop_exist_tab=False,
        geo_cols=["geometry", ("long", "lat")],
    )
<hana_ml.dataframe.DataFrame at 0x5eabcb27fbe0>
hana_ml.dataframe.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name=None)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frameDataFrame

A SAP HANA DataFrame.

id_varsstr, tuple or list, optional

Column(s) to use as identifier variables.

Defaults to None.

value_varstuple or list, optional

Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

var_namescalar, optional

Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.

value_namescalar, default 'value', optional

Name to use for the 'value' column.

Returns
DataFrame

Unpivoted DataFrame.

Examples

>>> data.collect()
    A B C
 0  a 1 2
 1  b 3 4
 2  c 5 6
>>> melt(data, id_vars=['A'], value_vars=['B', 'C']).collect()
    A variable value
 0  a        B     1
 1  a        C     2
 2  b        B     3
 3  b        C     4
 4  c        B     5
 5  c        C     6
hana_ml.dataframe.create_dataframe_from_shapefile(connection_context: hana_ml.dataframe.ConnectionContext, shp_file: str, srid: int, table_name: str, schema: Optional[str] = None) hana_ml.dataframe.DataFrame

Converts a shapefile into a DataFrame that is backed in SAP HANA. Expects the shapefile to be a zip and/or to have both shp and dbf parts to create the single table. The data is first loaded into a temporary local table, then imported into the target table through direct insertion, and the resulting SAP HANA DataFrame is returned for visualization.

Parameters
connection_contextConnectionContext

A connection to the SAP HANA system.

shp_filestr

Path to a zipfile, shapefile or dbf-file. The filename suffix is dropped, and both the dbf and shp files are loaded by that base name. Allowed suffixes are: .zip, .shp, .shx, and .dbf

sridint

The spatial reference id that applies to the list of columns in geo_cols. If the SRS is not created yet, the system tries to create it automatically.

table_namestr

The table name in the SAP HANA database.

schemastr, optional, keyword-only

The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

Returns
DataFrame

A SAP HANA DataFrame with geometric columns that contains the data from the shp_file.

Examples

>>> cc = connection_context
>>> shapefile_path = os.path.join(os.getcwd(), 'myshape.shp')
>>> hana_df = create_dataframe_from_shapefile(
...   connection_context=cc,
...   shp_file=shapefile_path,
...   srid=4326,
...   table_name="myshape_tbl")