hana_ml.dataframe
This module represents a database query as a dataframe. Most operations are designed to not bring data back from the database unless explicitly requested.
The following classes and functions are available:
- hana_ml.dataframe.quotename(name)
Escapes a schema, table, or column name for use in SQL. hana_ml functions and methods that take schema, table, or column names already escape their input by default, but those that take SQL don't (and can't) perform escaping automatically.
- Parameters:
- namestr
The schema, table, or column name.
- Returns:
- str
The escaped name. The string is surrounded in quotation marks, and existing quotation marks are escaped by doubling them.
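Examples
A minimal sketch of composing SQL with quotename (the table name is illustrative):
>>> query = 'SELECT * FROM {}'.format(quotename('My Table'))
>>> query
'SELECT * FROM "My Table"'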
- class hana_ml.dataframe.ConnectionContext(address='', port=0, user='', password=None, autocommit=True, packetsize=None, userkey=None, spatialtypes=1, encrypt=None, sslValidateCertificate=None, pyodbc_connection=None, abap_sql=False, sslKeyStore=None, **properties)
Bases:
object
Represents a connection to a SAP HANA database instance.
ConnectionContext includes methods for creating DataFrames from data in SAP HANA. DataFrames are tied to a ConnectionContext, and are unusable once their ConnectionContext is closed.
- Parameters:
- Same as hdbcli.dbapi.connect.
Please see the online docs for hdbcli.dbapi.connect for more details.
Examples
Querying data from SAP HANA into a Pandas DataFrame:
>>> with ConnectionContext('address', port, 'user', 'password') as cc:
...     df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
...             .filter('COL3 > 5')
...             .select('COL1', 'COL2'))
...     pandas_df = df.collect()
The underlying hdbcli.dbapi.connect can be accessed if necessary:
>>> with ConnectionContext('127.0.0.1', 30215, 'MLGUY', 'manager') as cc:
...     cc.connection.setclientinfo('SOMEKEY', 'somevalue')
...     df = cc.sql('some sql that needs that session variable')
...     ...
- Attributes:
- connectionhdbcli.dbapi.connect
The underlying dbapi connection. Use this connection to run SQL directly, or to access connection methods like getclientinfo/setclientinfo.
- enable_abap_sql()
Enables ABAP SQL.
- disable_abap_sql()
Disables ABAP SQL.
- close()
Closes the existing connection to a SAP HANA database instance.
- Parameters:
- None
- Returns:
- None
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.close()
- add_primary_key(table, columns, schema=None)
Adds a primary key to an existing table.
- Parameters:
- tablestr
Table name.
- columnsstr or list of str
The column(s) that form the primary key.
- schemastr, optional
Schema name. If None, use the current schema.
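Examples
cc is a connection to a SAP HANA database instance. A minimal sketch; the table and column names are illustrative:
>>> cc.add_primary_key('TEST_TBL', columns='ID')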
- copy()
Returns a new connection context.
- create_schema(schema)
Creates a SAP HANA schema.
- Parameters:
- schemastr
Schema name.
- create_table(table, table_structure, schema=None, table_type='COLUMN', prop='', data_lake=False, data_lake_container='SYSRDL#CG')
Creates a SAP HANA table.
- Parameters:
- tablestr
Table name.
- table_structuredict
SAP HANA table structure. {Column name: Column type, ...}
- schemastr, optional
Schema name. If None, use the current schema.
Defaults to None.
- table_typestr, optional
Specify the table type.
'COLUMN', by default.
'ROW'.
'TEMPORARY'.
- data_lakebool, optional
If True, create the data lake table by using SYSRDL#CG.REMOTE_EXECUTE().
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.create_table(table='TEST_TBL', table_structure={'test_name': 'VARCHAR(50)'})
- create_virtual_table(table, data_lake_table=None, schema=None, data_lake_container='SYSRDL#CG', remote_source=None, remote_database='NULL', remote_schema=None, remote_table=None)
Creates a SAP HANA virtual table at a remote or data lake source.
- Parameters:
- tablestr
HANA virtual table name.
- data_lake_tablestr, optional
HANA data lake table name.
Mandatory if remote_table is None.
- schemastr, optional
Schema name. If None, use the current schema.
Defaults to None.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- remote_sourcestr, optional
Remote source where the data of the target virtual table resides.
Mandatory and valid only if data_lake_table is None.
- remote_databasestr, optional
Database of the remote source where the data of the target virtual table resides.
Defaults to 'NULL', i.e. the default database.
Valid only if data_lake_table is None.
- remote_schemastr, optional
The schema under which the corresponding remote_table of the target SAP HANA virtual table resides.
Required if data_lake_table is None.
Defaults to None.
- remote_tablestr, optional
The table name in the remote source where the data of the target virtual table resides.
Mandatory and valid only when data_lake_table is None.
Defaults to None.
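Examples
cc is a connection to a SAP HANA database instance. A hedged sketch of creating a virtual table over a remote source; the remote source, schema, and table names are illustrative:
>>> cc.create_virtual_table(table='MY_VIRTUAL_TBL',
...                         remote_source='MY_REMOTE_SOURCE',
...                         remote_schema='REMOTE_SCHEMA',
...                         remote_table='REMOTE_TBL')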
- drop_procedure(proc, schema=None, drop_option=None)
Drops the specified procedure.
- Parameters:
- procstr
Procedure name.
- schemastr, optional
Schema name. If None, use the current schema.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_procedure(proc='TEST_PROC')
- drop_view(view, schema=None, drop_option=None)
Drops the specified view.
- Parameters:
- viewstr
View name.
- schemastr, optional
Schema name. If None, use the current schema.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_view(view='TEST_VIEW')
- drop_table(table, schema=None, data_lake=False, data_lake_container='SYSRDL#CG', drop_option=None)
Drops the specified table.
- Parameters:
- tablestr
Table name.
- schemastr, optional
Schema name. If None, use the current schema.
- data_lakebool, optional
If True, drop the data lake table.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- drop_option{None, 'CASCADE', 'RESTRICT'}, optional
Specifies the drop option to use.
Defaults to None.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.drop_table(table='TEST_TBL')
- copy_to_data_lake(data, virtual_table, data_lake_table, schema=None, append=False, data_lake_container='SYSRDL#CG')
Copies HANA data to a data lake table.
- Parameters:
- dataDataFrame
HANA DataFrame.
- virtual_tablestr
HANA virtual table name.
- data_lake_tablestr
HANA data lake table name.
- schemastr, optional
Schema name. If None, use the current schema.
- appendbool, optional
Append data to the existing data lake table.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
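Examples
A hedged sketch (df is a SAP HANA DataFrame; the table names are illustrative):
>>> cc.copy_to_data_lake(data=df, virtual_table='MY_VIRTUAL_TBL', data_lake_table='MY_DL_TBL')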
- explain_plan_statement(statement_name, subquery, force=True)
Evaluates the execution plan that the database follows when executing an SQL statement and returns the result.
- Parameters:
- statement_namestr
Specifies the name of a specific execution plan in the output table for a given SQL.
- subquerystr
Specifies the subquery to explain the plan for.
- forcebool, optional
If True, deletes any existing result stored under the same statement_name.
Defaults to True.
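Examples
A hedged sketch (the statement name and subquery are illustrative):
>>> cc.explain_plan_statement('my_plan', 'SELECT * FROM MY_TABLE')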
- has_schema(schema)
Returns a boolean value indicating whether the schema exists.
- Parameters:
- schemastr
Schema name.
- Returns:
- bool
Schema existence.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.has_schema(schema='MY_SCHEMA')
True
- has_table(table, schema=None)
Returns a boolean value indicating whether the table exists.
- Parameters:
- tablestr
Table name.
- schemastr, optional
Schema name. If None, use the current schema.
- Returns:
- bool
Table existence.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.has_table(table='TEST_TBL')
True
- hana_version()
Returns the version of a SAP HANA database instance.
- Parameters:
- None
- Returns:
- str
The SAP HANA version.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.hana_version()
'4.50.000.00.1581545459 (master)'
- get_current_schema()
Returns the current schema name.
- Parameters:
- None
- Returns:
- str
The current schema name.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.get_current_schema()
'TEST'
- get_tables(schema=None)
Returns the list of tables in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- get_procedures(schema=None)
Returns the list of procedures in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- get_temporary_tables(schema=None, connection_id=None, list_other_connections=False)
Returns the list of temporary tables in the given schema.
- Parameters:
- schemastr, optional
The schema name. If not specified, use the current schema.
- connection_idint, optional
If None, it returns the temporary tables from the current connection.
- list_other_connectionsbool, optional
If True, it also outputs the temporary tables from other connections.
- get_connection_id()
Returns the connection id.
- cancel_session_process(connection_id=None)
Cancels the current process in the given session. If the connection_id is not provided, it will use the current connection.
- Parameters:
- connection_idint, optional
Connection id.
Defaults to the current connection.
- restart_session(connection_id=None)
Terminates the current session, drops all the temporary tables and starts a new one.
- Parameters:
- connection_idint, optional
Connection id.
Defaults to the current connection.
- clean_up_temporary_tables()
Drops all the temporary tables under the current schema.
- hana_major_version()
Returns the major version number of the SAP HANA instance.
- Parameters:
- None
- Returns:
- str
The major version number of the SAP HANA instance.
Examples
cc is a connection to a SAP HANA database instance.
>>> cc.hana_major_version()
'4'
- is_cloud_version()
Checks whether the SAP HANA database instance is a cloud version or on-premise.
- Parameters:
- None
- Returns:
- bool
True if the SAP HANA instance is a cloud version.
- sql(sql)
Returns a SAP HANA DataFrame representing a query.
- Parameters:
- sqlstr
SQL query. The last statement must be a SELECT statement.
- Returns:
- DataFrame
The DataFrame for the query.
Examples
cc is a connection to a SAP HANA database instance.
>>> df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
- execute_sql(sql)
Multiline SQL execution.
- Parameters:
- sqlstr or SQL file
SQL statements to execute.
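Examples
A hedged sketch of executing several statements at once (the statements are illustrative):
>>> cc.execute_sql("""
... CREATE TABLE T1 (ID INTEGER, C2 VARCHAR(10));
... INSERT INTO T1 VALUES (1, 'A');
... """)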
- table(table, schema=None, save_source=True, view_params=None)
Returns a DataFrame that represents the specified table.
- Parameters:
- tablestr
The table name.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
- save_sourcebool, optional
If True, save the name of source table. Defaults to True.
- view_paramslist or tuple, optional
Parameters for view.
- Returns:
- DataFrame
The DataFrame that is selecting data from the specified table.
Examples
>>> df1 = cc.table('MY_TABLE')
>>> df2 = cc.table('MY_OTHER_TABLE', schema='MY_SCHEMA')
- upsert_streams_data(table_name, key, data, schema=None)
Upserts data into a SAP HANA table through SQL UPSERT, using HANA LOB streams when the provided data contains bytes.
- Parameters:
- table_name: str
Name of the HANA table into which the data is upserted.
- key: str
The key column name.
- data: dict
The keys of data are column names while the values are the data to upsert. If data contains bytes, the method will use HANA LOB streams method.
- schema: str, optional
The schema name.
Defaults to the current schema.
Examples
>>> with open('image.png', 'rb') as f:
...     img = f.read()
>>> with open('image2.png', 'rb') as f:
...     img2 = f.read()
>>> conn.upsert_streams_data(table_name="LOB_STREAMING_TEST", key="id", data={"id": 1, "img": img, "other": img2})
- update_streams_data(table_name, key, data, schema=None)
Updates data in a SAP HANA table through SQL UPDATE, using HANA LOB streams when the provided data contains bytes.
- Parameters:
- table_name: str
Name of the HANA table in which the data is updated.
- key: str
The key column name.
- data: dict
The keys of data are column names while the values are the data to update. If data contains bytes, the method will use HANA LOB streams method.
- schema: str, optional
The schema name.
Defaults to the current schema.
Examples
>>> with open('image.png', 'rb') as f:
...     img = f.read()
>>> with open('image2.png', 'rb') as f:
...     img2 = f.read()
>>> conn.update_streams_data(table_name="LOB_STREAMING_TEST", key="id", data={"id": 1, "img": img, "other": img2})
- to_sqlalchemy()
Returns a SQLAlchemy engine.
- class hana_ml.dataframe.DataFrame(connection_context, select_statement, _name=None)
Bases:
object
Represents a frame that is backed by a database SQL statement.
- Parameters:
- connection_contextConnectionContext
The connection to the SAP HANA database instance.
- select_statementstr
The SQL query backing the dataframe.
Note
Parameters beyond connection_context and select_statement are intended for internal use. Do not rely on them; they may change without notice.
- property columns
Lists the current DataFrame's column names. Computed lazily and cached. Each access to this property creates a new copy; mutating the list does not alter or corrupt the DataFrame.
- Parameters:
- None
- Returns:
- list of str
A list of column names.
Examples
df is a SAP HANA DataFrame.
>>> df.columns
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target']
- property shape
Computes the shape of the SAP HANA DataFrame.
- Parameters:
- None
- Returns:
- tuple
(The number of rows, the number of columns) in the SAP HANA DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.shape
(1, 3)
- property name
Returns the name of the DataFrame. This value does not correspond to a SAP HANA table name. This value is useful for joining predicates when the joining DataFrames have columns with the same name.
- Parameters:
- None
- Returns:
- str
A str of DataFrame name.
Examples
df is a SAP HANA DataFrame.
>>> df.name
'DT_1'
- property quoted_name
Specifies the escaped name of the original DataFrame. Default-generated DataFrame names are safe to use in SQL without escaping, but names set with DataFrame.alias may require escaping.
- Parameters:
- None
- Returns:
- str
A str of DataFrame name.
Examples
df is a SAP HANA DataFrame.
>>> df.quoted_name
'"DT_1"'
- property description
Returns cur.description from the select_statement query.
- property description_ext
Returns cur.description_ext() from the select_statement query.
- declare_lttab_usage(usage)
Declares whether this DataFrame makes use of local temporary tables.
Some SAP HANA PAL execution routines can execute more efficiently if they know up front whether a DataFrame's SELECT statement requires access to local temporary tables.
- Parameters:
- usagebool
Specifies whether this DataFrame uses local temporary tables.
- disable_validate_columns()
Disable the column validation.
- enable_validate_columns()
Enable the column validation.
- add_id(id_col=None, ref_col=None, starting_point=1)
Returns a new SAP HANA DataFrame with an added <id_col> column.
- Parameters:
- id_colstr, optional
The name of new ID column.
Defaults to "ID".
- ref_colstr or list of str, optional
The id is generated based on ref_col.
- starting_pointint, optional
The starting point of ID.
Defaults to 1.
- Returns:
- DataFrame
A new SAP HANA DataFrame with an added <id_col> column.
Examples
df is a SAP HANA DataFrame.
>>> df.collect()
    X   Y
0  20  30
1  40   5
>>> df.add_id(id_col='ID')
   ID   X   Y
0   1  20  30
1   2  40   5
- add_constant(column_name, value)
Adds a new column with constant value.
- Parameters:
- column_name: str
The name of column to be added.
- value: str
The constant value to be added.
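Examples
A hedged sketch (df is a SAP HANA DataFrame; the column name and value are illustrative):
>>> df.add_constant(column_name='FLAG', value='A').collect()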
- alias(alias)
Returns a new SAP HANA DataFrame with an alias set.
- Parameters:
- aliasstr
The name of the DataFrame.
- Returns:
- DataFrame
A SAP HANA DataFrame with an alias set.
See also
DataFrame.rename_columns
For renaming individual columns.
- count()
Computes the number of rows in the SAP HANA DataFrame.
- Parameters:
- None
- Returns:
- int
The number of rows in the SAP HANA DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.count()
150
- diff(index, periods=1, diff_datetime='days')
Returns a new SAP HANA DataFrame with differenced values.
- Parameters:
- indexint or str
Index of the SAP HANA DataFrame.
- periodsint, optional
Periods to shift for calculating difference, accepts negative values.
Defaults to 1.
- diff_datetime{"years", "months", "days", "seconds", "nano100", "workdays"} or dict, optional
Specifies the difference type. It can also support diff_datetime for particular columns.
Defaults to "days".
- Returns:
- DataFrame
DataFrame with differenced values. No calculation is performed on string columns.
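Examples
A hedged sketch (df is a SAP HANA DataFrame; the 'ID' index column is illustrative):
>>> df.diff(index='ID', periods=1).collect()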
- drop(cols)
Returns a new SAP HANA DataFrame without the specified columns.
- Parameters:
- colslist of str
The list of column names to be dropped.
- Returns:
- DataFrame
A new SAP HANA DataFrame that retains only the columns not listed in cols.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.drop(['B']).collect()
   A
0  1
1  2
- distinct(cols=None)
Returns a new SAP HANA DataFrame with distinct values for the specified columns. If no columns are specified, then the distinct row values from all columns are returned.
- Parameters:
- colsstr or list of str, optional
A column or list of columns to consider when getting distinct values. Defaults to use all columns.
- Returns:
- DataFrame
The DataFrame with distinct values for cols.
Examples
Input:
>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100
Distinct values in a column:
>>> df.distinct("B").collect()
   B
0  A
1  B
Distinct values of a subset of columns:
>>> df.distinct(["A", "B"]).collect()
   A  B
0  1  B
1  2  A
2  1  A
Distinct values of the entire data set:
>>> df.distinct().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
- drop_duplicates(subset=None)
Returns a new SAP HANA DataFrame with duplicate rows removed. All columns in the DataFrame are returned. There is no way to keep specific duplicate rows.
Warning
Specifying a non-None value of subset may produce an unstable DataFrame, the contents of which may be different every time you look at it. Specifically, if two rows are duplicates in their subset columns and have different values in other columns, then a different row could be picked every time you look at the result.
- Parameters:
- subsetlist of str, optional
A list of columns to consider when deciding whether rows are duplicates of each other. Defaults to use all columns.
- Returns:
- DataFrame
A DataFrame with only one copy of duplicate rows.
Examples
Input:
>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100
Drop duplicates based on the values of a subset of columns:
>>> df.drop_duplicates(["A", "B"]).collect()
   A  B    C
0  1  A  100
1  1  B  100
2  2  A  100
Distinct values on the entire data set:
>>> df.drop_duplicates().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
- dropna(how=None, thresh=None, subset=None)
Returns a new DataFrame with NULLs removed.
- Parameters:
- how{'any', 'all'}, optional
If provided, 'any' eliminates rows with any NULLs, and 'all' eliminates rows that are entirely NULLs. If neither how nor thresh is provided, how defaults to 'any'.
- threshint, optional
If provided, rows with fewer than thresh non-NULL values are dropped. You cannot specify both how and thresh.
- subsetlist of str, optional
The columns to consider when looking for NULLs. Values in other columns are ignored, whether they are NULL or not. Defaults to all columns.
- Returns:
- DataFrame
A new SAP HANA DataFrame with a SELECT statement that removes NULLs.
Examples
Dropping rows with any NULL:
>>> df.collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
3  NaN  NaN  NaN
>>> df.dropna().collect()
     A    B    C
0  1.0  3.0  5.0
Dropping rows that are entirely nulls:
>>> df.dropna(how='all').collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
Dropping rows with less than 2 non-null values:
>>> df.dropna(thresh=2).collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
- deselect(cols)
Returns a new DataFrame derived from the current DataFrame, excluding the specified columns.
- Parameters:
- colsstr or tuple/list of str
The columns to exclude from the new DataFrame.
- Returns:
- DataFrame
A new DataFrame object excluding the specified columns.
Examples
Input:
>>> df.collect()
   A  B  C
0  1  2  3
Excluding a subset of existing columns:
>>> df.deselect(['A', 'B']).collect()
   C
0  3
- has_constant_columns()
Returns a sequence of constant columns in the DataFrame.
- drop_constant_columns()
Returns a DataFrame without constant columns.
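Examples
A hedged sketch (df is a SAP HANA DataFrame):
>>> df.has_constant_columns()
>>> df.drop_constant_columns().collect()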
- dtypes(subset=None)
Returns a sequence of tuples describing the DataFrame's SQL types.
The tuples list the name, SQL type name, display_size, internal_size, precision and scale corresponding to the DataFrame's columns.
- Parameters:
- subsetlist of str, optional
The columns that the information is generated from. Defaults to all columns.
- Returns:
- dtypeslist of tuples
Each tuple consists of the name, SQL type name, display_size, internal_size, precision and scale for one of the DataFrame's columns. The list is in the order specified by subset, or in the DataFrame's original column order if a subset is not provided.
- empty()
Returns True if this DataFrame has 0 rows.
- Parameters:
- None
- Returns:
- bool
True if the DataFrame is empty.
Notes
If a DataFrame contains only NULLs, it is not considered empty.
Examples
>>> df1.collect()
Empty DataFrame
Columns: [ACTUAL, PREDICT]
Index: []
>>> df1.empty()
True
>>> df2.collect()
  ACTUAL PREDICT
0   None    None
>>> df2.empty()
False
- filter(condition)
Selects rows that match the given condition.
Very little checking is done on the condition string. Use only with trusted inputs.
- Parameters:
- conditionstr or list
A filter condition. Format as SQL <condition>.
- Returns:
- DataFrame
A DataFrame with rows that match the given condition.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition. May have false positives and false negatives.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.filter('B < 4').collect()
   A  B
0  1  3
- has(col)
Returns True if a column is in the DataFrame.
- Parameters:
- colstr
The name of column to search in the projection list of this DataFrame.
- Returns:
- bool
Returns True if the column exists in the DataFrame's projection list.
Examples
>>> df.columns
['A', 'B']
>>> df.has('A')
True
>>> df.has('C')
False
- head(n=1)
Returns a DataFrame of the first n rows in the current DataFrame.
- Parameters:
- nint, optional
The number of rows returned.
Defaults to 1.
- Returns:
- DataFrame
A new DataFrame of the first n rows of the current DataFrame.
- hasna(cols=None)
Returns True if a DataFrame contains NULLs.
- Parameters:
- colsstr or list of str, optional
A column or list of columns to be checked for NULL values. Defaults to all columns.
- Returns:
- bool
True if this DataFrame contains NULLs.
Examples
>>> df1.collect()
   ACTUAL PREDICT
0     1.0    None
>>> df1.hasna()
True
- fillna(value, subset=None)
Returns a DataFrame with NULLs replaced with a specified value.
- Parameters:
- valueint or float
The value that replaces NULL. value should have a type that is appropriate for the selected columns.
- subsetlist of str, optional
A list of columns whose NULL values will be replaced. Defaults to all columns.
- Returns:
- DataFrame
A new DataFrame with the NULL values replaced.
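Examples
A hedged sketch (df is a SAP HANA DataFrame; the column names are illustrative):
>>> df.fillna(0, subset=['A', 'B']).collect()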
- get_table_structure()
Returns the table structure as a dict of the form {column name: column type}.
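Examples
A hedged sketch (df is a SAP HANA DataFrame; the column names and types are illustrative of the dict format):
>>> df.get_table_structure()
{'A': 'INT', 'B': 'DOUBLE'}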
- join(other, condition=None, how='inner', select=None)
Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.
- Parameters:
- otherDataFrame or list of DataFrame
The DataFrame to join with.
- conditionstr, optional
The join predicate. If an index has been set, the index is used as the join key.
Defaults to None.
- how{'inner', 'left', 'right', 'outer', 'cross'}, optional
The type of join.
Defaults to 'inner'.
- selectlist, optional
If provided, each element specifies a column in the result. A string in the select list should be the name of a column in one of the input DataFrames. An (expression, name) tuple creates a new column with the given name, computed from the given expression.
If this value is not provided, defaults to selecting all columns from both DataFrames, with the left DataFrame's columns first.
- Returns:
- DataFrame
A new DataFrame object made from the join() of the current DataFrame with another DataFrame.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition or in a column expression. May have false positives and false negatives.
Examples
Use the expression selection functionality to disambiguate duplicate column names in a join():
>>> df1.collect()
   A  B    C
0  1  2  3.5
1  2  4  5.6
2  3  3  1.1
>>> df2.collect()
   A  B     D
0  2  1  14.0
1  3  4   5.6
2  4  3   0.0
Old method:
>>> df1.alias('L').join(df2.alias('R'), 'L.A = R.A', select=[
...     ('L.A', 'A'),
...     ('L.B', 'B_LEFT'),
...     ('R.B', 'B_RIGHT'),
...     'C',
...     'D']).collect()
   A  B_LEFT  B_RIGHT    C     D
0  2       4        1  5.6  14.0
1  3       3        4  1.1   5.6
New method:
>>> df1.set_index("A").join(df2.rename_columns({"B": "B2"}).set_index("A")).collect()
   A  B  B2    C    D
0  2  4   1  5.6   14
1  3  3   4  1.1  5.6
- set_name(name)
Sets the name of the DataFrame.
- Parameters:
- namestr
The name of the DataFrame.
- set_index(keys)
Sets the index of the DataFrame.
- Parameters:
- keysstr or list of str
This parameter can be either a single column key or a list of column keys.
- save(where, table_type=None, force=False, save_source=True, append=False, data_lake=False, data_lake_container='SYSRDL#CG', view_structure=None)
Creates a table or view holding the current DataFrame's data.
- Parameters:
- wherestr or (str, str) tuple
The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.
- table_typestr, optional
The type of table to create. The value is case insensitive.
Permanent table options:
"ROW"
"COLUMN"
"HISTORY COLUMN"
Temporary table options:
"GLOBAL TEMPORARY"
"GLOBAL TEMPORARY COLUMN"
"LOCAL TEMPORARY"
"LOCAL TEMPORARY COLUMN"
Not a table:
"VIEW"
Defaults to 'LOCAL TEMPORARY COLUMN' if where starts with '#'. Otherwise, the default is 'COLUMN'.
- forcebool, optional
If force is True, it will replace the existing table.
Defaults to False.
- save_sourcebool, optional
If True, it will save the name of source table.
Defaults to True.
- appendbool, optional
If True, it will use the existing table and append data to it.
Defaults to False.
- data_lakebool, optional
If True, it will save the table to HANA data lake.
Defaults to False.
- data_lake_containerstr, optional
Name of the data lake container.
Defaults to 'SYSRDL#CG'.
- view_structuredict, optional
Define the parameters in the view. Only valid when table_type="VIEW".
Defaults to None.
- Returns:
- DataFrame
A DataFrame that represents the new table or view.
Note
For this operation to succeed, the table name must not be in use, the schema must exist, and the user must have permission to create tables (or views) in the target schema.
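Examples
A hedged sketch (df is a SAP HANA DataFrame; the table names are illustrative). A name starting with '#' creates a local temporary column table:
>>> df.save('#MY_TEMP_TBL')
>>> df.save(('MY_SCHEMA', 'MY_TBL'), table_type='COLUMN', force=True)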
- save_nativedisktable(where, force=False, save_source=True)
Materializes the DataFrame to SAP HANA native disk.
- Parameters:
- wherestr or (str, str) tuple
The table name or (schema name, table name) tuple. If no schema is provided, then the table or view is created in the current schema.
- forcebool, optional
If force is True, it will replace the existing table.
- save_sourcebool, optional
If True, it will save the name of source table. Defaults to True.
- Returns:
- DataFrame
A DataFrame that represents the new table.
- split_column(column, separator, new_column_names)
Returns a new DataFrame with split columns.
- Parameters:
- columnstr
The column to be split.
- separatorstr
The separator.
- new_column_nameslist of str
The names of the split columns.
- Returns:
- DataFrame
New DataFrame object with split columns as specified.
Examples
>>> df.collect()
   ID    COL
0   1  1,2,3
1   2  3,4,4
>>> df.split_column(column="COL", separator=",", new_column_names=["COL1", "COL2", "COL3"]).collect()
   ID    COL COL1 COL2 COL3
0   1  1,2,3    1    2    3
1   2  3,4,4    3    4    4
- concat_columns(columns, separator)
Returns a new DataFrame with a concatenated column.
- Parameters:
- columnslist of str
A list of columns to be concatenated.
- separatorstr
The separator.
- Returns:
- DataFrame
New DataFrame object with the concatenated column as specified.
Examples
>>> df.collect()
   ID  COL1  COL2  COL3
0   1     1     2     3
1   2     3     4     4
>>> df.concat_columns(columns=["COL1", "COL2", "COL3"], separator=",").collect()
   ID  COL1  COL2  COL3 COL1,COL2,COL3
0   1     1     2     3          1,2,3
1   2     3     4     4          3,4,4
- nullif(value)
Replaces the specified value with NULL.
- Parameters:
- value: scalar or dict
The value to be replaced by NULL. If the type is dict, its keys indicate the column names.
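Examples
A hedged sketch (df is a SAP HANA DataFrame; the sentinel value and column name are illustrative):
>>> df.nullif(999)
>>> df.nullif({'PDAYS': 999})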
- replace(to_replace=None, value=None, regex=False)
Returns a new DataFrame with replaced value.
- Parameters:
- to_replacenumeric, str or dict, optional
The value/pattern to be replaced. If regex is True, the regex will be used instead of value.
numeric or str: the value equal to to_replace will be replaced by value.
dict:
If value is None, the value equal to a key of to_replace will be replaced by the corresponding dict value. If it is nested JSON, the to-be-replaced columns are restricted by the first-level keys.
If value is numeric or str, the value equal to a value of to_replace will be replaced by value. The to-be-replaced columns are restricted by the keys of to_replace.
If value is dict, the value equal to a value of to_replace will be replaced by the value of value under the same key. The to-be-replaced columns and the replacement values are restricted by the keys of to_replace and value.
None: regex will be used.
- valuenumeric, str or dict, optional
Value to replace with.
numeric or str: the value to replace to_replace, or values matching the pattern if regex is given.
dict: the replacement takes place under the columns equal to the keys.
- regexbool or str, optional
Whether to use regex.
bool: use regex if True.
str: works the same as to_replace if to_replace is None.
- Returns:
- DataFrame
New DataFrame object with replaced values.
Examples
>>> df.collect()
   Aa  Bb Cb
0   0   5  a
1  10   0  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace=0, value=5).collect()
   Aa  Bb Cb
0   5   5  a
1  10   5  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace={0: 10, 1: 100}).collect()
   Aa  Bb Cb
0  10   5  a
1  10  10  b
2   2   7  c
3   3   8  d
4   4   9  e
>>> df.replace(to_replace={'Aa': 0, 'Bb': 5}, value=100).collect()
    Aa   Bb Cb
0  100  100  a
1   10    0  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace(to_replace={'Aa': 0, 'Bb': 5}, value={'Aa': 100, 'Bb': 50}).collect()
    Aa  Bb Cb
0  100  50  a
1   10   0  b
2    2   7  c
3    3   8  d
4    4   9  e
>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
    Aa  Bb Cb
0  100   5  a
1   10   0  b
2    2   7  c
3    3   8  d
4  400   9  e
>>> df2.collect()
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
>>> df2.replace(to_replace=r'^ba.$', value='new', regex=True).collect()
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df2.replace(to_replace={'A': r'^ba.$'}, value={'A': 'new'}, regex=True).collect()
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df2.replace(regex=r'^ba.$', value='new').collect()
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df2.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'}).collect()
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
- sort(cols, desc=False)
Returns a new DataFrame sorted by the specified columns.
- Parameters:
- colsstr or list of str
A column or list of columns to sort by. If a list is specified, then the sort order in parameter desc is used for all columns.
- descbool, optional
Set to True to sort in descending order. Defaults to False, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
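Examples
A hedged sketch (df is a SAP HANA DataFrame; the column name is illustrative):
>>> df.sort(['A'], desc=True).collect()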
- sort_values(by, ascending=True)
Returns a new DataFrame sorted by the specified columns.
- Parameters:
- bystr or list of str
A column or list of columns to sort by. If a list is specified, then the sort order in parameter ascending is used for all columns.
- ascendingbool, optional
Set to False to sort in descending order.
Defaults to True, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
- sort_index(ascending=True)
Returns a new DataFrame sorted by the index.
- Parameters:
- ascendingbool, optional
Set to False to sort in descending order.
Defaults to True, for ascending order.
- Returns:
- DataFrame
New DataFrame object with rows sorted as specified.
- select(*cols)
Returns a new DataFrame with columns derived from the current DataFrame.
Warning
There is no check that inputs interpreted as SQL expressions are actually valid expressions; an "expression" like "A FROM TAB; DROP TABLE IMPORTANT_THINGS; SELECT A" can cause a lot of damage.
- Parameters:
- colsstr or (str, str) tuple.
The columns in the new DataFrame. A string is treated as the name of a column to select; a (str, str) tuple is treated as (SQL expression, alias). As a special case, '*' is expanded to all columns of the original DataFrame.
- Returns:
- DataFrame
A new DataFrame object with the specified columns.
- Raises:
- hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in a column expression. May have false positives and false negatives.
Examples
Input:
>>> df.collect()
   A  B  C
0  1  2  3
Selecting a subset of existing columns:
>>> df.select('A', 'B').collect()
   A  B
0  1  2
Computing a new column based on existing columns:
>>> df.select('*', ('B * 4', 'D')).collect()
   A  B  C  D
0  1  2  3  8
- set_operations(other, all=True, op='UNION')
Combines this DataFrame's rows and another DataFrame's rows into one DataFrame. This operation is equivalent to a SQL UNION/INTERSECT/EXCEPT ALL.
- Parameters:
- otherDataFrame, list of DataFrame
The right side of the operation.
- allbool, optional
If True, keep duplicate rows; equivalent to UNION/INTERSECT/EXCEPT ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the operation); equivalent to a UNION/INTERSECT/EXCEPT or a UNION/INTERSECT/EXCEPT ALL followed by DISTINCT in SQL. Defaults to True.
- Returns:
- DataFrame
The combined data from self and other.
Examples
We have two DataFrames we want to union, with some duplicate rows:
>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4
>>> df1.set_operations(df2, op='UNION').collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4
If we want to use EXCEPT instead of UNION, we can use set_operations:
>>> df1.set_operations(df2, all=False, op='EXCEPT').collect()
   A  B
0  1  2
1  1  2
- union(other, all=True)
Combines this DataFrame's rows and another DataFrame's rows into one DataFrame. This operation is equivalent to a SQL UNION ALL.
- Parameters:
- otherDataFrame, list of DataFrame
The right side of the union.
- allbool, optional
If True, keep duplicate rows; equivalent to UNION ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the union); equivalent to a UNION or a UNION ALL followed by DISTINCT in SQL. Defaults to True.
- Returns:
- DataFrame
The combined data from self and other.
Examples
We have two DataFrames we want to union, with some duplicate rows:
>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4
union() produces a DataFrame that contains all rows of both df1 and df2, like a UNION ALL:
>>> df1.union(df2).collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4
To get the deduplication effect of a UNION DISTINCT, pass all=False or call distinct() after union():
>>> df1.union(df2, all=False).collect()
   A  B
0  1  2
1  2  3
2  3  4
>>> df1.union(df2).distinct().collect()
   A  B
0  1  2
1  2  3
2  3  4
- collect(fetch_size=32000, geometries=True, convert_geo_to_shapely=True)
Copies the current DataFrame to a new Pandas DataFrame.
- Parameters:
- fetch_sizeint, optional
Fetch size in hdbcli.
- geometriesbool, optional
With this flag set to True (default), the geometries are converted to Well-Known-Text representations in the resulting Pandas dataframe. Even if they are converted to Shapely objects (see convert_geo_to_shapely), when you print the dataframe, the geometry columns are represented as Well-Known-Text.
If you need the raw binary values, set this flag to False.
- convert_geo_to_shapelybool, optional
If set to True (default), all geometry columns will be converted to a Shapely object, so that the dataframe can be directly used in visualization libraries for example. If your processing does not support Shapely objects, you can switch this conversion off. In this case the columns remain of type String.
Note
Before the conversion, each column is checked for NULL values. If any are found, that column is not converted to Shapely objects, because NULL WKT strings are not supported by Shapely.
- Returns:
- pandas.DataFrame
A Pandas DataFrame that contains the current DataFrame's data.
Examples
Viewing a hana_ml DataFrame doesn't execute the underlying SQL or fetch the data:
>>> df = cc.table('T')
>>> df
<hana_ml.dataframe.DataFrame object at 0x7f2b7f24ddd8>
Using collect() executes the SQL and fetches the results into a Pandas DataFrame:
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> type(df.collect())
<class 'pandas.core.frame.DataFrame'>
- property geometries: list
Returns the geometries of a data frame. The list is empty if there are none.
- Returns:
- list
List with geometry columns
- property srids: dict
Returns the srid for each geometry column in the dataframe. If none is found, the dictionary will be {}.
For dataframes based on HANA catalog objects, the information is read from the catalog. For dataframes which do not have a catalog object (e.g. those based on SQL statements or temporary tables), the SRID is derived by selecting the first row in the table and reading it directly from the EWKB. For columns with multiple SRSes (SRID NULL), this means you might get back an SRS that differs from other entries in the same column.
Known Limitation: For dataframes which don't have catalog objects and do not contain data, no SRID can be provided.
- Returns:
- dict
Dictionary with the SRID per column: {<column_name>: <srid>} Returns {} when none are found.
- rename_columns(names)
Returns a DataFrame with renamed columns.
- Parameters:
- nameslist or dict
If a list, specifies new names for every column in this DataFrame. If a dict, each dict entry maps an existing name to a new name, and not all columns need to be renamed.
- Returns:
- DataFrame
The same data as the original DataFrame with new column names.
See also
DataFrame.alias
For renaming the DataFrame itself.
Examples
>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.rename_columns(['C', 'D']).collect()
   C  D
0  1  3
1  2  4
>>> df.rename_columns({'B': 'D'}).collect()
   A  D
0  1  3
1  2  4
- auto_cast(type_convert)
Returns a DataFrame with converted column types.
- Parameters:
- type_convertdict
Specifies the original type and the new type as a dict, e.g. {"INT": "DOUBLE"}.
- Returns:
- DataFrame
The same data as this DataFrame, but with columns cast to the specified type.
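Examples
A hedged sketch that casts every INT column to DOUBLE (df is a SAP HANA DataFrame):
>>> df.auto_cast({'INT': 'DOUBLE'})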
- cast(cols, new_type=None)
Returns a DataFrame with columns cast to a new type.
- The name of the column in the returned DataFrame is the same as the original column.
Warning
Type compatibility between the existing column type and the new type is not checked. An incompatibility results in an error.
- Parameters:
- colsstr, list of str or dict
The column(s) to be cast to a different type.
- new_typestr
The database datatype to cast the column(s) to. No checks are performed to see if the new type is valid. An invalid type can lead to SQL errors or even SQL injection vulnerabilities.
- Returns:
- DataFrame
The same data as this DataFrame, but with columns cast to the specified type.
Examples
Input:
>>> df1 = cc.sql('SELECT "AGE", "PDAYS", "HOUSING" FROM DBM_TRAINING_TBL')
>>> df1.dtypes()
[('AGE', 'INT', 10, 10, 10, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
Casting a column to NVARCHAR(20):
>>> df2 = df1.cast('AGE', 'NVARCHAR(20)')
>>> df2.dtypes()
[('AGE', 'NVARCHAR', 20, 20, 20, 0), ('PDAYS', 'INT', 10, 10, 10, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
Casting a list of columns to NVARCHAR(50):
>>> df3 = df1.cast(['AGE', 'PDAYS'], 'NVARCHAR(50)')
>>> df3.dtypes()
[('AGE', 'NVARCHAR', 50, 50, 50, 0), ('PDAYS', 'NVARCHAR', 50, 50, 50, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
>>> df4 = df1.cast({'AGE': 'VARCHAR(50)', 'PDAYS': 'INT'})
>>> df4.dtypes()
[('AGE', 'VARCHAR', 50, 50, 50, 0), ('PDAYS', 'INT', 50, 50, 50, 0), ('HOUSING', 'VARCHAR', 100, 100, 100, 0)]
- tail(n=1, ref_col=None)
Returns a DataFrame of the last n rows in the current DataFrame.
- Parameters:
- nint, optional
The number of rows returned.
Defaults to 1.
- ref_colstr or list of str, optional
The column(s) to sort the DataFrame by before taking the last n rows.
Defaults to None.
- Returns:
- DataFrame
A new DataFrame of the last n rows of the current DataFrame.
- to_head(col)
Returns a DataFrame with specified column as the first item in the projection.
- Parameters:
- colstr
The column to move to the first position.
- Returns:
- DataFrame
The same data as this DataFrame but with the specified column in the first position.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED', 'LABEL']
Moving the column 'LABEL' to head:
>>> df2 = df1.to_head('LABEL')
>>> pprint.pprint(df2.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
- to_tail(col)
Returns a DataFrame with specified column as the last item in the projection.
- Parameters:
- colstr
The column to move to the last position.
- Returns:
- DataFrame
The same data as this DataFrame but with the specified column in the last position.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
Moving the column 'LABEL' to the tail:
>>> df2 = df1.to_tail('LABEL')
>>> pprint.pprint(df2.columns)
['ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED', 'LABEL']
- summary(cols=None, select_stat=None, pivoted=True)
Returns a DataFrame that contains various statistics for the requested column(s).
- Parameters:
- colsstr or list, optional
The column(s) to be described. Defaults to all columns.
- select_statstr or list, optional
The statistics to be selected. "nulls", "unique" and "valid observations" are always computed.
"min": the minimum value
"lower_quartile": the lower quartile
"median": the median value
"upper_quartile": the upper quartile
"max": the max value
"mean": the average
"lower_mean_ci" : the lower bound of the mean value
"upper_mean_ci": the upper bound of the mean value
"trimmed_mean": the average of data after taking out 5% of the largest elements and the same amount of the smallest elements
"variance": the variance
"sd": the standard deviation
"skewness": the measure of the asymmetry of a distribution
"kurtosis": the measure of the tailedness of a distribution
- pivotedbool, optional
Whether the output format is pivoted.
Defaults to True.
- Returns:
- DataFrame
A DataFrame that contains (selected) statistics for the specified column(s).
- property stats
The statistics of the dataframe.
- describe(cols=None, version='v1')
Returns a DataFrame that contains various statistics for the requested column(s).
- Parameters:
- colsstr or list, optional
The column(s) to be described. Defaults to all columns.
- version{'v1', 'v2'}, optional
Version v2 will use PAL instead of dynamic SQL.
Defaults to 'v1'.
- Returns:
- DataFrame
A DataFrame that contains statistics for the specified column(s) in the current DataFrame.
The statistics included are:
the count of rows ("count"),
the number of distinct values ("unique"),
the number of nulls ("nulls"),
the average ("mean"),
the standard deviation ("std"),
the median ("median"),
the minimum value ("min"),
the maximum value ("max"),
the 25% percentile when treated as continuous variable ("25_percent_cont"),
the 25% percentile when treated as discrete variable ("25_percent_disc"),
the 50% percentile when treated as continuous variable ("50_percent_cont"),
the 50% percentile when treated as discrete variable ("50_percent_disc"),
the 75% percentile when treated as continuous variable ("75_percent_cont"),
the 75% percentile when treated as discrete variable ("75_percent_disc").
For columns that are strings, statistics such as average ("mean"), standard deviation ("std"), median ("median"), and the various percentiles are NULLs.
If the list of columns contain both string and numeric data types, minimum and maximum values become NULLs.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df2.columns)
['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
Describe a few numeric columns and collect them to return a Pandas DataFrame:
>>> df1.describe(['AGE', 'PDAYS']).collect()
  column  count  unique  nulls        mean         std  min  max  median
0    AGE  16895      78      0   40.051376   10.716907   17   98      38
1  PDAYS  16895      24      0  944.406688  226.331944    0  999     999
   25_percent_cont  25_percent_disc  50_percent_cont  50_percent_disc
0             32.0               32             38.0               38
1            999.0              999            999.0              999
   75_percent_cont  75_percent_disc
0             47.0               47
1            999.0              999
Describe some non-numeric columns and collect them to return a Pandas DataFrame:
>>> df1.describe(['JOB', 'MARITAL']).collect()
    column  count  unique  nulls  mean   std       min      max median
0      JOB  16895      12      0  None  None    admin.  unknown   None
1  MARITAL  16895       4      0  None  None  divorced  unknown   None
  25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc
0            None            None            None            None
1            None            None            None            None
  75_percent_cont 75_percent_disc
0            None            None
1            None            None
Describe all columns in a DataFrame:
>>> df1.describe().collect()
            column  count  unique  nulls          mean           std
0               ID  16895   16895      0  21282.286652  12209.759725
1              AGE  16895      78      0     40.051376     10.716907
2         DURATION  16895    1267      0    263.965670    264.331384
3         CAMPAIGN  16895      35      0      2.344658      2.428449
4            PDAYS  16895      24      0    944.406688    226.331944
5         PREVIOUS  16895       7      0      0.209529      0.539450
6     EMP_VAR_RATE  16895      10      0     -0.038798      1.621945
7   CONS_PRICE_IDX  16895      26      0     93.538844      0.579189
8    CONS_CONF_IDX  16895      26      0    -40.334123      4.865720
9        EURIBOR3M  16895     283      0      3.499297      1.777986
10      NREMPLOYED  16895      11      0   5160.371885     75.320580
11             JOB  16895      12      0           NaN           NaN
12         MARITAL  16895       4      0           NaN           NaN
13       EDUCATION  16895       8      0           NaN           NaN
14     DBM_DEFAULT  16895       2      0           NaN           NaN
15         HOUSING  16895       3      0           NaN           NaN
16            LOAN  16895       3      0           NaN           NaN
17         CONTACT  16895       2      0           NaN           NaN
18       DBM_MONTH  16895      10      0           NaN           NaN
19     DAY_OF_WEEK  16895       5      0           NaN           NaN
20        POUTCOME  16895       3      0           NaN           NaN
21           LABEL  16895       2      0           NaN           NaN
         min        max     median  25_percent_cont  25_percent_disc
0      5.000  41187.000  21786.000        10583.500        10583.000
1     17.000     98.000     38.000           32.000           32.000
2      0.000   4918.000    184.000          107.000          107.000
3      1.000     43.000      2.000            1.000            1.000
4      0.000    999.000    999.000          999.000          999.000
5      0.000      6.000      0.000            0.000            0.000
6     -3.400      1.400      1.100           -1.800           -1.800
7     92.201     94.767     93.444           93.075           93.075
8    -50.800    -26.900    -41.800          -42.700          -42.700
9      0.634      5.045      4.856            1.313            1.313
10  4963.000   5228.000   5191.000         5099.000         5099.000
11       NaN        NaN        NaN              NaN              NaN
12       NaN        NaN        NaN              NaN              NaN
13       NaN        NaN        NaN              NaN              NaN
14       NaN        NaN        NaN              NaN              NaN
15       NaN        NaN        NaN              NaN              NaN
16       NaN        NaN        NaN              NaN              NaN
17       NaN        NaN        NaN              NaN              NaN
18       NaN        NaN        NaN              NaN              NaN
19       NaN        NaN        NaN              NaN              NaN
20       NaN        NaN        NaN              NaN              NaN
21       NaN        NaN        NaN              NaN              NaN
    50_percent_cont  50_percent_disc  75_percent_cont  75_percent_disc
0         21786.000        21786.000        32067.500        32068.000
1            38.000           38.000           47.000           47.000
2           184.000          184.000          324.000          324.000
3             2.000            2.000            3.000            3.000
4           999.000          999.000          999.000          999.000
5             0.000            0.000            0.000            0.000
6             1.100            1.100            1.400            1.400
7            93.444           93.444           93.994           93.994
8           -41.800          -41.800          -36.400          -36.400
9             4.856            4.856            4.961            4.961
10         5191.000         5191.000         5228.000         5228.000
11              NaN              NaN              NaN              NaN
12              NaN              NaN              NaN              NaN
13              NaN              NaN              NaN              NaN
14              NaN              NaN              NaN              NaN
15              NaN              NaN              NaN              NaN
16              NaN              NaN              NaN              NaN
17              NaN              NaN              NaN              NaN
18              NaN              NaN              NaN              NaN
19              NaN              NaN              NaN              NaN
20              NaN              NaN              NaN              NaN
21              NaN              NaN              NaN              NaN
- bin(col, strategy='uniform_number', bins=None, bin_width=None, bin_column='BIN_NUMBER')
Returns a DataFrame with the original columns as well as bin assignments.
The names of the columns in the returned DataFrame are the same as in the original. A column "BIN_NUMBER", or the value specified in bin_column, is added and corresponds to the assigned bin.
- Parameters:
- colstr
The column on which binning is performed. The column must be numeric.
- strategy{'uniform_number', 'uniform_size'}, optional
Binning methods:
'uniform_number': Equal widths based on the number of bins.
'uniform_size': Equal widths based on the bin size.
Default value is 'uniform_number'.
- binsint, optional
The number of equal-width bins. Only valid when strategy is 'uniform_number'.
Defaults to 10.
- bin_widthint, optional
The interval width of each bin. Only valid when strategy is 'uniform_size'.
- bin_columnstr, optional
The name of the output column that contains the bin number.
- Returns:
- DataFrame
A binned dataset with the same data as this DataFrame, plus an additional column "BIN_NUMBER" (or the value specified in bin_column). This additional column contains the assigned bin for each row.
Examples
Input:
>>> df.collect()
   C1   C2    C3       C4
0   1  1.2   2.0      1.0
1   2  1.4   4.0      3.0
2   3  1.6   6.0      9.0
3   4  1.8   8.0     27.0
4   5  2.0  10.0     81.0
5   6  2.2  12.0    243.0
6   7  2.4  14.0    729.0
7   8  2.6  16.0   2187.0
8   9  2.8  18.0   6561.0
9  10  3.0  20.0  19683.0
Create five bins of equal widths on C1:
>>> df.bin('C1', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5
Create five bins of equal widths on C3:
>>> df.bin('C3', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5
Create five bins of equal widths on a column that varies significantly:
>>> df.bin('C4', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           1
3   4  1.8   8.0     27.0           1
4   5  2.0  10.0     81.0           1
5   6  2.2  12.0    243.0           1
6   7  2.4  14.0    729.0           1
7   8  2.6  16.0   2187.0           1
8   9  2.8  18.0   6561.0           2
9  10  3.0  20.0  19683.0           5
Create bins of equal width:
>>> df.bin('C1', strategy='uniform_size', bin_width=3).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           2
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           3
7   8  2.6  16.0   2187.0           3
8   9  2.8  18.0   6561.0           4
9  10  3.0  20.0  19683.0           4
- agg(agg_list, group_by=None)
Returns a SAP HANA DataFrame with the group_by column along with the aggregates. This method supports all aggregation functions in the SAP HANA database instance, such as 'max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var'. The name of the column in the returned DataFrame is the same as the original column.
Aggregation functions can be referred to SAP HANA aggregate functions.
- Parameters:
- agg_listA list of tuples
A list of tuples. Each tuple is a triplet. The triplet consists of (aggregate_operator, expression, name) where:
aggregate_operator is one of ['max', 'min', 'count', 'avg', 'sum', 'median', 'stddev', 'var', ...]. The operator name is identical to SAP HANA SQL naming, and all aggregation functions in the SAP HANA database instance are supported.
expression is a str that is a column or column expression
name is the name of this aggregate in the project list.
- group_bystr or list of str, optional
The group by column. Only a column is allowed although expressions are allowed in SQL. To group by an expression, create a DataFrame by providing the entire SQL. So, if you have a table T with columns C1, C2, and C3 that are all integers, to calculate the max(C1) grouped by (C2+C3) a DataFrame would need to be created as below:
>>> cc.sql('SELECT "C2"+"C3", max("C1") FROM "T" GROUP BY "C2"+"C3"')
- Returns:
- DataFrame
A DataFrame containing the group_by column (if it exists), as well as the aggregate expressions that are aliased with the specified names.
Examples
Input:
>>> df.collect()
    ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM          SPECIES
0    1            5.1           3.5            1.4           0.2      Iris-setosa
1    2            4.9           3.0            1.4           0.2      Iris-setosa
2    3            4.7           3.2            1.3           0.2      Iris-setosa
3   51            7.0           3.2            4.7           1.4  Iris-versicolor
4   52            6.4           3.2            4.5           1.5  Iris-versicolor
5  101            6.3           3.3            6.0           2.5   Iris-virginica
6  102            5.8           2.7            5.1           1.9   Iris-virginica
7  103            7.1           3.0            5.9           2.1   Iris-virginica
8  104            6.3           2.9            5.6           1.8   Iris-virginica
Another way to do a count:
>>> df.agg([('count', 'SPECIES', 'COUNT')]).collect()
   COUNT
0      9
Get counts by SPECIES:
>>> df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES').collect()
           SPECIES  COUNT
0  Iris-versicolor      2
1   Iris-virginica      4
2      Iris-setosa      3
Get max values of SEPALLENGTHCM by SPECIES:
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH')], group_by='SPECIES').collect()
           SPECIES  MAX_SEPAL_LENGTH
0  Iris-versicolor               7.0
1   Iris-virginica               7.1
2      Iris-setosa               5.1
Get max and min values of SEPALLENGTHCM by SPECIES:
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH'),
...         ('min', 'SEPALLENGTHCM', 'MIN_SEPAL_LENGTH')], group_by=['SPECIES']).collect()
           SPECIES  MAX_SEPAL_LENGTH  MIN_SEPAL_LENGTH
0  Iris-versicolor               7.0               6.4
1   Iris-virginica               7.1               5.8
2      Iris-setosa               5.1               4.7
Get aggregate grouping by multiple columns:
>>> df.agg([('count', 'SEPALLENGTHCM', 'COUNT_SEPAL_LENGTH')], group_by=['SPECIES', 'PETALLENGTHCM']).collect()
           SPECIES  PETALLENGTHCM  COUNT_SEPAL_LENGTH
0   Iris-virginica            6.0                   1
1      Iris-setosa            1.3                   1
2   Iris-virginica            5.9                   1
3   Iris-virginica            5.6                   1
4      Iris-setosa            1.4                   2
5  Iris-versicolor            4.7                   1
6  Iris-versicolor            4.5                   1
7   Iris-virginica            5.1                   1
- is_numeric(cols=None)
Returns True if the column(s) in the DataFrame are numeric.
- Parameters:
- colsstr or list, optional
The column(s) to be tested for being numeric.
Defaults to all columns.
- Returns:
- bool
True if all the columns are numeric.
Examples
Input:
>>> df.head(5).collect()
   ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM      SPECIES
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
>>> pprint.pprint(df.dtypes())
[('ID', 'INT', 10, 10, 10, 0),
 ('SEPALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SEPALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALLENGTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('PETALWIDTHCM', 'DOUBLE', 15, 15, 15, 0),
 ('SPECIES', 'NVARCHAR', 15, 15, 15, 0)]
Test a single column:
>>> df.is_numeric('ID')
True
>>> df.is_numeric('SEPALLENGTHCM')
True
>>> df.is_numeric(['SPECIES'])
False
Test a list of columns:
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
True
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'SPECIES'])
False
- corr(first_col, second_col)
Returns a DataFrame that gives the correlation coefficient between two numeric columns.
All rows with NULL values for first_col or second_col are removed prior to calculating the correlation coefficient.
Let col1 be the values of first_col and col2 be the values of second_col; then the correlation coefficient is:
1/(n-1) * sum((col1 - avg(col1)) * (col2 - avg(col2))) / (stddev(col1) * stddev(col2))
- Parameters:
- first_colstr
The first column for calculating the correlation coefficient.
- second_colstr
The second column for calculating the correlation coefficient.
- Returns:
- DataFrame
A DataFrame with one value that contains the correlation coefficient. The name of the column is CORR_COEFF.
Examples
Input:
>>> df.columns
['C1', 'C2', 'C3', 'C4']
>>> df.collect()
   C1   C2      C3       C4
0   1  1.2     2.0      1.0
1   2  1.4     4.0      3.0
2   3  1.6     8.0      9.0
3   4  1.8    16.0     27.0
4   5  2.0    32.0     81.0
5   6  2.2    64.0    243.0
6   7  2.4   128.0    729.0
7   8  2.6   256.0   2187.0
8   9  2.8   512.0   6561.0
9  10  3.0  1024.0  19683.0
Correlation with columns that are well correlated:
>>> df.corr('C1', 'C2').collect()
   CORR_COEFF
0         1.0
>>> df.corr('C1', 'C3').collect()
   CORR_COEFF
0         1.0
Correlation with a column whose value is three times its previous value:
>>> df.corr('C1', 'C4').collect()
   CORR_COEFF
0    0.696325
- min()
Gets the minimum value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- max()
Gets the maximum value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- sum()
Gets the sum of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- median()
Gets the median value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- mean()
Gets the mean value of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
- stddev()
Gets the standard deviation of the columns. It simplifies the use of the agg function.
- Returns:
- scalar or Series
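Examples
A minimal usage sketch for these convenience aggregates (assuming df is the Iris DataFrame used above; a single selected column yields a scalar, several columns yield a Series):
>>> df.select('SEPALLENGTHCM').max()
>>> df.select('SEPALLENGTHCM', 'SEPALWIDTHCM').mean()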
- value_counts(subset=None)
Gets the value counts of the columns. It simplifies the use of the agg function.
- Parameters:
- subsetlist, optional
Columns to use when counting unique combinations.
- Returns:
- DataFrame
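Examples
A minimal sketch (assuming df is the Iris DataFrame used above):
>>> df.value_counts(subset=['SPECIES']).collect()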
- pivot_table(values, index, columns, aggfunc='avg')
Returns a DataFrame that gives the pivoted table. aggfunc is identical to SAP HANA aggregate functions.
- Parameters:
- valuesstr or list of str
The targeted values for pivoting.
- indexstr or list of str
The index of the DataFrame.
- columnsstr or list of str
The pivoting columns.
- aggfunc{'avg', 'max', 'min', ...}, optional
aggfunc is identical to SAP HANA aggregate functions.
Defaults to 'avg'.
- Returns:
- DataFrame
A pivoted DataFrame.
Examples
df is a SAP HANA DataFrame.
>>> df.pivot_table(values='C2', index='C1', columns='C3', aggfunc='max')
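In this sketch, the result should contain one row per distinct C1 value and one column per distinct C3 value, each cell holding the maximum C2 within that group (the exact output depends on the data).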
- generate_table_type()
Generates a SAP HANA table type based on the dtypes function of the DataFrame. This is a convenience method for SQL tracing.
- Parameters:
- None
- Returns:
- str
The table type in string form.
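Examples
A minimal sketch (assuming df is the Iris DataFrame used above; the exact string depends on the DataFrame's dtypes):
>>> df.generate_table_type()
The result is a table-type string of a form such as 'TABLE ("ID" INT, "SEPALLENGTHCM" DOUBLE, ...)'.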
- rearrange(key=None, features=None, label=None, type_ts=False, for_predict=False)
Utility function to generate a new dataframe with columns ordered as [key, features, label] for a non-time-series dataset and as [key, label, features] for a time-series dataset.
- Parameters:
- keystr, optional
Name of the ID column.
If key is not provided, then:
if data is indexed by a single column, then key defaults to that index column;
otherwise, it is assumed that data contains no ID column.
- featureslist of str, optional
Names of the feature columns.
If features is not provided, it defaults to all non-ID, non-label columns.
- labelstr, optional
Name of the dependent variable.
Defaults to the name of the last non-ID column if type_ts is False, and to the name of the first non-ID column if type_ts is True.
- type_tsbool, optional
Specifies whether or not the input DataFrame is time-series data.
Defaults to False.
- for_predictbool, optional
Specifies whether data is for prediction, in which case label should not be provided.
Defaults to False.
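Examples
A minimal sketch with hypothetical column names (assuming df has columns ID, F1, F2 and LABEL):
>>> df.rearrange(key='ID', label='LABEL')
>>> df.rearrange(key='ID', label='LABEL', type_ts=True)
Per the rules above, the first call should order the columns as [ID, F1, F2, LABEL], the second as [ID, LABEL, F1, F2].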
- set_source_table(table, schema=None)
Specifies the source table for the current dataframe.
- Parameters:
- tablestr
The table name.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
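Examples
A minimal sketch (hypothetical table and schema names):
>>> df.set_source_table('MY_TABLE', schema='MY_SCHEMA')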
- to_pickle(path, compression='infer', protocol=4)
Pickle object to file.
- Parameters:
- pathstr
File path where the pickled object will be stored.
- compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}
A string representing the compression to use in the output file. By default, the compression is inferred from the file extension in the specified path.
Defaults to 'infer'.
- protocolint
Indicates which protocol the pickler should use. Defaults to 4.
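Examples
A minimal sketch (hypothetical file path; gzip compression is inferred from the '.gz' suffix):
>>> df.to_pickle('iris_df.pkl.gz')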
- to_datetime(cols)
Converts target columns to the specified date format. Default format is "YYYY-MM-DD HH24:MI:SS".
- Parameters:
- colsstr, list of str or dict
If cols is str or list of str, the default format will be used. Otherwise, the specified format should be provided in dict. E.g. cols={"DATETIME": "MM/DD/YYYY HH24:MI:SS"}
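Examples
A minimal sketch (assuming a hypothetical string column PURCHASE_DATE whose values match the given format):
>>> df.to_datetime({'PURCHASE_DATE': 'MM/DD/YYYY HH24:MI:SS'})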
- generate_feature(targets, group_by=None, agg_func=None, trans_func=None, order_by=None, trans_param=None, rolling_window=None, second_targets=None)
Adds additional features to the existing dataframe using agg_func and trans_func.
- Parameters:
- targetsstr or list of str
The column(s) in data to be feature engineered.
- group_bystr, optional
The column in data for group by when performing agg_func.
- agg_funcstr, optional
HANA aggregation operations, e.g. SUM, COUNT, MIN, MAX, ...
- trans_funcstr, optional
HANA transformation operations. MONTH, YEAR, LAG, ...
A special transformation is GEOHASH_HIERARCHY. This creates features based on a GeoHash. The default length of 20 for the hash can be influenced by the respective trans parameters. Providing, for example, range(3, 11), the operation adds 8 features with GeoHash lengths between 3 and 10.
- order_bystr, optional
The LEAD and LAG functions require an OVER (ORDER BY ...) window specification, supplied via order_by.
- trans_paramlist, optional
Parameters for transformation operations corresponding to targets.
- rolling_windowint, optional
Window size for rolling function. If negative, it will use the points before CURRENT ROW.
- second_targetsstr or list of str, optional
The second column(s) in data to be feature engineered together with targets by binary aggregation functions such as CORR.
- Returns:
- DataFrame
SAP HANA DataFrame with new features.
Examples
>>> df.head(5).collect() TIME TEMPERATURE HUMIDITY OXYGEN CO2 0 2021-01-01 12:00:00 19.972199 29.271170 23.154523 504.806395 1 2021-01-01 12:00:10 19.910014 27.931855 23.009835 507.515937 2 2021-01-01 12:00:20 19.834676 26.051309 22.756407 510.111974 3 2021-01-01 12:00:30 19.952517 26.007655 22.737376 516.993696 4 2021-01-01 12:00:40 20.163497 26.056979 22.469276 528.337481 >>> df.generate_feature(targets=["TEMPERATURE", "HUMIDITY", "OXYGEN", "CO2"], trans_func="LAG", order_by="TIME", trans_param=[range(1, 7), range(1, 5), range(1, 5), range(1,7)]).dropna().deselect("TIME").head(2).collect() TEMPERATURE HUMIDITY OXYGEN CO2 LAG(TEMPERATURE, 1) ... LAG(CO2, 4) 0 20.978001 26.187823 21.982030 522.731895 20.701740 ... 510.111974 1 21.234148 25.703989 21.804864 528.066402 20.978001 ... 516.993696
- mutate(**kwargs)
Adds new columns or mutates existing columns in the dataframe using column expressions.
Examples
>>> data.mutate(new_col=data.ID+2, data=data.data*3)
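In this example, a new column new_col is created as ID + 2, while the existing column data is overwritten with data * 3 (column names taken from the hypothetical data DataFrame).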
- hana_ml.dataframe.read_pickle(connection_context, path, table_name, compression='infer', schema=None, force=False, replace=True, object_type_as_bin=False, table_structure=None, **kwargs)
Loads a pickled DataFrame object from file.
- Parameters:
- connection_contextConnectionContext
A connection to a SAP HANA database instance.
- pathstr
File path where the pickled object will be loaded.
- table_namestr
The table name in the SAP HANA database.
- compression{'infer', 'gzip', 'bz2', 'zip', 'xz', None}, optional
For on-the-fly decompression of on-disk data.
If 'infer', then use gzip, bz2, xz or zip if path ends in '.gz', '.bz2', '.xz', or '.zip' respectively, and no decompression otherwise.
Set to None for no decompression.
Defaults to 'infer'.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is dropped.
Defaults to False.
- replacebool, optional
If replace is True, missing values are handled when the data is written to the SAP HANA table.
Defaults to True.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict, optional
Manually define column types based on SAP HANA DB table structure.
Defaults to None.
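Examples
A minimal round-trip sketch complementing DataFrame.to_pickle above (hypothetical path and table name; cc is a ConnectionContext):
>>> df = read_pickle(cc, 'iris_df.pkl.gz', 'IRIS_PICKLE_TBL', force=True)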
- hana_ml.dataframe.create_dataframe_from_pandas(connection_context, pandas_df, table_name, schema=None, force=False, replace=False, object_type_as_bin=False, table_structure=None, drop_exist_tab=True, allow_bigint=False, geo_cols: list = None, srid: int = 4326, primary_key: str = None, not_nulls: list = None, chunk_size=50000, disable_progressbar=False, upsert=False, append=False)
Uploads data from a Pandas DataFrame to a SAP HANA database and returns an SAP HANA DataFrame.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- pandas_dfpandas.DataFrame
A Pandas DataFrame for uploading to the SAP HANA database. This can also be a GeoPandas dataframe, which will automatically be converted to a Pandas DataFrame.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is truncated or dropped.
Defaults to False.
- replacebool, optional
If replace is True, NULL will be replaced by 0.
Defaults to False.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict
Manually define column types based on SAP HANA DB table structure.
- drop_exist_tabbool, optional
If force is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.
Defaults to True.
- allow_bigintbool, optional
allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.
Defaults to False.
- geo_colslist, optional but required for spatial functions
Specifies the columns of the dataframe that are treated as geometries. List elements can be either strings or tuples.
Strings represent columns which contain geometries in (E)WKT format. If the provided DataFrame is a GeoPandas DataFrame, you do not need to add the geometry column to the geo_cols. It will be detected and added automatically.
The column name in the HANA Table will be <column_name>_GEO
Tuples must consist of two strings: (<longitude column>, <latitude column>)
longitude column: Dataframe column, that contains the longitude values
latitude column: Dataframe column, that contains the latitude values
They are combined into a POINT(<longitude> <latitude>) geometry.
The column name in the HANA Table will be <longitude>_<latitude>_GEO
Defaults to None.
- sridint, optional but required for spatial functions
Spatial reference system id. If the SRS is not created yet, the system tries to create it automatically.
Defaults to 4326.
- primary_keystr, optional but required for Graph functions
Name of the column in a node table which contains the unique identification of the node and corresponds with the edge table.
Defaults to None.
- not_nullslist, optional but required for Graph functions
Contains all column names which should get a not null constraint. This is primarily for creating node and edge tables for a graph workspace.
Defaults to None.
- chunk_sizeint, optional
Specify the chunk size for upload.
Defaults to 50000.
- disable_progressbarbool, optional
Disable the progress bar.
Defaults to False.
- upsertbool, optional
Enable upsert with primary key if True.
Defaults to False.
- appendbool, optional
Enable append if True.
Defaults to False.
- Returns:
- DataFrame
A SAP HANA DataFrame that contains the data in the pandas_df.
Examples
>>> create_dataframe_from_pandas(connection_context,p_df,'test',force=False,replace=True,drop_exist_tab=False) <hana_ml.dataframe.DataFrame at 0x7efbcb26fbe0>
>>> create_dataframe_from_pandas( connection_context, p_df, "geo_table", force=False, replace=True, drop_exist_tab=False, geo_cols=["geometry", ("long", "lat")], ) <hana_ml.dataframe.DataFrame at 0x5eabcb27fbe0>
- hana_ml.dataframe.create_dataframe_from_spark(connection_context, spark_df, table_name, schema=None, force=False, object_type_as_bin=False, table_structure=None, drop_exist_tab=True, allow_bigint=False, primary_key: str = None, not_nulls=None, chunk_size=50000, disable_progressbar=False, upsert=False, append=False)
Uploads data from a Spark DataFrame to a SAP HANA database and returns an SAP HANA DataFrame.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- spark_dfpyspark.sql.DataFrame
A Spark DataFrame for uploading to the SAP HANA database.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- forcebool, optional
If force is True, then the SAP HANA table with table_name is truncated or dropped.
Defaults to False.
- object_type_as_binbool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- table_structuredict
Manually define column types based on SAP HANA DB table structure.
- drop_exist_tabbool, optional
If force is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.
Defaults to True.
- allow_bigintbool, optional
allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.
Defaults to False.
- primary_keystr, optional
Name of the column in a node table which contains the unique identification of the node and corresponds with the edge table.
Defaults to None.
- not_nullslist, optional
Contains all column names which should get a not null constraint.
Defaults to None.
- chunk_sizeint, optional
Specify the chunk size for upload.
Defaults to 50000.
- disable_progressbarbool, optional
Disable the progress bar.
Defaults to False.
- upsertbool, optional
Enable upsert with primary key if True.
Defaults to False.
- appendbool, optional
Enable append if True.
Defaults to False.
- Returns:
- DataFrame
A SAP HANA DataFrame that contains the data in the Spark DataFrame.
Examples
>>> create_dataframe_from_spark(connection_context, spark_df, 'test', force=False, drop_exist_tab=False)
- hana_ml.dataframe.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name=None)
Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.
- Parameters:
- frameDataFrame
A SAP HANA DataFrame.
- id_varsstr, tuple or list, optional
Column(s) to use as identifier variables.
Defaults to None.
- value_varstuple or list, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_namescalar, optional
Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.
- value_namescalar, optional
Name to use for the 'value' column. Defaults to 'value'.
- Returns:
- DataFrame
Unpivoted DataFrame.
Examples
>>> data.collect() A B C 0 a 1 2 1 b 3 4 2 c 5 6
>>> melt(data, id_vars=['A'], value_vars=['B', 'C']).collect() A variable value 0 a B 1 1 a C 2 2 b B 3 3 b C 4 4 c B 5 5 c C 6
- hana_ml.dataframe.create_dataframe_from_shapefile(connection_context: ConnectionContext, shp_file: str, srid: int, table_name: str, schema: str = None) DataFrame
Converts a shapefile into a DataFrame so that it is backed in SAP HANA. Expects that the shapefile is a zip and/or has both .shp and .dbf parts with the same base name, from which the single table is created. The table is first created temporarily and locally; the shapefile is then imported into the target table through direct insertion, and the resulting SAP HANA DataFrame, which can be visualized, is returned.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- shp_filestr
Path to a zipfile, shapefile or dbf-file. The filename suffix is dropped, and both the .dbf and .shp parts are loaded by that base name. Allowed suffixes are: .zip, .shp, .shx, and .dbf.
- sridint
The spatial reference system id that applies to the geometry columns. If the SRS is not created yet, the system tries to create it automatically.
- table_namestr
The table name in the SAP HANA database.
- schemastr, optional, keyword-only
The schema name. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- Returns:
- DataFrame
A SAP HANA DataFrame with geometric columns that contains the data from the shp_file.
Examples
>>> cc = connection_context >>> shapefile_path = os.path.join(os.getcwd(), 'myshape.shp') >>> hana_df = create_dataframe_from_shapefile( ... connection_context=cc, ... shp_file=shapefile_path, ... srid=4326, ... table_name="myshape_tbl")
- hana_ml.dataframe.import_csv_from(connection_context, directory, table, schema=None, threads=1, record_delimiter=None, field_delimiter=None, escape_character=None, column_list_first_row=None, credential=None)
Imports a csv file into the SAP HANA system. More details are given in the <storage_path> section of the IMPORT FROM Statement (Data Import Export) chapter of the SAP HANA Cloud, SAP HANA Database SQL Reference Guide.
- Parameters:
- connection_contextConnectionContext
A connection to the SAP HANA database instance.
- directorystr
Specifies the cloud storage location for the import. The locations SAP HANA Cloud supports are Azure, Amazon (AWS), Google Cloud, and SAP HANA Cloud, Data Lake Files (HDLFS).
- tablestr
Specifies the name of the target table.
- schemastr, optional
Specifies the schema name of the target table.
Defaults to None.
- threadsint, optional
Specifies the number of threads that can be used for concurrent import. threads and batch provide high loading performance by enabling parallel loading and by committing many records at once. In general, for column tables, a good setting is 10 parallel loading threads with a commit frequency of 10,000 records or greater.
Defaults to 1; the maximum allowed value is 256.
- record_delimiterstr, optional
Specifies the record delimiters used in the CSV file.
Defaults to None.
- field_delimiterstr, optional
Specifies the field delimiters used in the CSV file.
Defaults to None.
- escape_characterstr, optional
Specifies the escape character used in the import data.
Defaults to None.
- column_list_first_rowbool, optional
Indicates that the column list is stored in the first row.
Defaults to None.
- credentialstr, optional
Specifies the name of the credential defined in the CREATE CREDENTIAL statement. Since the credentials are defined within the credential object, they no longer appear as plain text as part of import statements. The WITH CREDENTIAL clause cannot be specified when directory contains credentials. The WITH CREDENTIAL clause is required for imports to SAP HANA Cloud, Data Lake Files, but is optional for all other cloud platforms.
Defaults to None.
Examples
Assume we have a connection to a SAP HANA instance called conn and we want to import test.csv into a table called Test:
>>> import_csv_from(connection_context=conn, directory='hdlfs://XXXXXXXX.com/test.csv', table="Test", threads=10, column_list_first_row=True, credential='XXXCredential')