hana_ml.dataframe¶
This module represents a database query as a dataframe. Most operations are designed to not bring data back from the database unless explicitly asked for.
The following classes and functions are available:
-
class
hana_ml.dataframe.
ConnectionContext
(address='', port=0, user='', password='', autocommit=True, packetsize=None, userkey=None, **properties)¶ Bases:
object
Represents a connection to a HANA system.
ConnectionContext includes methods for creating DataFrames from data on HANA. DataFrames are tied to a ConnectionContext, and are unusable once their ConnectionContext is closed.
Parameters: - Same as hdbcli.dbapi.connect. (See notes.)
Notes
The online docs for hdbcli.dbapi.connect are available at https://help.sap.com/viewer/0eec0d68141541d1b07893a39944924e/latest/en-US/ee592e89dcce4480a99571a4ae7a702f.html.
Examples
>>> with ConnectionContext('address', port, 'user', 'password') as cc: ... df = (cc.table('MY_TABLE', schema='MY_SCHEMA') ... .filter('COL3 > 5') ... .select('COL1', 'COL2')) ... pandas_df = df.collect()
Methods
close
()Closes the existing connection. sql
(sql)Returns a DataFrame representing a query. table
(table, *[, schema])Returns a DataFrame representing the specified table. -
close
()¶ Closes the existing connection.
Parameters: - None
Returns: - None
-
sql
(sql)¶ Returns a DataFrame representing a query.
Parameters: - sql : str
SQL query.
Returns: - DataFrame
DataFrame for the query.
Examples
>>> df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
-
table
(table, *, schema=None)¶ Returns a DataFrame representing the specified table.
Parameters: - table : str
Table name.
- schema : str, optional, keyword-only
Schema name. If not provided or set to None, defaults to the ConnectionContext’s current schema.
Returns: - DataFrame
DataFrame selecting data from that table.
Examples
>>> df1 = cc.table('MY_TABLE') >>> df2 = cc.table('MY_OTHER_TABLE', schema='MY_SCHEMA')
-
class
hana_ml.dataframe.
DataFrame
(connection_context, select_statement, _name=None)¶ Bases:
object
This class represents a frame that is backed by a database sql statement.
Parameters: - connection_context : ConnectionContext
Connection to the HANA system.
- select_statement : str
SQL query backing the dataframe.
Notes
Parameters beyond connection_context and select_statement are intended for internal usage. Do not rely on them; they may change without notice.
Attributes: columns
Lists the current DataFrame’s column names.
name
Returns the name of the DataFrame.
quoted_name
The escaped name of the original Dataframe.
Methods
agg
(agg_list[, group_by])Returns a DataFrame with the group_by column along with the aggregates. alias
(alias)Returns a new DataFrame with an alias set. bin
(col[, strategy, bins, bin_width, bin_column])Returns a DataFrame with original columns along with bin assignments. cast
(cols, new_type)Returns a DataFrame with columns cast to a new type. collect
()Copies the current DataFrame to a new Pandas DataFrame. corr
(first_col, second_col)Returns a DataFrame that gives the correlation coefficient between two numeric columns. count
()Computes the number of rows in the DataFrame. describe
([cols])Returns a DataFrame containing various statistics for the requested column(s). distinct
([cols])Returns a new DataFrame with distinct values for the specified columns. drop
(cols)Returns a new DataFrame without specified columns. drop_duplicates
([subset])Returns a new DataFrame with duplicate rows removed. dropna
([how, thresh, subset])Returns a new DataFrame with NULLs removed. dtypes
([subset])Returns a sequence of tuples describing the DataFrame’s SQL types. empty
()Returns True if this DataFrame has 0 rows. fillna
(value[, subset])Returns a DataFrame with NULLs replaced with a specified value. filter
(condition)Selects rows matching the given condition. has
(col)Returns True if a column is in the DataFrame. hasna
([cols])Returns True if a DataFrame contains NULLs. head
([n])Returns a DataFrame of the first n rows in the current DataFrame. is_numeric
([cols])Returns True if the column(s) in the DataFrame are numeric. join
(other, condition[, how, select])Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame. rename_columns
(names)Returns a DataFrame with renamed columns. save
(where[, table_type])Creates a table or view holding the current DataFrame’s data. select
(*cols)Returns a new DataFrame with columns derived from the current DataFrame. sort
(cols[, desc])Returns a new DataFrame sorted by the specified columns. to_head
(col)Returns a DataFrame with specified column as the first item in projection. union
(other[, all])Combine this DataFrame’s rows and another DataFrame’s rows into one DataFrame. -
agg
(agg_list, group_by=None)¶ Returns a DataFrame with the group_by column along with the aggregates.
The name of the column in the returned DataFrame is the same as the original column.
Parameters: - agg_list : list of tuple
A list of tuples. Each tuple is a triplet. The triplet consists of (aggregate_operator, expression, name) where:
aggregate_operator is one of [‘max’, ‘min’, ‘count’, ‘avg’]
expression is a str that is a column or column expression
name is the name of this aggregate in the project list.
- group_by : str or list of str, optional
Group by column. Note that only a column is allowed although expressions are allowed in SQL. To group by an expression, a dataframe would have to be created by providing the entire SQL. So, if you have a table T with columns C1, C2, and C3 that are all integers, to calculate the max(C1) grouped by (C2+C3) a DataFrame would need to be created as below:
cc.sql(‘SELECT “C2”+”C3”, max(“C1”) FROM “T” GROUP BY “C2”+”C3”’)
Returns: - DataFrame
DataFrame containing the group_by column if it exists along with the aggregate expressions that are aliased with the specified names.
Examples
Input:
>>> df.collect() ID SEPALLENGTHCM SEPALWIDTHCM PETALLENGTHCM PETALWIDTHCM SPECIES 0 1 5.1 3.5 1.4 0.2 Iris-setosa 1 2 4.9 3.0 1.4 0.2 Iris-setosa 2 3 4.7 3.2 1.3 0.2 Iris-setosa 3 51 7.0 3.2 4.7 1.4 Iris-versicolor 4 52 6.4 3.2 4.5 1.5 Iris-versicolor 5 101 6.3 3.3 6.0 2.5 Iris-virginica 6 102 5.8 2.7 5.1 1.9 Iris-virginica 7 103 7.1 3.0 5.9 2.1 Iris-virginica 8 104 6.3 2.9 5.6 1.8 Iris-virginica
Another way to do a count.
>>> df.agg([('count', 'SPECIES', 'COUNT')]).collect() COUNT 0 9
Get counts by SPECIES
>>> df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES').collect() SPECIES COUNT 0 Iris-versicolor 2 1 Iris-virginica 4 2 Iris-setosa 3
Get max values of SEPALLENGTHCM by SPECIES
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH')], group_by='SPECIES').collect() SPECIES MAX_SEPAL_LENGTH 0 Iris-versicolor 7.0 1 Iris-virginica 7.1 2 Iris-setosa 5.1
Get max and min values of SEPALLENGTHCM by SPECIES
>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH'), ('min', 'SEPALLENGTHCM', 'MIN_SEPAL_LENGTH')], group_by=['SPECIES']).collect() SPECIES MAX_SEPAL_LENGTH MIN_SEPAL_LENGTH 0 Iris-versicolor 7.0 6.4 1 Iris-virginica 7.1 5.8 2 Iris-setosa 5.1 4.7
Get aggregate grouping by multiple columns.
>>> df.agg([('count', 'SEPALLENGTHCM', 'COUNT_SEPAL_LENGTH')], group_by=['SPECIES', 'PETALLENGTHCM']).collect() SPECIES PETALLENGTHCM COUNT_SEPAL_LENGTH 0 Iris-virginica 6.0 1 1 Iris-setosa 1.3 1 2 Iris-virginica 5.9 1 3 Iris-virginica 5.6 1 4 Iris-setosa 1.4 2 5 Iris-versicolor 4.7 1 6 Iris-versicolor 4.5 1 7 Iris-virginica 5.1 1
-
alias
(alias)¶ Returns a new DataFrame with an alias set.
Parameters: - alias : str
Name of the DataFrame.
Returns: - aliased_df : DataFrame
DataFrame with an alias set.
See also
DataFrame.rename_columns
- For renaming individual columns.
-
bin
(col, strategy='uniform_number', bins=None, bin_width=None, bin_column='BIN_NUMBER')¶ Returns a DataFrame with original columns along with bin assignments.
The name of the columns in the returned DataFrame is the same as the original column. Column “BIN_NUMBER” or the specified value in bin_column is added and corresponds to the bin assigned.
Parameters: - col : str
The column on which binning is performed. The column must be numeric.
- strategy : {‘uniform_number’, ‘uniform_size’}
- Binning methods:
- ‘uniform_number’: Equal widths based on the number of bins.
- ‘uniform_size’: Equal widths based on the bin size.
- bins : int, optional
The number of equal width bins. Only valid when strategy is ‘uniform_number’. Defaults to 10.
- bin_width : int, optional
The interval width of each bin. Only valid when strategy is ‘uniform_size’.
- bin_column : str, optional
Name of the output column containing the bin number.
Returns: - DataFrame
Binned dataset with the same data as this DataFrame along with an additional column “BIN_NUMBER” or the value specified in bin_column. This additional column contains the assigned bin for each row.
Examples
Input:
>>> df.collect() C1 C2 C3 C4 0 1 1.2 2.0 1.0 1 2 1.4 4.0 3.0 2 3 1.6 6.0 9.0 3 4 1.8 8.0 27.0 4 5 2.0 10.0 81.0 5 6 2.2 12.0 243.0 6 7 2.4 14.0 729.0 7 8 2.6 16.0 2187.0 8 9 2.8 18.0 6561.0 9 10 3.0 20.0 19683.0
Create 5 bins of equal widths on C1
>>> df.bin('C1', strategy='uniform_number', bins=5).collect() C1 C2 C3 C4 BIN_NUMBER 0 1 1.2 2.0 1.0 1 1 2 1.4 4.0 3.0 1 2 3 1.6 6.0 9.0 2 3 4 1.8 8.0 27.0 2 4 5 2.0 10.0 81.0 3 5 6 2.2 12.0 243.0 3 6 7 2.4 14.0 729.0 4 7 8 2.6 16.0 2187.0 4 8 9 2.8 18.0 6561.0 5 9 10 3.0 20.0 19683.0 5
Create 5 bins of equal widths on C2
>>> df.bin('C3', strategy='uniform_number', bins=5).collect() C1 C2 C3 C4 BIN_NUMBER 0 1 1.2 2.0 1.0 1 1 2 1.4 4.0 3.0 1 2 3 1.6 6.0 9.0 2 3 4 1.8 8.0 27.0 2 4 5 2.0 10.0 81.0 3 5 6 2.2 12.0 243.0 3 6 7 2.4 14.0 729.0 4 7 8 2.6 16.0 2187.0 4 8 9 2.8 18.0 6561.0 5 9 10 3.0 20.0 19683.0 5
Create 5 bins of equal widths on a column that varies significantly
>>> df.bin('C4', strategy='uniform_number', bins=5).collect() C1 C2 C3 C4 BIN_NUMBER 0 1 1.2 2.0 1.0 1 1 2 1.4 4.0 3.0 1 2 3 1.6 6.0 9.0 1 3 4 1.8 8.0 27.0 1 4 5 2.0 10.0 81.0 1 5 6 2.2 12.0 243.0 1 6 7 2.4 14.0 729.0 1 7 8 2.6 16.0 2187.0 1 8 9 2.8 18.0 6561.0 2 9 10 3.0 20.0 19683.0 5
Create bins of equal width
>>> df.bin('C1', strategy='uniform_size', bin_width=3).collect() C1 C2 C3 C4 BIN_NUMBER 0 1 1.2 2.0 1.0 1 1 2 1.4 4.0 3.0 1 2 3 1.6 6.0 9.0 2 3 4 1.8 8.0 27.0 2 4 5 2.0 10.0 81.0 2 5 6 2.2 12.0 243.0 3 6 7 2.4 14.0 729.0 3 7 8 2.6 16.0 2187.0 3 8 9 2.8 18.0 6561.0 4 9 10 3.0 20.0 19683.0 4
-
cast
(cols, new_type)¶ Returns a DataFrame with columns cast to a new type.
- The name of the column in the returned DataFrame is the same as the original column.
Warning
Type compatibility between the existing column type and the new type is not checked. An incompatibility will result in an error.
Parameters: - cols : str or list
The column(s) to be cast to a different type.
- new_type : str
The database datatype to cast the column(s) to. No checks are performed to see if the new type is valid. An invalid type can lead to SQL errors or even SQL injection vulnerabilities.
Returns: - DataFrame
The same data as this DataFrame but with columns cast to the specified type.
Examples
Input:
>>> df1 = cc.sql('SELECT "AGE", "PDAYS", "HOUSING" FROM DBM_TRAINING_TBL') >>> df1.dtypes() [('AGE', 'INT', 10), ('PDAYS', 'INT', 10), ('HOUSING', 'VARCHAR', 100)]
Casting a column to NVARCHAR(20):
>>> df2 = df1.cast('AGE', 'NVARCHAR(20)') >>> df2.dtypes() [('AGE', 'NVARCHAR', 20), ('PDAYS', 'INT', 10), ('HOUSING', 'VARCHAR', 100)]
Casting a list of columns to NVARCHAR(50):
>>> df3 = df1.cast(['AGE', 'PDAYS'], 'NVARCHAR(50)') >>> df3.dtypes() [('AGE', 'NVARCHAR', 50), ('PDAYS', 'NVARCHAR', 50), ('HOUSING', 'VARCHAR', 100)]
-
collect
()¶ Copies the current DataFrame to a new Pandas DataFrame.
Parameters: - none
Returns: - pandas_df : pandas.DataFrame
Pandas DataFrame containing the current DataFrame’s data.
Examples
Viewing a hana_ml DataFrame doesn’t reveal much on its own, because it doesn’t execute the underlying SQL or fetch data:
>>> df = cc.table('T') >>> df <hana_ml.dataframe.DataFrame object at 0x7f2b7f24ddd8>
Using collect() will execute the SQL and fetch the results into a Pandas DataFrame:
>>> df.collect() A B 0 1 3 1 2 4 >>> type(df.collect()) <class 'pandas.core.frame.DataFrame'>
-
columns
¶ Lists the current DataFrame’s column names. Computed lazily and cached.
Each access to this property creates a new copy; mutating the list will not alter or corrupt the DataFrame.
-
corr
(first_col, second_col)¶ Returns a DataFrame that gives the correlation coefficient between two numeric columns.
All rows with NULL values for first_col or second_col are removed prior to calculating the correlation coefficient.
- The correlation coefficient is:
- 1/(n-1) * sum((col1_value - avg(col1)) * (col2 - avg(col2))) / (stddev(col1) * stddev(col2))
Parameters: - first_col : str
The first column for calculating the correlation coefficient.
- first_col : str
The second column for calculating the correlation coefficient.
Returns: - DataFrame
A DataFrame with one value containing the correlation coefficient. The name of the column is CORR_COEFF.
Examples
Input:
>>> df.columns ['C1', 'C2', 'C3', 'C4'] >>> df.collect() C1 C2 C3 C4 0 1 1.2 2.0 1.0 1 2 1.4 4.0 3.0 2 3 1.6 8.0 9.0 3 4 1.8 16.0 27.0 4 5 2.0 32.0 81.0 5 6 2.2 64.0 243.0 6 7 2.4 128.0 729.0 7 8 2.6 256.0 2187.0 8 9 2.8 512.0 6561.0 9 10 3.0 1024.0 19683.0
Correlation with columns that are well correlated. >>> df.corr(‘C1’, ‘C2’).collect()
CORR_COEFF0 1.0 >>> df.corr(‘C1’, ‘C3’).collect()
CORR_COEFF0 1.0
Correlation with a column whose value is 3 times its previous value >>> df.corr(‘C1’, ‘C4’).collect()
CORR_COEFF0 0.696325
-
count
()¶ Computes the number of rows in the DataFrame.
Parameters: - none
Returns: - count : int
Number of rows in the DataFrame.
-
describe
(cols=None)¶ Returns a DataFrame containing various statistics for the requested column(s).
Parameters: - cols : str or list
The column(s) to be described. Defaults to all columns.
Returns: - DataFrame
A DataFrame containing statistics for the specified column(s) in the current DataFrame.
The statistics included are the count of rows (“count”), number of distinct values (“unique”), number of nulls (“nulls”), average (“mean”), standard deviation(“std”) median (“median”), minimum value (“min”), maximum value (“max”), 25% percentile when treated as continuous variable (“25_percent_cont”), 25% percentile when treated as discrete variable (“25_percent_disc”), 50% percentile when treated as continuous variable (“50_percent_cont”), 50% percentile when treated as discrete variable (“50_percent_disc”), 75% percentile when treated as continuous variable (“75_percent_cont”), 75% percentile when treated as discrete variable (“75_percent_disc”).
For columns that are strings, statistics such as average (“mean”), standard deviation (“std”), median (“median”), and the various percentiles are NULLs.
If the list of columns contain both string and numeric data types, minimum and maximum values become NULLs.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING") >>> import pprint >>> pprint.pprint(df2.columns) ['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
Describe a few numeric columns and collect to return a Pandas DataFrame:
>>> df1.describe(['AGE', 'PDAYS']).collect() column count unique nulls mean std min max median 0 AGE 16895 78 0 40.051376 10.716907 17 98 38 1 PDAYS 16895 24 0 944.406688 226.331944 0 999 999 25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc 0 32.0 32 38.0 38 1 999.0 999 999.0 999 75_percent_cont 75_percent_disc 0 47.0 47 1 999.0 999
Describe a few non-numeric columns and collect to return a Pandas DataFrame:
>>> df1.describe(['JOB', 'MARITAL']).collect() column count unique nulls mean std min max median 0 JOB 16895 12 0 None None admin. unknown None 1 MARITAL 16895 4 0 None None divorced unknown None 25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc 0 None None None None 1 None None None None 75_percent_cont 75_percent_disc 0 None None 1 None None
Describe all columns in a DataFrame:
>>> df1.describe().collect() column count unique nulls mean std 0 ID 16895 16895 0 21282.286652 12209.759725 1 AGE 16895 78 0 40.051376 10.716907 2 DURATION 16895 1267 0 263.965670 264.331384 3 CAMPAIGN 16895 35 0 2.344658 2.428449 4 PDAYS 16895 24 0 944.406688 226.331944 5 PREVIOUS 16895 7 0 0.209529 0.539450 6 EMP_VAR_RATE 16895 10 0 -0.038798 1.621945 7 CONS_PRICE_IDX 16895 26 0 93.538844 0.579189 8 CONS_CONF_IDX 16895 26 0 -40.334123 4.865720 9 EURIBOR3M 16895 283 0 3.499297 1.777986 10 NREMPLOYED 16895 11 0 5160.371885 75.320580 11 JOB 16895 12 0 NaN NaN 12 MARITAL 16895 4 0 NaN NaN 13 EDUCATION 16895 8 0 NaN NaN 14 DBM_DEFAULT 16895 2 0 NaN NaN 15 HOUSING 16895 3 0 NaN NaN 16 LOAN 16895 3 0 NaN NaN 17 CONTACT 16895 2 0 NaN NaN 18 DBM_MONTH 16895 10 0 NaN NaN 19 DAY_OF_WEEK 16895 5 0 NaN NaN 20 POUTCOME 16895 3 0 NaN NaN 21 LABEL 16895 2 0 NaN NaN min max median 25_percent_cont 25_percent_disc 0 5.000 41187.000 21786.000 10583.500 10583.000 1 17.000 98.000 38.000 32.000 32.000 2 0.000 4918.000 184.000 107.000 107.000 3 1.000 43.000 2.000 1.000 1.000 4 0.000 999.000 999.000 999.000 999.000 5 0.000 6.000 0.000 0.000 0.000 6 -3.400 1.400 1.100 -1.800 -1.800 7 92.201 94.767 93.444 93.075 93.075 8 -50.800 -26.900 -41.800 -42.700 -42.700 9 0.634 5.045 4.856 1.313 1.313 10 4963.000 5228.000 5191.000 5099.000 5099.000 11 NaN NaN NaN NaN NaN 12 NaN NaN NaN NaN NaN 13 NaN NaN NaN NaN NaN 14 NaN NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN 16 NaN NaN NaN NaN NaN 17 NaN NaN NaN NaN NaN 18 NaN NaN NaN NaN NaN 19 NaN NaN NaN NaN NaN 20 NaN NaN NaN NaN NaN 21 NaN NaN NaN NaN NaN 50_percent_cont 50_percent_disc 75_percent_cont 75_percent_disc 0 21786.000 21786.000 32067.500 32068.000 1 38.000 38.000 47.000 47.000 2 184.000 184.000 324.000 324.000 3 2.000 2.000 3.000 3.000 4 999.000 999.000 999.000 999.000 5 0.000 0.000 0.000 0.000 6 1.100 1.100 1.400 1.400 7 93.444 93.444 93.994 93.994 8 -41.800 -41.800 -36.400 -36.400 9 4.856 4.856 4.961 4.961 10 5191.000 5191.000 5228.000 5228.000 11 NaN NaN NaN NaN 12 NaN NaN NaN NaN 13 NaN NaN NaN NaN 14 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN NaN NaN 17 NaN NaN NaN NaN 18 NaN NaN NaN NaN 19 NaN NaN NaN NaN 20 NaN NaN NaN NaN 21 NaN NaN NaN NaN
-
df_count
= 0¶
-
distinct
(cols=None)¶ Returns a new DataFrame with distinct values for the specified columns. If no columns are specified, the distinct row values from all columns are returned.
Parameters: - cols : str or list of str, optional
Column or list of columns to consider when getting distinct values. Defaults to use all columns.
Returns: - distinct_df : DataFrame
DataFrame with distinct values for cols.
Examples
Input:
>>> df.collect() A B C 0 1 A 100 1 1 A 101 2 1 A 102 3 1 B 100 4 1 B 101 5 1 B 102 6 1 B 103 7 2 A 100 8 2 A 100
Distinct values in a column:
>>> df.distinct("B").collect() B 0 A 1 B
Distinct values of a subset of columns:
>>> df.distinct(["A", "B"]).collect() A B 0 1 B 1 2 A 2 1 A
Distinct values of the entire data set:
>>> df.distinct().collect() A B C 0 1 A 102 1 1 B 103 2 1 A 101 3 2 A 100 4 1 B 101 5 1 A 100 6 1 B 100 7 1 B 102
-
drop
(cols)¶ Returns a new DataFrame without specified columns.
Parameters: - cols : list of str
List of column names to drop.
Returns: - filtered_df : DataFrame
New DataFrame retaining only columns not in cols.
Examples
>>> df.collect() A B 0 1 3 1 2 4 >>> df.drop(['B']).collect() A 0 1 1 2
-
drop_duplicates
(subset=None)¶ Returns a new DataFrame with duplicate rows removed. All columns in the DataFrame are returned. There is no way to keep specific duplicate rows.
Warning
Specifying a non-None value of subset may produce an “unstable” DataFrame, the contents of which may be different every time you look at it. Specifically, if two rows are duplicates in their subset columns and have different values in other columns, a different row could be picked every time you look at the result.
Parameters: - subset : list of str, optional
List of columns to consider when deciding whether rows are duplicates of each other. Defaults to use all columns.
Returns: - deduplicated_df : DataFrame
DataFrame with only one copy of duplicate rows.
Examples
Input:
>>> df.collect() A B C 0 1 A 100 1 1 A 101 2 1 A 102 3 1 B 100 4 1 B 101 5 1 B 102 6 1 B 103 7 2 A 100 8 2 A 100
Drop duplicates based on the values of a subset of columns:
>>> df.drop_duplicates(["A", "B"]).collect() A B C 0 1 A 100 1 1 B 100 2 2 A 100
Distinct values on the entire data set:
>>> df.drop_duplicates().collect() A B C 0 1 A 102 1 1 B 103 2 1 A 101 3 2 A 100 4 1 B 101 5 1 A 100 6 1 B 100 7 1 B 102
-
dropna
(how=None, thresh=None, subset=None)¶ Returns a new DataFrame with NULLs removed.
Parameters: - how : {‘any’, ‘all’}, optional
If provided, ‘any’ eliminates rows with any NULLs, and ‘all’ eliminates rows that are entirely NULLs. If neither how nor thresh are provided, how defaults to ‘any’.
- thresh : int, optional
If provided, values with less than the thresh non-NULL values are dropped. how and thresh cannot both be provided.
- subset : list of str, optional
Columns to consider when looking for NULLs. Values in other columns will be ignored, NULL or not. Defaults to all columns.
Returns: - filtered_df : DataFrame
New DataFrame with select statement that removes NULLs.
Examples
Dropping rows with any nulls:
>>> df.collect() A B C 0 1.0 3.0 5.0 1 2.0 4.0 NaN 2 3.0 NaN NaN 3 NaN NaN NaN >>> df.dropna().collect() A B C 0 1.0 3.0 5.0
Dropping rows that are entirely nulls:
>>> df.dropna(how='all').collect() A B C 0 1.0 3.0 5.0 1 2.0 4.0 NaN 2 3.0 NaN NaN
Dropping rows with less than 2 non-null values:
>>> df.dropna(thresh=2).collect() A B C 0 1.0 3.0 5.0 1 2.0 4.0 NaN
-
dtypes
(subset=None)¶ Returns a sequence of tuples describing the DataFrame’s SQL types.
The tuples list the name, SQL type name, and display size corresponding to the DataFrame’s columns.
Parameters: - subset : list of str, optional
The columns that the information will be generated from. Defaults to all columns.
Returns: - dtypes : list of tuples
Each tuple consists of the name, SQL type name, and display size for one of the DataFrame’s columns. The list will be in the order specified by the subset, or in the DataFrame’s original column order if a subset is not provided.
-
empty
()¶ Returns True if this DataFrame has 0 rows.
Parameters: - none
Returns: - empty : bool
True if the DataFrame is empty.
Notes
If a DataFrame contains only NULLs, it is not considered empty.
Examples
>>> df1.collect() Empty DataFrame Columns: [ACTUAL, PREDICT] Index: [] >>> df1.empty() True >>> df2.collect() ACTUAL PREDICT 0 None None >>> df2.empty() False
-
fillna
(value, subset=None)¶ Returns a DataFrame with NULLs replaced with a specified value.
Currently only supports filling numeric columns.
Parameters: - value : int or float
The value that will replace NULL. value should have a type appropriate for the selected columns.
- subset : list of str, optional
List of columns that will have their NULL values replaced. Defaults to all columns.
Returns: - filled_df : DataFrame
New DataFrame with the NULL values replaced.
-
filter
(condition)¶ Selects rows matching the given condition.
Very little checking is done on the condition string. Use only with trusted inputs.
Parameters: - condition : str
Filter condition. Format as SQL <condition>.
Returns: - filtered_df : DataFrame
DataFrame with rows matching the given condition.
Raises: - hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition. May have false positives and false negatives.
Examples
>>> df.collect() A B 0 1 3 1 2 4 >>> df.filter('B < 4').collect() A B 0 1 3
-
has
(col)¶ Returns True if a column is in the DataFrame.
Parameters: - col : str
Name of column to search in the projection list of this DataFrame.
- Returns
- ——
- has : boolean
Returns True if the column exists in the DataFrame’s projection list.
Examples
>>> df.columns ['A', 'B'] >>> df.has('A') True >>> df.has('C') False
-
hasna
(cols=None)¶ Returns True if a DataFrame contains NULLs.
Parameters: - cols : str or list of str, optional
Column or list of columns to be checked for NULL values. Defaults to all columns.
Returns: - hasna : bool
True if this DataFrame contains NULLs.
Examples
>>> df1.collect() ACTUAL PREDICT 0 1.0 None >>> df1.hasna() True
-
head
(n=1)¶ Returns a DataFrame of the first n rows in the current DataFrame.
Parameters: - n : int, optional
Number of rows returned. Default value: 1.
Returns: - sub_df : DataFrame
New DataFrame of the first n rows of the current DataFrame.
-
is_numeric
(cols=None)¶ Returns True if the column(s) in the DataFrame are numeric.
Parameters: - cols : str or list
The column(s) to be tested for being numeric. Defaults to all columns.
Returns: - bool
True if all the cols are numeric.
Examples
Input:
>>> df.head(5).collect() ID SEPALLENGTHCM SEPALWIDTHCM PETALLENGTHCM PETALWIDTHCM SPECIES 0 1 5.1 3.5 1.4 0.2 Iris-setosa 1 2 4.9 3.0 1.4 0.2 Iris-setosa 2 3 4.7 3.2 1.3 0.2 Iris-setosa 3 4 4.6 3.1 1.5 0.2 Iris-setosa 4 5 5.0 3.6 1.4 0.2 Iris-setosa >>> pprint.pprint(df.dtypes()) [('ID', 'INT', 10), ('SEPALLENGTHCM', 'DOUBLE', 15), ('SEPALWIDTHCM', 'DOUBLE', 15), ('PETALLENGTHCM', 'DOUBLE', 15), ('PETALWIDTHCM', 'DOUBLE', 15), ('SPECIES', 'NVARCHAR', 15)]
Test a single column
>>> df.is_numeric('ID') True >>> df.is_numeric('SEPALLENGTHCM') True >>> df.is_numeric(['SPECIES']) False
Test a list of columns
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM']) True >>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'SPECIES']) False
-
join
(other, condition, how='inner', select=None)¶ Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.
Parameters: - other : DataFrame
The DataFrame to join with.
- condition : str
Join predicate.
- how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional
Type of join. Defaults to ‘inner’.
- select : list, optional
If provided, each element specifies a column in the result. A string in the select list should be the name of a column in one of the input dataframes. An (expression, name) tuple creates a new column with the given name, computed from the given expression. If not provided, defaults to selecting all columns from both dataframes, with the left dataframe’s columns first.
Returns: - joined_df : DataFrame
A new DataFrame object made from the join of the current DataFrame with another DataFrame.
Raises: - hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in condition or in a column expression. May have false positives and false negatives.
Examples
The expression selection functionality can be used to disambiguate duplicate column names in a join:
>>> df1.collect() A B C 0 1 2 3.5 1 2 4 5.6 2 3 3 1.1 >>> df2.collect() A B D 0 2 1 14.0 1 3 4 5.6 2 4 3 0.0 >>> df1.alias('L').join(df2.alias('R'), 'L.A = R.A', select=[ ... ('L.A', 'A'), ... ('L.B', 'B_LEFT'), ... ('R.B', 'B_RIGHT'), ... 'C', ... 'D']).collect() A B_LEFT B_RIGHT C D 0 2 4 1 5.6 14.0 1 3 3 4 1.1 5.6
-
name
¶ Returns the name of the DataFrame. Does not correspond to any HANA table name.
This is useful for join predicates when the joining DataFrames have columns with the same name.
-
quoted_name
¶ The escaped name of the original Dataframe.
Default-generated DataFrame names are safe to use in SQL without escaping, but names set with DataFrame.alias may require escaping.
-
rename_columns
(names)¶ Returns a DataFrame with renamed columns.
Parameters: - names : list or dict
If a list, specifies new names for every column in this dataframe. If a dict, each dict entry maps an existing name to a new name, and not all columns need to be renamed.
Returns: - DataFrame
The same data as the original DataFrame, with new column names.
See also
DataFrame.alias
- For renaming the DataFrame itself.
Examples
>>> df.collect() A B 0 1 3 1 2 4 >>> df.rename_columns(['C', 'D']).collect() C D 0 1 3 1 2 4 >>> df.rename_columns({'B': 'D'}).collect() A D 0 1 3 1 2 4
-
save
(where, table_type=None)¶ Creates a table or view holding the current DataFrame’s data.
Parameters: - where : str or (str, str) tuple
Table name or (schema name, table name) tuple. If no schema is provided, the table or view will be created in the current schema.
- table_type : str, optional
Type of table to create. Case-insensitive.
- Permanent table options:
- “ROW”
- “COLUMN”
- “HISTORY COLUMN”
- Temporary table options:
- “GLOBAL TEMPORARY”
- “GLOBAL TEMPORARY COLUMN”
- “LOCAL TEMPORARY”
- “LOCAL TEMPORARY COLUMN”
- Not a table:
- “VIEW”
Defaults to ‘LOCAL TEMPORARY COLUMN’ if where starts with ‘#’. Otherwise, the default is ‘COLUMN’.
Returns: - DataFrame
DataFrame representing the new table or view.
Notes
For this operation to succeed, the table name must not be in use, the schema must exist, and the user must have permission to create tables (or views) in the target schema.
-
select
(*cols)¶ Returns a new DataFrame with columns derived from the current DataFrame.
Warning
There is no check that inputs interpreted as SQL expressions are actually valid expressions; an “expression” like “A FROM TAB; DROP TABLE IMPORTANT_THINGS; SELECT A” can cause a lot of damage.
Parameters: - cols : str or (str, str) tuple.
Columns of the new DataFrame. A string is treated as the name of a column to select; a (str, str) tuple is treated as (SQL expression, alias). As a special case, ‘*’ is expanded to all columns of the original DataFrame.
Returns: - projected_df : DataFrame
New DataFrame object with the specified columns.
Raises: - hana_ml.ml_exceptions.BadSQLError
If comments or malformed tokens are detected in a column expression. May have false positives and false negatives.
Examples
Input:
>>> df.collect() A B C 0 1 2 3
Selecting a subset of existing columns:
>>> df.select('A', 'B').collect() A B 0 1 2
Computing a new column based on existing columns:
>>> df.select('*', ('B * 4', 'D')).collect() A B C D 0 1 2 3 8
-
sort
(cols, desc=False)¶ Returns a new DataFrame sorted by the specified columns.
Parameters: - cols : str or list of str
Column or list of columns to sort by. If a list is specified the sort order in parameter desc is used and is the same for all columns.
- desc : boolean, optional
Set to True to sort in descending order. Defaults to False, for ascending order.
Returns: - sorted_df : DataFrame
New DataFrame object with rows sorted as specified.
-
to_head
(col)¶ Returns a DataFrame with specified column as the first item in projection.
Parameters: - col : str
The column to move to head.
Returns: - DataFrame
The same data as this DataFrame but with the named column in the first position.
Examples
Input:
>>> df1 = cc.table("DBM_TRAINING") >>> import pprint >>> pprint.pprint(df1.columns) ['ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED', 'LABEL']
Moving the column ‘LABEL’ to head:
>>> df2 = df1.to_head('LABEL') >>> pprint.pprint(df2.columns) ['LABEL', 'ID', 'AGE', 'JOB', 'MARITAL', 'EDUCATION', 'DBM_DEFAULT', 'HOUSING', 'LOAN', 'CONTACT', 'DBM_MONTH', 'DAY_OF_WEEK', 'DURATION', 'CAMPAIGN', 'PDAYS', 'PREVIOUS', 'POUTCOME', 'EMP_VAR_RATE', 'CONS_PRICE_IDX', 'CONS_CONF_IDX', 'EURIBOR3M', 'NREMPLOYED']
-
union
(other, all=True)¶ Combine this DataFrame’s rows and another DataFrame’s rows into one DataFrame. Equivalent to a SQL UNION ALL by default.
Parameters: - other : DataFrame
Right side of the union.
- all : bool, optional
If True, keep duplicate rows, equivalent to UNION ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the union), equivalent to a UNION or a UNION ALL followed by DISTINCT in SQL. Defaults to True.
Returns: - DataFrame
Combined data from self and other.
Examples
We have two DataFrames we want to union, with some duplicate rows:
>>> df1.collect() A B 0 1 2 1 1 2 2 2 3 >>> df2.collect() A B 0 2 3 1 3 4
union() will produce a DataFrame containing all rows of both df1 and df2, like a UNION ALL:
>>> df1.union(df2).collect() A B 0 1 2 1 1 2 2 2 3 3 2 3 4 3 4
To get the deduplication effect of a UNION DISTINCT, pass all=False or call distinct() after union():
>>> df1.union(df2, all=False).collect() A B 0 1 2 1 2 3 2 3 4 >>> df1.union(df2).distinct().collect() A B 0 1 2 1 2 3 2 3 4
-
hana_ml.dataframe.
quotename
(name)¶ Escape a schema, table, or column name for use in SQL.
hana_ml functions and methods that take schema, table, or column names already escape their input by default, but those that take SQL don’t (and can’t) perform escaping automatically.
Parameters: - name : str
Schema, table, or column name.
Returns: - str
Escaped name. The string is surrounded in quotation marks, and existing quotation marks are escaped by doubling them.