hana_ml.dataframe

This module represents a database query as a dataframe. Most operations are designed not to bring data back from the database unless explicitly requested.

The following classes and functions are available:

class hana_ml.dataframe.ConnectionContext(address='', port=0, user='', password='', autocommit=True, packetsize=None, userkey=None, **properties)

Bases: object

Represents a connection to a HANA system.

ConnectionContext includes methods for creating DataFrames from data on HANA. DataFrames are tied to a ConnectionContext, and are unusable once their ConnectionContext is closed.

Parameters:
Same as hdbcli.dbapi.connect. (See notes.)

Notes

The online docs for hdbcli.dbapi.connect are available at https://help.sap.com/viewer/0eec0d68141541d1b07893a39944924e/latest/en-US/ee592e89dcce4480a99571a4ae7a702f.html.

Examples

>>> with ConnectionContext('address', port, 'user', 'password') as cc:
...     df = (cc.table('MY_TABLE', schema='MY_SCHEMA')
...             .filter('COL3 > 5')
...             .select('COL1', 'COL2'))
...     pandas_df = df.collect()

Methods

close() Closes the existing connection.
sql(sql) Returns a DataFrame representing a query.
table(table, *[, schema]) Returns a DataFrame representing the specified table.
close()

Closes the existing connection.

Parameters:
None
Returns:
None
sql(sql)

Returns a DataFrame representing a query.

Parameters:
sql : str

SQL query.

Returns:
DataFrame

DataFrame for the query.

Examples

>>> df = cc.sql('SELECT T.A, T2.B FROM T, T2 WHERE T.C=T2.C')
table(table, *, schema=None)

Returns a DataFrame representing the specified table.

Parameters:
table : str

Table name.

schema : str, optional, keyword-only

Schema name. If not provided or set to None, defaults to the ConnectionContext’s current schema.

Returns:
DataFrame

DataFrame selecting data from that table.

Examples

>>> df1 = cc.table('MY_TABLE')
>>> df2 = cc.table('MY_OTHER_TABLE', schema='MY_SCHEMA')
class hana_ml.dataframe.DataFrame(connection_context, select_statement, _name=None)

Bases: object

This class represents a DataFrame that is backed by a database SQL statement.

Parameters:
connection_context : ConnectionContext

Connection to the HANA system.

select_statement : str

SQL query backing the dataframe.

Notes

Parameters beyond connection_context and select_statement are intended for internal usage. Do not rely on them; they may change without notice.

Attributes:
columns

Lists the current DataFrame’s column names.

name

Returns the name of the DataFrame.

quoted_name

The escaped name of the original DataFrame.

Methods

agg(agg_list[, group_by]) Returns a DataFrame with the group_by column along with the aggregates.
alias(alias) Returns a new DataFrame with an alias set.
bin(col[, strategy, bins, bin_width, bin_column]) Returns a DataFrame with original columns along with bin assignments.
cast(cols, new_type) Returns a DataFrame with columns cast to a new type.
collect() Copies the current DataFrame to a new Pandas DataFrame.
corr(first_col, second_col) Returns a DataFrame that gives the correlation coefficient between two numeric columns.
count() Computes the number of rows in the DataFrame.
describe([cols]) Returns a DataFrame containing various statistics for the requested column(s).
distinct([cols]) Returns a new DataFrame with distinct values for the specified columns.
drop(cols) Returns a new DataFrame without specified columns.
drop_duplicates([subset]) Returns a new DataFrame with duplicate rows removed.
dropna([how, thresh, subset]) Returns a new DataFrame with NULLs removed.
dtypes([subset]) Returns a sequence of tuples describing the DataFrame’s SQL types.
empty() Returns True if this DataFrame has 0 rows.
fillna(value[, subset]) Returns a DataFrame with NULLs replaced with a specified value.
filter(condition) Selects rows matching the given condition.
has(col) Returns True if a column is in the DataFrame.
hasna([cols]) Returns True if a DataFrame contains NULLs.
head([n]) Returns a DataFrame of the first n rows in the current DataFrame.
is_numeric([cols]) Returns True if the column(s) in the DataFrame are numeric.
join(other, condition[, how, select]) Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.
rename_columns(names) Returns a DataFrame with renamed columns.
save(where[, table_type]) Creates a table or view holding the current DataFrame’s data.
select(*cols) Returns a new DataFrame with columns derived from the current DataFrame.
sort(cols[, desc]) Returns a new DataFrame sorted by the specified columns.
to_head(col) Returns a DataFrame with specified column as the first item in projection.
union(other[, all]) Combine this DataFrame’s rows and another DataFrame’s rows into one DataFrame.
agg(agg_list, group_by=None)

Returns a DataFrame with the group_by column along with the aggregates.

Group-by columns in the returned DataFrame keep their original names; aggregates are named as specified in agg_list.

Parameters:
agg_list : list of tuple

A list of tuples. Each tuple is a triplet (aggregate_operator, expression, name), where:

  • aggregate_operator is one of ‘max’, ‘min’, ‘count’, ‘avg’

  • expression is a str that is a column or column expression

  • name is the name of this aggregate in the projection list.

group_by : str or list of str, optional

Group by column. Note that only plain columns are allowed here, although SQL itself permits expressions. To group by an expression, create the DataFrame from the full SQL query instead. For example, given a table T with integer columns C1, C2, and C3, calculating max(C1) grouped by (C2+C3) requires creating the DataFrame as:

cc.sql('SELECT "C2"+"C3", max("C1") FROM "T" GROUP BY "C2"+"C3"')

Returns:
DataFrame

DataFrame containing the group_by column if it exists along with the aggregate expressions that are aliased with the specified names.

Examples

Input:

>>> df.collect()
    ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM          SPECIES
0    1            5.1           3.5            1.4           0.2      Iris-setosa
1    2            4.9           3.0            1.4           0.2      Iris-setosa
2    3            4.7           3.2            1.3           0.2      Iris-setosa
3   51            7.0           3.2            4.7           1.4  Iris-versicolor
4   52            6.4           3.2            4.5           1.5  Iris-versicolor
5  101            6.3           3.3            6.0           2.5   Iris-virginica
6  102            5.8           2.7            5.1           1.9   Iris-virginica
7  103            7.1           3.0            5.9           2.1   Iris-virginica
8  104            6.3           2.9            5.6           1.8   Iris-virginica

Another way to do a count.

>>> df.agg([('count', 'SPECIES', 'COUNT')]).collect()
   COUNT
0      9

Get counts by SPECIES

>>> df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES').collect()
           SPECIES  COUNT
0  Iris-versicolor      2
1   Iris-virginica      4
2      Iris-setosa      3

Get max values of SEPALLENGTHCM by SPECIES

>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH')], group_by='SPECIES').collect()
           SPECIES  MAX_SEPAL_LENGTH
0  Iris-versicolor               7.0
1   Iris-virginica               7.1
2      Iris-setosa               5.1

Get max and min values of SEPALLENGTHCM by SPECIES

>>> df.agg([('max', 'SEPALLENGTHCM', 'MAX_SEPAL_LENGTH'),
    ('min', 'SEPALLENGTHCM', 'MIN_SEPAL_LENGTH')], group_by=['SPECIES']).collect()
           SPECIES  MAX_SEPAL_LENGTH  MIN_SEPAL_LENGTH
0  Iris-versicolor               7.0               6.4
1   Iris-virginica               7.1               5.8
2      Iris-setosa               5.1               4.7

Get aggregate grouping by multiple columns.

>>> df.agg([('count', 'SEPALLENGTHCM', 'COUNT_SEPAL_LENGTH')],
            group_by=['SPECIES', 'PETALLENGTHCM']).collect()
           SPECIES  PETALLENGTHCM  COUNT_SEPAL_LENGTH
0   Iris-virginica            6.0                   1
1      Iris-setosa            1.3                   1
2   Iris-virginica            5.9                   1
3   Iris-virginica            5.6                   1
4      Iris-setosa            1.4                   2
5  Iris-versicolor            4.7                   1
6  Iris-versicolor            4.5                   1
7   Iris-virginica            5.1                   1
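For readers who know pandas, the count-by-SPECIES aggregation above corresponds roughly to a local groupby. The sketch below runs on a hand-built copy of the example's SPECIES values, not on HANA:

```python
import pandas as pd

# Local pandas analogue of df.agg([('count', 'SPECIES', 'COUNT')], group_by='SPECIES'),
# using the species distribution from the example above (3 setosa, 2 versicolor, 4 virginica).
iris = pd.DataFrame({
    "SPECIES": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 4
})
counts = (iris.groupby("SPECIES", as_index=False)
              .agg(COUNT=("SPECIES", "count")))
print(counts)
```

The (operator, expression, name) triplet maps onto pandas named aggregation: the alias becomes the output column name, just as in the agg_list.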
alias(alias)

Returns a new DataFrame with an alias set.

Parameters:
alias : str

Name of the DataFrame.

Returns:
aliased_df : DataFrame

DataFrame with an alias set.

See also

DataFrame.rename_columns
For renaming individual columns.
bin(col, strategy='uniform_number', bins=None, bin_width=None, bin_column='BIN_NUMBER')

Returns a DataFrame with original columns along with bin assignments.

The names of the columns in the returned DataFrame are the same as in the original DataFrame. A column “BIN_NUMBER” (or the value specified in bin_column) is added and contains the assigned bin.

Parameters:
col : str

The column on which binning is performed. The column must be numeric.

strategy : {‘uniform_number’, ‘uniform_size’}
Binning methods:
  • ‘uniform_number’: Equal widths based on the number of bins.
  • ‘uniform_size’: Equal widths based on the bin size.
bins : int, optional

The number of equal width bins. Only valid when strategy is ‘uniform_number’. Defaults to 10.

bin_width : int, optional

The interval width of each bin. Only valid when strategy is ‘uniform_size’.

bin_column : str, optional

Name of the output column containing the bin number.

Returns:
DataFrame

Binned dataset with the same data as this DataFrame along with an additional column “BIN_NUMBER” or the value specified in bin_column. This additional column contains the assigned bin for each row.

Examples

Input:

>>> df.collect()
   C1   C2    C3       C4
0   1  1.2   2.0      1.0
1   2  1.4   4.0      3.0
2   3  1.6   6.0      9.0
3   4  1.8   8.0     27.0
4   5  2.0  10.0     81.0
5   6  2.2  12.0    243.0
6   7  2.4  14.0    729.0
7   8  2.6  16.0   2187.0
8   9  2.8  18.0   6561.0
9  10  3.0  20.0  19683.0

Create 5 bins of equal widths on C1

>>> df.bin('C1', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5

Create 5 bins of equal widths on C3

>>> df.bin('C3', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           3
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           4
7   8  2.6  16.0   2187.0           4
8   9  2.8  18.0   6561.0           5
9  10  3.0  20.0  19683.0           5

Create 5 bins of equal widths on a column that varies significantly

>>> df.bin('C4', strategy='uniform_number', bins=5).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           1
3   4  1.8   8.0     27.0           1
4   5  2.0  10.0     81.0           1
5   6  2.2  12.0    243.0           1
6   7  2.4  14.0    729.0           1
7   8  2.6  16.0   2187.0           1
8   9  2.8  18.0   6561.0           2
9  10  3.0  20.0  19683.0           5

Create bins of equal width

>>> df.bin('C1', strategy='uniform_size', bin_width=3).collect()
   C1   C2    C3       C4  BIN_NUMBER
0   1  1.2   2.0      1.0           1
1   2  1.4   4.0      3.0           1
2   3  1.6   6.0      9.0           2
3   4  1.8   8.0     27.0           2
4   5  2.0  10.0     81.0           2
5   6  2.2  12.0    243.0           3
6   7  2.4  14.0    729.0           3
7   8  2.6  16.0   2187.0           3
8   9  2.8  18.0   6561.0           4
9  10  3.0  20.0  19683.0           4
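The ‘uniform_number’ assignments above are consistent with a simple equal-width rule over [min, max]. The sketch below reproduces them locally; note the exact formula (especially edge handling at the maximum) is an assumption inferred from the observed outputs, not taken from the library's source:

```python
def uniform_number_bins(values, bins):
    """Assign 1-based equal-width bin numbers over [min(values), max(values)].

    Illustrative reconstruction of the 'uniform_number' strategy; the
    library's exact boundary handling may differ.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # Clamp so the maximum value falls in the last bin rather than bins + 1.
    return [min(int((v - lo) / width) + 1, bins) for v in values]

c1 = list(range(1, 11))
c4 = [3.0 ** n for n in range(10)]   # 1, 3, 9, ..., 19683

print(uniform_number_bins(c1, 5))  # matches the C1 example above
print(uniform_number_bins(c4, 5))  # matches the C4 example above
```

This also explains the C4 result: because the bin edges are equally spaced in value, a column that grows exponentially piles almost every row into the first bin.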
cast(cols, new_type)

Returns a DataFrame with columns cast to a new type.

The name of the column in the returned DataFrame is the same as the original column.

Warning

Type compatibility between the existing column type and the new type is not checked. An incompatibility will result in an error.

Parameters:
cols : str or list

The column(s) to be cast to a different type.

new_type : str

The database datatype to cast the column(s) to. No checks are performed to see if the new type is valid. An invalid type can lead to SQL errors or even SQL injection vulnerabilities.

Returns:
DataFrame

The same data as this DataFrame but with columns cast to the specified type.

Examples

Input:

>>> df1 = cc.sql('SELECT "AGE", "PDAYS", "HOUSING" FROM DBM_TRAINING_TBL')
>>> df1.dtypes()
[('AGE', 'INT', 10), ('PDAYS', 'INT', 10), ('HOUSING', 'VARCHAR', 100)]

Casting a column to NVARCHAR(20):

>>> df2 = df1.cast('AGE', 'NVARCHAR(20)')
>>> df2.dtypes()
[('AGE', 'NVARCHAR', 20), ('PDAYS', 'INT', 10), ('HOUSING', 'VARCHAR', 100)]

Casting a list of columns to NVARCHAR(50):

>>> df3 = df1.cast(['AGE', 'PDAYS'], 'NVARCHAR(50)')
>>> df3.dtypes()
[('AGE', 'NVARCHAR', 50), ('PDAYS', 'NVARCHAR', 50), ('HOUSING', 'VARCHAR', 100)]
collect()

Copies the current DataFrame to a new Pandas DataFrame.

Parameters:
None
Returns:
pandas_df : pandas.DataFrame

Pandas DataFrame containing the current DataFrame’s data.

Examples

Viewing a hana_ml DataFrame doesn’t reveal much on its own, because it doesn’t execute the underlying SQL or fetch data:

>>> df = cc.table('T')
>>> df
<hana_ml.dataframe.DataFrame object at 0x7f2b7f24ddd8>

Using collect() will execute the SQL and fetch the results into a Pandas DataFrame:

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> type(df.collect())
<class 'pandas.core.frame.DataFrame'>
columns

Lists the current DataFrame’s column names. Computed lazily and cached.

Each access to this property creates a new copy; mutating the list will not alter or corrupt the DataFrame.

corr(first_col, second_col)

Returns a DataFrame that gives the correlation coefficient between two numeric columns.

All rows with NULL values for first_col or second_col are removed prior to calculating the correlation coefficient.

The correlation coefficient is:
1/(n-1) * sum((col1 - avg(col1)) * (col2 - avg(col2))) / (stddev(col1) * stddev(col2))
Parameters:
first_col : str

The first column for calculating the correlation coefficient.

second_col : str

The second column for calculating the correlation coefficient.

Returns:
DataFrame

A DataFrame with one value containing the correlation coefficient. The name of the column is CORR_COEFF.

Examples

Input:

>>> df.columns
['C1', 'C2', 'C3', 'C4']
>>> df.collect()
   C1   C2      C3       C4
0   1  1.2     2.0      1.0
1   2  1.4     4.0      3.0
2   3  1.6     8.0      9.0
3   4  1.8    16.0     27.0
4   5  2.0    32.0     81.0
5   6  2.2    64.0    243.0
6   7  2.4   128.0    729.0
7   8  2.6   256.0   2187.0
8   9  2.8   512.0   6561.0
9  10  3.0  1024.0  19683.0

Correlation between columns that are well correlated:

>>> df.corr('C1', 'C2').collect()
   CORR_COEFF
0         1.0
>>> df.corr('C1', 'C3').collect()
   CORR_COEFF
0         1.0

Correlation with a column whose value is 3 times its previous value:

>>> df.corr('C1', 'C4').collect()
   CORR_COEFF
0    0.696325
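The coefficients for C1 against C2 and C4 can be checked locally: the formula above is the sample Pearson correlation, and C2 is a linear function of C1. A plain-Python sketch on the example data:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (stddev(x) * stddev(y))."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))

c1 = list(range(1, 11))
c2 = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0]
c4 = [3.0 ** n for n in range(10)]   # 1, 3, 9, ..., 19683

print(round(pearson(c1, c2), 6))  # 1.0, since C2 is linear in C1
print(round(pearson(c1, c4), 6))  # ~0.696325, as in the example
```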

count()

Computes the number of rows in the DataFrame.

Parameters:
None
Returns:
count : int

Number of rows in the DataFrame.

describe(cols=None)

Returns a DataFrame containing various statistics for the requested column(s).

Parameters:
cols : str or list

The column(s) to be described. Defaults to all columns.

Returns:
DataFrame

A DataFrame containing statistics for the specified column(s) in the current DataFrame.

The statistics included are:

  • count of rows (“count”)

  • number of distinct values (“unique”)

  • number of nulls (“nulls”)

  • average (“mean”)

  • standard deviation (“std”)

  • median (“median”)

  • minimum value (“min”)

  • maximum value (“max”)

  • 25%, 50%, and 75% percentiles, both when treated as a continuous variable (“25_percent_cont”, “50_percent_cont”, “75_percent_cont”) and when treated as a discrete variable (“25_percent_disc”, “50_percent_disc”, “75_percent_disc”).

For columns that are strings, statistics such as average (“mean”), standard deviation (“std”), median (“median”), and the various percentiles are NULLs.

If the list of columns contains both string and numeric data types, minimum and maximum values become NULLs.

Examples

Input:

>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['LABEL',
 'ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED']

Describe a few numeric columns and collect to return a Pandas DataFrame:

>>> df1.describe(['AGE', 'PDAYS']).collect()
  column  count  unique  nulls        mean         std  min  max  median
0    AGE  16895      78      0   40.051376   10.716907   17   98      38
1  PDAYS  16895      24      0  944.406688  226.331944    0  999     999
   25_percent_cont  25_percent_disc  50_percent_cont  50_percent_disc
0             32.0               32             38.0               38
1            999.0              999            999.0              999
   75_percent_cont  75_percent_disc
0             47.0               47
1            999.0              999

Describe a few non-numeric columns and collect to return a Pandas DataFrame:

>>> df1.describe(['JOB', 'MARITAL']).collect()
    column  count  unique  nulls  mean   std       min      max median
0      JOB  16895      12      0  None  None    admin.  unknown   None
1  MARITAL  16895       4      0  None  None  divorced  unknown   None
  25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc
0            None            None            None            None
1            None            None            None            None
  75_percent_cont 75_percent_disc
0            None            None
1            None            None

Describe all columns in a DataFrame:

>>> df1.describe().collect()
            column  count  unique  nulls          mean           std
0               ID  16895   16895      0  21282.286652  12209.759725
1              AGE  16895      78      0     40.051376     10.716907
2         DURATION  16895    1267      0    263.965670    264.331384
3         CAMPAIGN  16895      35      0      2.344658      2.428449
4            PDAYS  16895      24      0    944.406688    226.331944
5         PREVIOUS  16895       7      0      0.209529      0.539450
6     EMP_VAR_RATE  16895      10      0     -0.038798      1.621945
7   CONS_PRICE_IDX  16895      26      0     93.538844      0.579189
8    CONS_CONF_IDX  16895      26      0    -40.334123      4.865720
9        EURIBOR3M  16895     283      0      3.499297      1.777986
10      NREMPLOYED  16895      11      0   5160.371885     75.320580
11             JOB  16895      12      0           NaN           NaN
12         MARITAL  16895       4      0           NaN           NaN
13       EDUCATION  16895       8      0           NaN           NaN
14     DBM_DEFAULT  16895       2      0           NaN           NaN
15         HOUSING  16895       3      0           NaN           NaN
16            LOAN  16895       3      0           NaN           NaN
17         CONTACT  16895       2      0           NaN           NaN
18       DBM_MONTH  16895      10      0           NaN           NaN
19     DAY_OF_WEEK  16895       5      0           NaN           NaN
20        POUTCOME  16895       3      0           NaN           NaN
21           LABEL  16895       2      0           NaN           NaN
         min        max     median  25_percent_cont  25_percent_disc
0      5.000  41187.000  21786.000        10583.500        10583.000
1     17.000     98.000     38.000           32.000           32.000
2      0.000   4918.000    184.000          107.000          107.000
3      1.000     43.000      2.000            1.000            1.000
4      0.000    999.000    999.000          999.000          999.000
5      0.000      6.000      0.000            0.000            0.000
6     -3.400      1.400      1.100           -1.800           -1.800
7     92.201     94.767     93.444           93.075           93.075
8    -50.800    -26.900    -41.800          -42.700          -42.700
9      0.634      5.045      4.856            1.313            1.313
10  4963.000   5228.000   5191.000         5099.000         5099.000
11       NaN        NaN        NaN              NaN              NaN
12       NaN        NaN        NaN              NaN              NaN
13       NaN        NaN        NaN              NaN              NaN
14       NaN        NaN        NaN              NaN              NaN
15       NaN        NaN        NaN              NaN              NaN
16       NaN        NaN        NaN              NaN              NaN
17       NaN        NaN        NaN              NaN              NaN
18       NaN        NaN        NaN              NaN              NaN
19       NaN        NaN        NaN              NaN              NaN
20       NaN        NaN        NaN              NaN              NaN
21       NaN        NaN        NaN              NaN              NaN
    50_percent_cont  50_percent_disc  75_percent_cont  75_percent_disc
0         21786.000        21786.000        32067.500        32068.000
1            38.000           38.000           47.000           47.000
2           184.000          184.000          324.000          324.000
3             2.000            2.000            3.000            3.000
4           999.000          999.000          999.000          999.000
5             0.000            0.000            0.000            0.000
6             1.100            1.100            1.400            1.400
7            93.444           93.444           93.994           93.994
8           -41.800          -41.800          -36.400          -36.400
9             4.856            4.856            4.961            4.961
10         5191.000         5191.000         5228.000         5228.000
11              NaN              NaN              NaN              NaN
12              NaN              NaN              NaN              NaN
13              NaN              NaN              NaN              NaN
14              NaN              NaN              NaN              NaN
15              NaN              NaN              NaN              NaN
16              NaN              NaN              NaN              NaN
17              NaN              NaN              NaN              NaN
18              NaN              NaN              NaN              NaN
19              NaN              NaN              NaN              NaN
20              NaN              NaN              NaN              NaN
21              NaN              NaN              NaN              NaN
distinct(cols=None)

Returns a new DataFrame with distinct values for the specified columns. If no columns are specified, the distinct row values from all columns are returned.

Parameters:
cols : str or list of str, optional

Column or list of columns to consider when getting distinct values. Defaults to use all columns.

Returns:
distinct_df : DataFrame

DataFrame with distinct values for cols.

Examples

Input:

>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100

Distinct values in a column:

>>> df.distinct("B").collect()
   B
0  A
1  B

Distinct values of a subset of columns:

>>> df.distinct(["A", "B"]).collect()
   A  B
0  1  B
1  2  A
2  1  A

Distinct values of the entire data set:

>>> df.distinct().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
drop(cols)

Returns a new DataFrame without specified columns.

Parameters:
cols : list of str

List of column names to drop.

Returns:
filtered_df : DataFrame

New DataFrame retaining only columns not in cols.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.drop(['B']).collect()
   A
0  1
1  2
drop_duplicates(subset=None)

Returns a new DataFrame with duplicate rows removed. All columns in the DataFrame are returned. There is no way to keep specific duplicate rows.

Warning

Specifying a non-None value of subset may produce an “unstable” DataFrame, the contents of which may be different every time you look at it. Specifically, if two rows are duplicates in their subset columns and have different values in other columns, a different row could be picked every time you look at the result.

Parameters:
subset : list of str, optional

List of columns to consider when deciding whether rows are duplicates of each other. Defaults to use all columns.

Returns:
deduplicated_df : DataFrame

DataFrame with only one copy of duplicate rows.

Examples

Input:

>>> df.collect()
   A  B    C
0  1  A  100
1  1  A  101
2  1  A  102
3  1  B  100
4  1  B  101
5  1  B  102
6  1  B  103
7  2  A  100
8  2  A  100

Drop duplicates based on the values of a subset of columns:

>>> df.drop_duplicates(["A", "B"]).collect()
   A  B    C
0  1  A  100
1  1  B  100
2  2  A  100

Distinct values on the entire data set:

>>> df.drop_duplicates().collect()
   A  B    C
0  1  A  102
1  1  B  103
2  1  A  101
3  2  A  100
4  1  B  101
5  1  A  100
6  1  B  100
7  1  B  102
dropna(how=None, thresh=None, subset=None)

Returns a new DataFrame with NULLs removed.

Parameters:
how : {‘any’, ‘all’}, optional

If provided, ‘any’ eliminates rows with any NULLs, and ‘all’ eliminates rows that are entirely NULLs. If neither how nor thresh are provided, how defaults to ‘any’.

thresh : int, optional

If provided, rows with fewer than thresh non-NULL values are dropped. how and thresh cannot both be provided.

subset : list of str, optional

Columns to consider when looking for NULLs. Values in other columns will be ignored, NULL or not. Defaults to all columns.

Returns:
filtered_df : DataFrame

New DataFrame with select statement that removes NULLs.

Examples

Dropping rows with any nulls:

>>> df.collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN
3  NaN  NaN  NaN
>>> df.dropna().collect()
     A    B    C
0  1.0  3.0  5.0

Dropping rows that are entirely nulls:

>>> df.dropna(how='all').collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
2  3.0  NaN  NaN

Dropping rows with less than 2 non-null values:

>>> df.dropna(thresh=2).collect()
     A    B    C
0  1.0  3.0  5.0
1  2.0  4.0  NaN
dtypes(subset=None)

Returns a sequence of tuples describing the DataFrame’s SQL types.

The tuples list the name, SQL type name, and display size corresponding to the DataFrame’s columns.

Parameters:
subset : list of str, optional

The columns that the information will be generated from. Defaults to all columns.

Returns:
dtypes : list of tuples

Each tuple consists of the name, SQL type name, and display size for one of the DataFrame’s columns. The list will be in the order specified by the subset, or in the DataFrame’s original column order if a subset is not provided.

empty()

Returns True if this DataFrame has 0 rows.

Parameters:
None
Returns:
empty : bool

True if the DataFrame is empty.

Notes

If a DataFrame contains only NULLs, it is not considered empty.

Examples

>>> df1.collect()
Empty DataFrame
Columns: [ACTUAL, PREDICT]
Index: []
>>> df1.empty()
True
>>> df2.collect()
  ACTUAL PREDICT
0   None    None
>>> df2.empty()
False
fillna(value, subset=None)

Returns a DataFrame with NULLs replaced with a specified value.

Currently only supports filling numeric columns.

Parameters:
value : int or float

The value that will replace NULL. value should have a type appropriate for the selected columns.

subset : list of str, optional

List of columns that will have their NULL values replaced. Defaults to all columns.

Returns:
filled_df : DataFrame

New DataFrame with the NULL values replaced.
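A sketch of the intended semantics on a local pandas DataFrame (hana_ml's fillna runs on the database side; this merely illustrates how subset restricts which columns are filled):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0],
                   "B": [None, 5.0, None]})

# Replace NULLs with 0, but only in column "A"; "B" keeps its NULLs.
filled = df.fillna({"A": 0})
print(filled)
```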

filter(condition)

Selects rows matching the given condition.

Very little checking is done on the condition string. Use only with trusted inputs.

Parameters:
condition : str

Filter condition. Format as SQL <condition>.

Returns:
filtered_df : DataFrame

DataFrame with rows matching the given condition.

Raises:
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in condition. May have false positives and false negatives.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.filter('B < 4').collect()
   A  B
0  1  3
has(col)

Returns True if a column is in the DataFrame.

Parameters:
col : str

Name of column to search in the projection list of this DataFrame.

Returns:
has : bool

Returns True if the column exists in the DataFrame’s projection list.

Examples

>>> df.columns
['A', 'B']
>>> df.has('A')
True
>>> df.has('C')
False
hasna(cols=None)

Returns True if a DataFrame contains NULLs.

Parameters:
cols : str or list of str, optional

Column or list of columns to be checked for NULL values. Defaults to all columns.

Returns:
hasna : bool

True if this DataFrame contains NULLs.

Examples

>>> df1.collect()
  ACTUAL PREDICT
0   1.0    None
>>> df1.hasna()
True
head(n=1)

Returns a DataFrame of the first n rows in the current DataFrame.

Parameters:
n : int, optional

Number of rows returned. Default value: 1.

Returns:
sub_df : DataFrame

New DataFrame of the first n rows of the current DataFrame.
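The effect mirrors pandas head, shown below on a local frame. Note that on a HANA-backed DataFrame the “first” rows follow the underlying query's order, which is typically not guaranteed unless you sort first:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df.head(2))  # the first two rows
```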

is_numeric(cols=None)

Returns True if the column(s) in the DataFrame are numeric.

Parameters:
cols : str or list

The column(s) to be tested for being numeric. Defaults to all columns.

Returns:
bool

True if all the cols are numeric.

Examples

Input:

>>> df.head(5).collect()
ID  SEPALLENGTHCM  SEPALWIDTHCM  PETALLENGTHCM  PETALWIDTHCM      SPECIES
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
>>> pprint.pprint(df.dtypes())
[('ID', 'INT', 10),
 ('SEPALLENGTHCM', 'DOUBLE', 15),
 ('SEPALWIDTHCM', 'DOUBLE', 15),
 ('PETALLENGTHCM', 'DOUBLE', 15),
 ('PETALWIDTHCM', 'DOUBLE', 15),
 ('SPECIES', 'NVARCHAR', 15)]

Test a single column

>>> df.is_numeric('ID')
True
>>> df.is_numeric('SEPALLENGTHCM')
True
>>> df.is_numeric(['SPECIES'])
False

Test a list of columns

>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'PETALWIDTHCM'])
True
>>> df.is_numeric(['SEPALLENGTHCM', 'PETALLENGTHCM', 'SPECIES'])
False
join(other, condition, how='inner', select=None)

Returns a new DataFrame that is a join of the current DataFrame with another specified DataFrame.

Parameters:
other : DataFrame

The DataFrame to join with.

condition : str

Join predicate.

how : {‘inner’, ‘left’, ‘right’, ‘outer’}, optional

Type of join. Defaults to ‘inner’.

select : list, optional

If provided, each element specifies a column in the result. A string in the select list should be the name of a column in one of the input dataframes. An (expression, name) tuple creates a new column with the given name, computed from the given expression. If not provided, defaults to selecting all columns from both dataframes, with the left dataframe’s columns first.

Returns:
joined_df : DataFrame

A new DataFrame object made from the join of the current DataFrame with another DataFrame.

Raises:
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in condition or in a column expression. May have false positives and false negatives.

Examples

The expression selection functionality can be used to disambiguate duplicate column names in a join:

>>> df1.collect()
   A  B    C
0  1  2  3.5
1  2  4  5.6
2  3  3  1.1
>>> df2.collect()
   A  B     D
0  2  1  14.0
1  3  4   5.6
2  4  3   0.0
>>> df1.alias('L').join(df2.alias('R'), 'L.A = R.A', select=[
...     ('L.A', 'A'),
...     ('L.B', 'B_LEFT'),
...     ('R.B', 'B_RIGHT'),
...     'C',
...     'D']).collect()
   A  B_LEFT  B_RIGHT    C     D
0  2       4        1  5.6  14.0
1  3       3        4  1.1   5.6
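A join like the one above ultimately becomes a SELECT ... JOIN ... ON statement. The sketch below shows one plausible way such SQL could be assembled from the join parameters; it is an illustration under assumptions, not the SQL hana_ml actually emits, and it assumes the DataFrame names are already safely escaped.

```python
# Illustrative sketch of join-SQL assembly -- the actual statement
# hana_ml generates may differ. Names are assumed pre-escaped.
def join_sql(left_name, right_name, condition, how='inner', select=None):
    """Build a SELECT ... JOIN statement from join() style parameters.

    `select` mirrors the documented behavior: strings name columns,
    (expression, name) tuples create aliased computed columns, and
    None selects everything.
    """
    if select is None:
        projection = '*'
    else:
        parts = []
        for col in select:
            if isinstance(col, tuple):
                expr, alias = col
                parts.append('{} AS "{}"'.format(expr, alias))
            else:
                parts.append('"{}"'.format(col))
        projection = ', '.join(parts)
    join_type = {'inner': 'INNER JOIN',
                 'left': 'LEFT OUTER JOIN',
                 'right': 'RIGHT OUTER JOIN',
                 'outer': 'FULL OUTER JOIN'}[how]
    return 'SELECT {} FROM {} {} {} ON {}'.format(
        projection, left_name, join_type, right_name, condition)
```

For the aliased example above, this produces a projection disambiguating the two B columns while joining on L.A = R.A.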
name

Returns the name of the DataFrame. Does not correspond to any HANA table name.

This is useful for join predicates when the joining DataFrames have columns with the same name.

quoted_name

The escaped name of the original DataFrame.

Default-generated DataFrame names are safe to use in SQL without escaping, but names set with DataFrame.alias may require escaping.

rename_columns(names)

Returns a DataFrame with renamed columns.

Parameters:
names : list or dict

If a list, specifies new names for every column in this dataframe. If a dict, each dict entry maps an existing name to a new name, and not all columns need to be renamed.

Returns:
DataFrame

The same data as the original DataFrame, with new column names.

See also

DataFrame.alias
For renaming the DataFrame itself.

Examples

>>> df.collect()
   A  B
0  1  3
1  2  4
>>> df.rename_columns(['C', 'D']).collect()
   C  D
0  1  3
1  2  4
>>> df.rename_columns({'B': 'D'}).collect()
   A  D
0  1  3
1  2  4
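Both forms of `names` boil down to an aliased SELECT list over the existing columns. The following hedged sketch shows how the list and dict forms could map onto such a projection; it is illustrative only, not hana_ml's implementation.

```python
# Illustrative sketch only -- not hana_ml's actual implementation.
def rename_projection(columns, names):
    """Build an aliased SELECT list for a rename_columns-style call.

    `columns` is the current column list; `names` is either a full
    replacement list (one new name per column) or an old->new dict
    where unmentioned columns keep their names.
    """
    if isinstance(names, dict):
        new_names = [names.get(col, col) for col in columns]
    else:
        if len(names) != len(columns):
            raise ValueError('need exactly one new name per column')
        new_names = names
    return ', '.join('"{}" AS "{}"'.format(old, new)
                     for old, new in zip(columns, new_names))
```

For the examples above, the dict form renames only B to D, while the list form must cover every column.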
save(where, table_type=None)

Creates a table or view holding the current DataFrame’s data.

Parameters:
where : str or (str, str) tuple

Table name or (schema name, table name) tuple. If no schema is provided, the table or view will be created in the current schema.

table_type : str, optional

Type of table to create. Case-insensitive.

Permanent table options:
  • “ROW”
  • “COLUMN”
  • “HISTORY COLUMN”
Temporary table options:
  • “GLOBAL TEMPORARY”
  • “GLOBAL TEMPORARY COLUMN”
  • “LOCAL TEMPORARY”
  • “LOCAL TEMPORARY COLUMN”
Not a table:
  • “VIEW”

Defaults to ‘LOCAL TEMPORARY COLUMN’ if where starts with ‘#’. Otherwise, the default is ‘COLUMN’.

Returns:
DataFrame

DataFrame representing the new table or view.

Notes

For this operation to succeed, the table name must not be in use, the schema must exist, and the user must have permission to create tables (or views) in the target schema.
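Conceptually, save() wraps the DataFrame's SELECT statement in a CREATE TABLE (or CREATE VIEW) statement. The sketch below assembles such DDL from the documented parameters, including the '#'-prefix default; it is a hedged illustration, and the DDL hana_ml actually issues may differ.

```python
# Illustrative sketch of the DDL save() implies -- the statement
# hana_ml actually issues may differ.
def create_statement(where, select_statement, table_type=None):
    """Build a CREATE ... AS (...) statement from save()-style args.

    `where` is a table name or a (schema, table) tuple;
    `table_type` defaults per the rules documented above.
    """
    if isinstance(where, tuple):
        schema, table = where
        target = '"{}"."{}"'.format(schema, table)
    else:
        table = where
        target = '"{}"'.format(table)
    if table_type is None:
        table_type = ('LOCAL TEMPORARY COLUMN' if table.startswith('#')
                      else 'COLUMN')
    table_type = table_type.upper()
    if table_type == 'VIEW':
        return 'CREATE VIEW {} AS ({})'.format(target, select_statement)
    return 'CREATE {} TABLE {} AS ({})'.format(
        table_type, target, select_statement)
```

Note how a '#'-prefixed name falls back to a local temporary column table, matching the documented default.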

select(*cols)

Returns a new DataFrame with columns derived from the current DataFrame.

Warning

There is no check that inputs interpreted as SQL expressions are actually valid expressions; an “expression” like “A FROM TAB; DROP TABLE IMPORTANT_THINGS; SELECT A” can cause a lot of damage.

Parameters:
cols : str or (str, str) tuple

Columns of the new DataFrame. A string is treated as the name of a column to select; a (str, str) tuple is treated as (SQL expression, alias). As a special case, ‘*’ is expanded to all columns of the original DataFrame.

Returns:
projected_df : DataFrame

New DataFrame object with the specified columns.

Raises:
hana_ml.ml_exceptions.BadSQLError

If comments or malformed tokens are detected in a column expression. May have false positives and false negatives.

Examples

Input:

>>> df.collect()
   A  B  C
0  1  2  3

Selecting a subset of existing columns:

>>> df.select('A', 'B').collect()
   A  B
0  1  2

Computing a new column based on existing columns:

>>> df.select('*', ('B * 4', 'D')).collect()
   A  B  C  D
0  1  2  3  8
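As a rough illustration of the kind of scan behind BadSQLError, the check below flags expressions containing comment markers or statement separators. This is a hedged sketch, not hana_ml's actual check; as the docs note, such scans can produce both false positives and false negatives.

```python
import re

# Illustrative sketch only -- not hana_ml's actual validation logic.
def looks_malicious(expression):
    """Return True if an SQL expression contains comment markers
    ('--', '/*', '*/') or a statement separator (';')."""
    return bool(re.search(r'--|/\*|\*/|;', expression))
```

It accepts an ordinary expression like 'B * 4' but rejects the injection example from the warning above.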
sort(cols, desc=False)

Returns a new DataFrame sorted by the specified columns.

Parameters:
cols : str or list of str

Column or list of columns to sort by. If a list is given, the sort order specified by desc is applied to all columns.

desc : boolean, optional

Set to True to sort in descending order. Defaults to False, for ascending order.

Returns:
sorted_df : DataFrame

New DataFrame object with rows sorted as specified.

to_head(col)

Returns a DataFrame with the specified column as the first column in the projection.

Parameters:
col : str

The column to move to the head of the projection.

Returns:
DataFrame

The same data as this DataFrame but with the named column in the first position.

Examples

Input:

>>> df1 = cc.table("DBM_TRAINING")
>>> import pprint
>>> pprint.pprint(df1.columns)
['ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED',
 'LABEL']

Moving the column ‘LABEL’ to head:

>>> df2 = df1.to_head('LABEL')
>>> pprint.pprint(df2.columns)
['LABEL',
 'ID',
 'AGE',
 'JOB',
 'MARITAL',
 'EDUCATION',
 'DBM_DEFAULT',
 'HOUSING',
 'LOAN',
 'CONTACT',
 'DBM_MONTH',
 'DAY_OF_WEEK',
 'DURATION',
 'CAMPAIGN',
 'PDAYS',
 'PREVIOUS',
 'POUTCOME',
 'EMP_VAR_RATE',
 'CONS_PRICE_IDX',
 'CONS_CONF_IDX',
 'EURIBOR3M',
 'NREMPLOYED']
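The column reordering above is a simple list operation: move the named column to the front and keep the relative order of the rest. A minimal sketch of that behavior (illustrative, not hana_ml's code):

```python
# Illustrative sketch only -- not hana_ml's actual implementation.
def to_head(columns, col):
    """Move `col` to the front of a column list, preserving the
    relative order of the remaining columns."""
    if col not in columns:
        raise ValueError('column {!r} not found'.format(col))
    return [col] + [c for c in columns if c != col]
```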
union(other, all=True)

Combines this DataFrame’s rows and another DataFrame’s rows into one DataFrame. Equivalent to a SQL UNION ALL by default.

Parameters:
other : DataFrame

Right side of the union.

all : bool, optional

If True, keep duplicate rows, equivalent to UNION ALL in SQL. If False, keep only one copy of duplicate rows (even if they come from the same side of the union), equivalent to a UNION or a UNION ALL followed by DISTINCT in SQL. Defaults to True.

Returns:
DataFrame

Combined data from self and other.

Examples

We have two DataFrames we want to union, with some duplicate rows:

>>> df1.collect()
   A  B
0  1  2
1  1  2
2  2  3
>>> df2.collect()
   A  B
0  2  3
1  3  4

union() will produce a DataFrame containing all rows of both df1 and df2, like a UNION ALL:

>>> df1.union(df2).collect()
   A  B
0  1  2
1  1  2
2  2  3
3  2  3
4  3  4

To get the deduplication effect of a UNION DISTINCT, pass all=False or call distinct() after union():

>>> df1.union(df2, all=False).collect()
   A  B
0  1  2
1  2  3
2  3  4
>>> df1.union(df2).distinct().collect()
   A  B
0  1  2
1  2  3
2  3  4
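The UNION ALL vs. UNION distinction can be demonstrated on plain row tuples: UNION ALL simply concatenates, while UNION keeps the first occurrence of each row, even when duplicates come from the same side. A small sketch of those semantics (not hana_ml code, which does the work in SQL on the server):

```python
# Illustrative sketch of UNION ALL vs. UNION semantics on row tuples.
# hana_ml performs the equivalent work in SQL on the server.
def union(rows_left, rows_right, all=True):
    """Return the combined rows; with all=False, keep only the
    first occurrence of each duplicate row."""
    combined = rows_left + rows_right
    if all:
        return combined
    seen, out = set(), []
    for row in combined:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out
```

Running this on the example data above reproduces the documented results: five rows with all=True, three distinct rows with all=False.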
hana_ml.dataframe.quotename(name)

Escape a schema, table, or column name for use in SQL.

hana_ml functions and methods that take schema, table, or column names already escape their input by default, but those that take SQL don’t (and can’t) perform escaping automatically.

Parameters:
name : str

Schema, table, or column name.

Returns:
str

Escaped name. The string is enclosed in double quotation marks, and any embedded double quotation marks are escaped by doubling them.
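The escaping rule described above is standard SQL identifier quoting and fits in a one-line sketch (illustrative; the library's own quotename should be used in practice):

```python
# Illustrative sketch of standard SQL identifier quoting:
# enclose in double quotes, double any embedded double quotes.
def quotename(name):
    """Escape a schema, table, or column name for use in SQL."""
    return '"' + name.replace('"', '""') + '"'
```

For example, a plain name gains surrounding quotes, and a name containing a double quote has that quote doubled.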