train_test_val_split

hana_ml.algorithms.pal.partition.train_test_val_split(data, id_column=None, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)

The algorithm randomly partitions an input dataset into three disjoint subsets: training, testing, and validation. Note that the union of these three subsets need not cover the complete input dataset.

Note that the dataset must have an ID column. The ID column can be specified explicitly; otherwise, the first column of the DataFrame is assumed to hold the ID.

Two partition methods are available:

Random Partition, which randomly divides all the data.
Stratified Partition, which divides each subpopulation randomly.

For a stratified partition, the dataset needs at least one categorical attribute (for example, of type VARCHAR). The input dataset is first subdivided according to the distinct values of this attribute. Each of the resulting mutually exclusive subsets is then randomly split to obtain the training, testing, and validation subsets. This ensures that every categorical value (stratum) is present in the sampled subsets.
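The difference between the two methods can be sketched in plain Python. This is only an illustration of the idea, not the PAL implementation: the helper names random_partition and stratified_partition, the (id, stratum) row layout, and the truncating rounding are all assumptions made for the example.

```python
import random
from collections import defaultdict


def random_partition(ids, train_pct, test_pct, val_pct, rng):
    """Shuffle the ids and cut them into three disjoint lists."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Assumption: row counts are truncated to integers.
    n_train = int(n * train_pct)
    n_test = int(n * test_pct)
    n_val = int(n * val_pct)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:n_train + n_test + n_val]
    return train, test, val


def stratified_partition(rows, train_pct, test_pct, val_pct, rng):
    """Split each stratum separately; rows is a list of (id, stratum) pairs."""
    by_stratum = defaultdict(list)
    for row_id, stratum in rows:
        by_stratum[stratum].append(row_id)
    train, test, val = [], [], []
    # Sorted iteration keeps the result reproducible for a given seed.
    for stratum in sorted(by_stratum):
        tr, te, va = random_partition(
            by_stratum[stratum], train_pct, test_pct, val_pct, rng
        )
        train += tr
        test += te
        val += va
    return train, test, val


# Hypothetical data: 80 rows in stratum "A", 20 rows in stratum "B".
rows = [(i, "A" if i < 80 else "B") for i in range(100)]
rng = random.Random(42)
train, test, val = stratified_partition(rows, 0.5, 0.25, 0.25, rng)

# Percentages that sum to less than 1 leave some rows unassigned.
partial = random_partition(range(100), 0.5, 0.25, 0.125, rng)
```

In this sketch both strata appear in each of the three subsets, and the second call assigns only 87 of the 100 rows, which is one way the union of the subsets can fall short of the full dataset.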

Parameters:

data: DataFrame

DataFrame to be partitioned.

id_column: str, optional

Indicates which column to use as the ID column. Defaults to the first column.

random_seed: int, optional

Indicates the seed used to initialize the random number generator.

0: Uses the system time.
Not 0: Uses the specified seed.

Defaults to 0.
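The seed convention described above can be mimicked in plain Python as follows. This is a hedged sketch only: make_rng and shuffled_ids are hypothetical helpers, and Python's random module stands in for whatever generator PAL uses internally.

```python
import random
import time


def make_rng(random_seed=0):
    """Return a generator following the 0-means-system-time convention."""
    # 0: seed from the clock, so runs differ; otherwise use the seed as given.
    seed = time.time_ns() if random_seed == 0 else random_seed
    return random.Random(seed)


def shuffled_ids(n, random_seed):
    """Shuffle the ids 0..n-1 with a generator built from random_seed."""
    ids = list(range(n))
    make_rng(random_seed).shuffle(ids)
    return ids


# A fixed non-zero seed reproduces the same ordering on every call.
same_a = shuffled_ids(10, random_seed=7)
same_b = shuffled_ids(10, random_seed=7)
```

A non-zero seed is therefore the choice to make when a partition has to be reproducible across runs.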

thread_ratio: float, optional

Specifies the ratio of the total number of threads that this function can use.

The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads.

Values outside the range will be ignored and this function heuristically determines the number of threads to use.

Defaults to 0.

partition_method: {'random', 'stratified'}, optional

Partition method:

'random': random partition.
'stratified': stratified partition.

Defaults to 'random'.

stratified_column: str, optional

Indicates which column is used for stratification.

Valid only when partition_method is set to 'stratified' (stratified partition).

No default value.

training_percentage: float, optional

The proportion of the data assigned to the training set, given as a fraction.

Value range: 0 <= value <= 1.

Defaults to 0.8.

testing_percentage: float, optional

The proportion of the data assigned to the testing set, given as a fraction.

Value range: 0 <= value <= 1.

Defaults to 0.1.

validation_percentage: float, optional

The proportion of the data assigned to the validation set, given as a fraction.

Value range: 0 <= value <= 1.

Defaults to 0.1.

training_size: int, optional

Number of rows in the training data. Value range: >= 0.

If both training_percentage and training_size are specified, training_percentage takes precedence.

No default value.

testing_size: int, optional

Number of rows in the testing data. Value range: >= 0.

If both testing_percentage and testing_size are specified, testing_percentage takes precedence.

No default value.

validation_size: int, optional

Number of rows in the validation data. Value range: >= 0.

If both validation_percentage and validation_size are specified, validation_percentage takes precedence.

No default value.
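The precedence rule shared by the three percentage and size pairs can be illustrated with a small helper. This is a sketch under stated assumptions, not the library's code: resolve_row_count is a hypothetical name, and both the truncating rounding and the capping of a size at the number of available rows are guesses made for the example.

```python
def resolve_row_count(total_rows, percentage=None, size=None,
                      default_percentage=0.0):
    """Pick the row count for one subset from a percentage or a size."""
    if percentage is not None:
        # The percentage takes precedence whenever it is given.
        if not 0 <= percentage <= 1:
            raise ValueError("percentage must satisfy 0 <= value <= 1")
        # Assumption: fractional row counts are truncated.
        return int(total_rows * percentage)
    if size is not None:
        if size < 0:
            raise ValueError("size must be >= 0")
        # Assumption: a size larger than the dataset is capped.
        return min(size, total_rows)
    # Neither was given: fall back to the documented default percentage.
    return int(total_rows * default_percentage)


both_given = resolve_row_count(1000, percentage=0.5, size=300)
size_only = resolve_row_count(1000, size=300)
neither = resolve_row_count(1000, default_percentage=0.25)
```

With 1000 rows, passing both a percentage of 0.5 and a size of 300 yields 500 rows here, because the percentage wins; the size of 300 is used only when no percentage is supplied.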

Returns:

Returns three DataFrames: the training data, the testing data, and the validation data after partitioning.

Examples

To partition the input DataFrame df:

>>> train, test, valid = train_test_val_split(data=df)