train_test_val_split
- hana_ml.algorithms.pal.partition.train_test_val_split(data, id_column=None, random_seed=None, thread_ratio=None, partition_method='random', stratified_column=None, training_percentage=None, testing_percentage=None, validation_percentage=None, training_size=None, testing_size=None, validation_size=None)
The algorithm randomly partitions an input dataset into three disjoint subsets: training, testing, and validation. Note that the union of these three subsets might not cover the complete initial dataset.
Please also note that the dataset must have an ID column. The ID column can be specified explicitly; otherwise, the first column of the DataFrame is assumed to hold the ID.
Two partition methods are available:
- Random partition, which randomly divides all the data.
- Stratified partition, which randomly divides each subpopulation.
In the stratified case, the dataset needs at least one categorical attribute (for example, of type VARCHAR). The initial dataset is first subdivided according to the distinct values of this attribute. Each of these mutually exclusive subsets is then randomly split to obtain the training, testing, and validation subsets. This ensures that all categorical values, or "strata", are represented in the resulting partition.
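As an illustration, here is a minimal sketch of a stratified split, assuming df is a hana_ml DataFrame and 'GENDER' is a hypothetical categorical column used for stratification:
>>> train_df, test_df, valid_df = train_test_val_split(
...     data=df,
...     partition_method='stratified',
...     stratified_column='GENDER',   # hypothetical column name
...     training_percentage=0.8,
...     testing_percentage=0.1,
...     validation_percentage=0.1)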
- Parameters:
- data : DataFrame
DataFrame to be partitioned.
- id_column : str, optional
Indicates which column to use as the ID column.
Defaults to the first column.
- random_seed : int, optional
Indicates the seed used to initialize the random number generator.
0: Uses the system time.
Not 0: Uses the specified seed.
Defaults to 0.
- thread_ratio : float, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside the range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 0.
- partition_method : {'random', 'stratified'}, optional
Partition method:
'random': random partitions.
'stratified': stratified partition.
Defaults to 'random'.
- stratified_column : str, optional
Indicates which column is used for stratification.
Valid only when partition_method is set to 'stratified' (stratified partition).
No default value.
- training_percentage : float, optional
The percentage of training data.
Value range: 0 <= value <= 1.
Defaults to 0.8.
- testing_percentage : float, optional
The percentage of testing data.
Value range: 0 <= value <= 1.
Defaults to 0.1.
- validation_percentage : float, optional
The percentage of validation data.
Value range: 0 <= value <= 1.
Defaults to 0.1.
- training_size : int, optional
Row size of training data. Value range: >=0.
If both training_percentage and training_size are specified, training_percentage takes precedence.
No default value.
- testing_size : int, optional
Row size of testing data. Value range: >=0.
If both testing_percentage and testing_size are specified, testing_percentage takes precedence.
No default value.
- validation_size : int, optional
Row size of validation data. Value range: >=0.
If both validation_percentage and validation_size are specified, validation_percentage takes precedence.
No default value.
- Returns:
- A tuple of three DataFrames: the training data, the testing data, and the validation data.
Examples
To partition the input DataFrame df:
>>> from hana_ml.algorithms.pal.partition import train_test_val_split
>>> train_df, test_df, valid_df = train_test_val_split(data=df, training_percentage=0.7, testing_percentage=0.2, validation_percentage=0.1)
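The split can also be specified by absolute row counts instead of percentages; a minimal sketch, assuming df holds enough rows to satisfy the requested sizes (recall that the percentage parameters take precedence if both are given):
>>> train_df, test_df, valid_df = train_test_val_split(
...     data=df,
...     training_size=700,
...     testing_size=200,
...     validation_size=100)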