Partition

You can configure properties for the Partition component in HANA and non-HANA scenarios.

Syntax

The Partition component randomly partitions an input dataset into three subsets called Train, Test, and Validate. The proportion of each subset is defined as a parameter. The union of the three subsets need not be the complete initial dataset, so some rows may be left out of all three subsets.
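The random split described above can be illustrated with a minimal Python sketch. It is not the component's actual implementation: the function name random_partition, its parameters, and the rounding-down of subset sizes are assumptions made for illustration only.

```python
import random


def random_partition(rows, train=0.6, test=0.2, validate=0.1, seed=42):
    """Randomly split rows into Train, Test, and Validate subsets.

    Hypothetical helper: proportions are fractions of the input and may
    sum to less than 1, in which case some rows end up in no subset.
    """
    if min(train, test, validate) < 0 or train + test + validate > 1.0 + 1e-9:
        raise ValueError("proportions must be non-negative and sum to at most 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable

    n = len(shuffled)
    # Round down (with a small tolerance for floating-point error).
    n_train = int(n * train + 1e-9)
    n_test = int(n * test + 1e-9)
    n_validate = int(n * validate + 1e-9)

    train_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    validate_set = shuffled[n_train + n_test:n_train + n_test + n_validate]
    return train_set, test_set, validate_set


train_rows, test_rows, validate_rows = random_partition(range(100))
print(len(train_rows), len(test_rows), len(validate_rows))  # 60 20 10
```

With the proportions 0.6, 0.2, and 0.1, ten of the hundred input rows belong to none of the three subsets, which matches the statement that the union need not be the complete dataset.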

You can partition the dataset using the following partition methods:
  • Random Partition, which randomly divides all the data.
  • Stratified Partition, which divides each sub-category randomly.

For Stratified Partition, the dataset needs to have at least one categorical attribute (for example, of type varchar). The initial dataset is subdivided according to the distinct categorical values of this attribute. Each of the resulting mutually exclusive subsets is then randomly split to obtain the Train, Test, and Validate subsets. This ensures that all categorical values, or "strata", are present in the sampled subsets.
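A stratified split can be sketched in Python as follows. The helper stratified_partition, the stratum_of callback, and the rounding-down of per-stratum sizes are assumptions for illustration, not the component's implementation. Note that with rounding down, a very small stratum can contribute zero rows to a subset, so the guarantee that every stratum appears holds in this sketch only when each stratum is large enough.

```python
import random
from collections import defaultdict


def stratified_partition(rows, stratum_of, train=0.6, test=0.2,
                         validate=0.2, seed=7):
    """Split each stratum randomly, then merge the per-stratum pieces.

    Hypothetical helper: stratum_of(row) returns the categorical value
    that decides which stratum a row belongs to.
    """
    rng = random.Random(seed)

    # Group rows into mutually exclusive strata by categorical value.
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)

    subsets = {"Train": [], "Test": [], "Validate": []}
    for members in strata.values():
        rng.shuffle(members)
        n = len(members)
        n_train = int(n * train + 1e-9)
        n_test = int(n * test + 1e-9)
        n_validate = int(n * validate + 1e-9)
        subsets["Train"] += members[:n_train]
        subsets["Test"] += members[n_train:n_train + n_test]
        subsets["Validate"] += members[n_train + n_test:
                                       n_train + n_test + n_validate]
    return subsets


# 50 rows of category "a" and 10 rows of category "b".
data = [("a", i) for i in range(50)] + [("b", i) for i in range(10)]
parts = stratified_partition(data, stratum_of=lambda row: row[0])
print({name: len(part) for name, part in parts.items()})
# {'Train': 36, 'Test': 12, 'Validate': 12}
```

Each subset keeps the 5:1 ratio between the two categories, which a purely random split would only approximate.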

Note that when comparing two or more algorithms in the model comparison chain, the Partition component is mandatory.

Partition Properties
Table 1: Data Preparation Component Properties
Property: Description
Partition Method: Select the method for partitioning data into train, test, and validation sets.
  • Random
  • Stratified
Random Seed: Enter the number to use as the seed for the random partitioning.
Partition Rows by: Select the method for partitioning rows.
  • Percentage of Rows
  • Number of Rows
Train Set: Enter the number of rows or percentage of rows for the train set.
Test Set: Enter the number of rows or percentage of rows for the test set.
Validation Set: Enter the number of rows or percentage of rows for the validation set.
Partition Column Name: Enter a name for the new column that contains partitioned values.
Number of Threads: Enter the number of threads the algorithm should use for execution.
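To show how these properties interact, the following Python sketch marks each row with its subset in a new column. Everything in it is assumed for illustration: the function add_partition_column, the default column name PartitionType, the string labels "Train", "Test", and "Validate", and the use of None for rows that fall in no subset. The actual values the component writes to the partition column are not stated here.

```python
import random


def add_partition_column(rows, train, test, validate, by="percentage",
                         column="PartitionType", seed=1):
    """Return copies of rows (dicts) with an added partition column.

    Hypothetical helper: by="percentage" reads the sizes as percentages
    of the row count, by="rows" reads them as absolute row counts.
    """
    n = len(rows)
    if by == "percentage":
        counts = [int(n * p / 100 + 1e-9) for p in (train, test, validate)]
    elif by == "rows":
        counts = [train, test, validate]
    else:
        raise ValueError("by must be 'percentage' or 'rows'")
    if min(counts) < 0 or sum(counts) > n:
        raise ValueError("subset sizes must be non-negative and fit the dataset")

    # Shuffle row positions with the given seed, then hand out labels.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    labels = [None] * n  # None marks rows assigned to no subset
    start = 0
    for name, count in zip(("Train", "Test", "Validate"), counts):
        for idx in order[start:start + count]:
            labels[idx] = name
        start += count

    return [{**row, column: label} for row, label in zip(rows, labels)]


table = [{"id": i} for i in range(10)]
for row in add_partition_column(table, 5, 3, 1, by="rows"):
    print(row)
```

Here five rows are labeled Train, three Test, one Validate, and one row stays unassigned because the requested counts sum to nine.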