Datasets for Smart Predict

Datasets can be used as data sources for Smart Predict. However, they must have a certain structure and contain some mandatory information, depending on the type of predictive scenario you are creating and where you are in the modeling process.

What are datasets?

A dataset is a collection of data that is usually presented in a table. Each row represents an observation (which is the object of your interest), and each column represents information corresponding to this observation. One of the columns represents the target variable.
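As an illustration only, the table structure described above can be sketched in Python. The sample data, the column names, and the helper split_features_and_target are hypothetical and are not part of Smart Predict; the sketch just shows rows as observations, columns as information about them, and one column acting as the target variable.

```python
import csv
import io

# Hypothetical sample data: each row is one customer (an observation),
# each column is a piece of information about that customer,
# and "churned" plays the role of the target variable.
RAW = """customer_id,age,monthly_spend,churned
C001,34,52.5,no
C002,51,18.0,yes
C003,27,73.2,no
"""


def split_features_and_target(text, target):
    """Return (feature_rows, target_values) parsed from CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # The target column is separated from the remaining columns.
    targets = [row[target] for row in rows]
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    return features, targets


features, targets = split_features_and_target(RAW, "churned")
```

Here features holds three observations without the target column, and targets holds the corresponding target values.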

Depending on the nature of the data contained in the dataset, you will be able to leverage it to create a certain type of predictive model for your specific need.

The graphic below summarizes which dataset is used depending on the step of the predictive process:

Sizing restrictions apply to acquired datasets. Refer to the related link below for more information.
You can also work with live datasets. For more information, refer to Creating a Live Dataset.
Input Datasets

Input datasets are stored in the Files section of SAP Analytics Cloud. You create an input dataset using Main Menu > Create > Dataset.

In SAP Analytics Cloud, you can use one of the following types of input datasets:
  • Acquired: Data is imported (copied) and stored in SAP Analytics Cloud. Acquired datasets have already been prepared on your computer (supported formats are .TXT, .CSV, and .XLSX).
  • Live: Data is stored in the source system. It isn't copied to SAP Analytics Cloud, so any changes in the source data are available immediately, provided that no structural changes are made to the table or SQL view. You can connect to live data and create a live dataset.
An input dataset is used to train the predictive model (training dataset) or is used to apply the predictive model (application dataset).
The input datasets used to train and apply a predictive model must come from the same data source location. You can't apply a predictive model on a live dataset if it was trained with an acquired dataset, nor can you apply a predictive model on an acquired dataset if it was trained using a live one. However, you can have several predictive models trained and applied with live and acquired datasets in the same predictive scenario.
When using live datasets, both the training dataset and the application dataset must come from the same SAP HANA system: you cannot train a predictive model on a live dataset with data from SAP HANA system 1 and then apply this predictive model to a live dataset with data from SAP HANA system 2.
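The two compatibility rules above can be expressed as a small check. This is a minimal sketch, not an SAP API: the DatasetRef class, its fields, and the can_apply function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical description of an input dataset (not an SAP object)."""
    name: str
    kind: str                          # "acquired" or "live"
    hana_system: Optional[str] = None  # only meaningful for live datasets


def can_apply(training: DatasetRef, application: DatasetRef) -> bool:
    """Return True if a model trained on `training` may be applied to `application`."""
    # Rule 1: both datasets must be of the same type (acquired or live).
    if training.kind != application.kind:
        return False
    # Rule 2: live datasets must come from the same SAP HANA system.
    if training.kind == "live":
        return training.hana_system == application.hana_system
    return True
```

For example, can_apply(DatasetRef("train", "live", "hana1"), DatasetRef("apply", "live", "hana2")) returns False, because the two live datasets come from different systems.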
For live datasets, any data changes you make to your tables and SQL views in your SAP HANA on-premise system appear immediately in the live datasets. However, to update your predictive model, you need to retrain it.
To create a predictive model, you must have a training dataset available that contains actual data observed in the past.
Training datasets for time series predictive models may also contain data for the future.

For example, if you have included additional variables in your data model to refine the forecasts, the values for these variables should be filled in for the forecast period as well.

Then, you apply the predictive model to an application dataset.

In the case of a time series predictive model, you use the same dataset for the training and application steps.

For acquired datasets: Your input dataset (training or application dataset) must not contain more than 1,000 columns. When you apply the predictive model to an application dataset, Smart Predict generates additional columns. The application process can be blocked if your application dataset already contains many columns and the generated columns would push it over the 1,000-column limit.
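A quick way to reason about this limit before applying a model is sketched below. The function name and the idea of estimating the number of generated columns up front are assumptions made for illustration; the actual number of columns Smart Predict adds depends on the outputs you request.

```python
MAX_COLUMNS = 1000  # column limit for acquired datasets, as stated above


def fits_column_limit(existing_columns: int, generated_columns: int) -> bool:
    """Return True if the application dataset stays within the column limit
    after the (estimated) generated columns are added."""
    return existing_columns + generated_columns <= MAX_COLUMNS
```

For instance, fits_column_limit(995, 10) returns False: an application dataset with 995 columns leaves room for only 5 generated columns.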
Generated Datasets

When you click the Apply button to get your predictions, a dataset containing your predictions is generated. You can choose the folder in which to save this dataset. By default, generated datasets are saved in Main Menu > Browse > Files.

When a dataset already exists with the same name as the dataset you are saving, then the following rules apply:
  • If both datasets have identical variables, the new dataset will automatically replace the existing one.
  • If the datasets are different, you receive an Apply Failed message. To continue, save your dataset under a different name.
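The naming rules above amount to a three-way decision, sketched here. The function resolve_save and its return values are hypothetical, and comparing variables as unordered sets of names is an assumption: the exact comparison Smart Predict performs is not described here.

```python
from typing import Iterable, Optional


def resolve_save(existing_variables: Optional[Iterable[str]],
                 new_variables: Iterable[str]) -> str:
    """Decide what happens when a generated dataset is saved under a given name.

    existing_variables is None when no dataset with that name exists yet.
    """
    if existing_variables is None:
        return "create"
    # Assumption: "identical variables" means the same set of variable names.
    if set(existing_variables) == set(new_variables):
        return "replace"
    # Different variables: the apply fails until another name is chosen.
    return "apply_failed"
```

With this sketch, resolve_save(["id", "score"], ["id", "score", "decision"]) returns "apply_failed", which corresponds to the Apply Failed message.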

The generated dataset contains the predictions and any additional columns you have requested.

You can then use this generated dataset to create a story or an SAP Analytics Cloud model. However, if you want updates to your generated dataset to be reflected, SAP recommends using it in a story: if you reapply your predictive model and overwrite the generated dataset with an updated one, the story is updated. For example, if you have added rows to your application dataset, the generated predictions for these new rows are added to the story. If you instead use the generated dataset in an SAP Analytics Cloud model, note that the SAP Analytics Cloud model won't be updated.