Understanding Predictive Goal and Training Roles for Variables

A variable corresponds to a column in a dataset or a dimension in a planning model. The observations relating to each variable correspond to the rows. Variables that have been specified as a target or signal, or as an entity identifier, are not considered influencers. Unless you exclude them, all other variables are treated as influencers. After training, only the most significant influencers are retained in the predictive model reports for debriefing.

To build a predictive model, you define the following variable roles:
Role Description Example
Target or Signal The variable that you want to explain, or predict the values for.
Example
  • Classification predictive scenario:

    You want to predict whether a customer will respond to your mailing. Your training data source, which holds the customer information, contains the target <responded to my mailing>. This target can take the values <Yes> or <No>. If <Yes> is the least frequent value, the application considers it to be the targeted value for the target.

  • Regression predictive scenario:

    You want to predict the number of complaints that your customer support will receive this week. Your target is <Number of customer complaints> and it will take <numerical> values.

  • Time series predictive scenario:

    You want to forecast the product sales for the next 6 months. <Product sales> is your signal.
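
The rule described in the classification scenario above (the least frequent value becomes the targeted value) can be illustrated with a minimal sketch. The function name `targeted_value` and the sample data are hypothetical, and the tie-breaking behavior is an assumption, since the text does not say what happens when both values are equally frequent.

```python
from collections import Counter


def targeted_value(target_values):
    """Return the least frequent category of a binary target column."""
    counts = Counter(target_values)
    if len(counts) != 2:
        raise ValueError("expected a binary target with exactly two categories")
    # Pick the category with the smallest count.
    # Assumption: on a tie, the category seen first in the data wins.
    value, _ = min(counts.items(), key=lambda item: item[1])
    return value


# Made-up sample of a <responded to my mailing> column.
responses = ["No", "No", "Yes", "No", "No", "Yes", "No"]
print(targeted_value(responses))  # prints: Yes
```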

Date The variable used for the date values.
Note
This variable is mandatory for a time series predictive scenario.
The date formats that should be used in your dataset are the following:
  • YYYY-MM-DD
  • YYYY/MM/DD
  • YYYY/MM-DD
  • YYYY-MM/DD
  • YYYYMMDD
  • YYYY-MM-DD hh:mm:ss

Here, YYYY stands for the year, MM for the month, DD for the day of the month, hh for the hour, mm for the minutes, and ss for the seconds.
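
Before loading a dataset, it can help to check that its date column matches one of the formats listed above. The following is a minimal sketch using Python's standard `datetime.strptime`; the names `ACCEPTED_FORMATS`, `parse_date`, and `is_supported` are hypothetical, and mapping hh to a 24-hour clock is an assumption. Note that `strptime` is slightly more lenient than the list (it also accepts months and days without a leading zero), so this is a pre-check and not a reproduction of how the application validates dates.

```python
from datetime import datetime

# strptime patterns mirroring the date formats listed above.
# Assumption: "hh" is a 24-hour value, hence %H.
ACCEPTED_FORMATS = [
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y/%m-%d",
    "%Y-%m/%d",
    "%Y%m%d",
    "%Y-%m-%d %H:%M:%S",
]


def parse_date(text):
    """Parse text with the first matching format, or raise ValueError."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")


def is_supported(text):
    """Return True if text matches one of the accepted formats."""
    try:
        parse_date(text)
    except ValueError:
        return False
    return True


print(parse_date("2024/03-05"))      # prints: 2024-03-05 00:00:00
print(is_supported("05/03/2024"))    # prints: False
```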

Note
For example, if you use the YYYY-MM-DD date format, you can create time series predictive scenarios with the following date granularities:
  • Year expressed as YYYY-01-01 where YYYY is variable (moving year).
  • Month expressed as YYYY-MM-01 where YYYY-MM is variable (moving month).
  • Week expressed as YYYY-MM-DD, where DD is, for instance, the first day of the week (moving week).
  • Day (calendar dates) expressed as YYYY-MM-DD where YYYY-MM-DD is variable (moving day).
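
The granularities above amount to snapping each date to the start of its period. A minimal sketch of that normalization follows; the function name `to_granularity` is hypothetical, and treating Monday as the first day of the week is an assumption (the text only says "for instance the 1st day of the week").

```python
from datetime import date, timedelta


def to_granularity(day, granularity):
    """Snap a date to the first day of its year, month, or week."""
    if granularity == "year":
        return day.replace(month=1, day=1)      # YYYY-01-01
    if granularity == "month":
        return day.replace(day=1)               # YYYY-MM-01
    if granularity == "week":
        # Assumption: weeks start on Monday (weekday() == 0).
        return day - timedelta(days=day.weekday())
    if granularity == "day":
        return day
    raise ValueError(f"unknown granularity: {granularity!r}")


sample = date(2024, 3, 14)
print(to_granularity(sample, "year").isoformat())   # prints: 2024-01-01
print(to_granularity(sample, "month").isoformat())  # prints: 2024-03-01
print(to_granularity(sample, "week").isoformat())   # prints: 2024-03-11
```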
Entity Optionally used in a time series predictive scenario. It’s the identifier variable used to split the predictive model by entity: each entity gets its own predictive model, so you get distinct predictions for each entity.

The predictive model can then capture behaviors that are specific to a given entity, and so produce more accurate predictions.

The entity can be a dimension in the data, for example Region, Store, or Product Family.

Example
You want to forecast the energy consumption by industry sector for the next 6 months. Your signal value is <Energy consumption> and your entity is <Industry sector>. You will get predictions and performance indicators for each industry sector: commercial, industrial, residential, transportation.
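
Conceptually, using an entity means partitioning the rows into one time series per entity value before a model is fitted to each. The sketch below shows only that partitioning step, not the application's internals; the function `split_by_entity`, the column names, and the consumption figures are all made up for illustration.

```python
from collections import defaultdict


def split_by_entity(rows, entity, date, signal):
    """Group rows into one date-ordered (date, signal) series per entity."""
    series = defaultdict(list)
    for row in rows:
        series[row[entity]].append((row[date], row[signal]))
    # ISO-formatted date strings sort chronologically.
    return {name: sorted(points) for name, points in series.items()}


# Hypothetical rows; the values are illustrative only.
rows = [
    {"sector": "residential", "month": "2024-02-01", "consumption": 110.0},
    {"sector": "industrial", "month": "2024-01-01", "consumption": 300.0},
    {"sector": "residential", "month": "2024-01-01", "consumption": 120.0},
]

per_sector = split_by_entity(rows, "sector", "month", "consumption")
for sector, points in per_sector.items():
    print(sector, points)
```

Each resulting series would then be forecast on its own, which is why predictions and performance indicators are reported per entity.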
Influencer

Influencers are variables that describe your data and help explain a target. Unless excluded, all variables that aren't already selected as a target or signal, or as an entity identifier, are considered influencers. Only the most significant ones are retained after training for debriefing.

When you create a predictive model, you can exclude influencers from the training process. Excluded influencers are not taken into account to compute the predictive model, not included in the statistics for the predictive model, not retrieved from the data source, and not needed when you apply the predictive model to an application data source.
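
The role rule described above (every variable that is not the target or signal, not the entity, and not explicitly excluded is an influencer) can be sketched as a simple filter. The function `select_influencers` and the column names are hypothetical.

```python
def select_influencers(columns, target, entity=None, excluded=()):
    """Return the columns treated as influencers, in their original order."""
    reserved = {target, *excluded}
    if entity is not None:
        reserved.add(entity)
    unknown = reserved - set(columns)
    if unknown:
        raise ValueError(f"not in the dataset: {sorted(unknown)}")
    return [name for name in columns if name not in reserved]


# Hypothetical dataset columns.
columns = ["account_number", "age", "region", "billing_amount", "has_bought"]

print(select_influencers(
    columns,
    target="has_bought",
    excluded=["account_number", "billing_amount"],
))  # prints: ['age', 'region']
```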

Remember
You should exclude influencers that are directly related to the target, especially variables that indirectly contain the target variable. Statisticians call these variables "leakers" or "leak variables". Keeping them produces an incorrect predictive model with misleading performance indicators, and the model is unable to produce usable predictions.
Example
If a predictive model has the target variable <has bought the product Yes/No>, you should exclude the influencer <Billing amount> if it contains the cost for the product.
Tip
If a single variable has a very high influence on the prediction, there is a chance that it is a leak variable.
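
One way to act on this tip is to scan the influencer contributions reported after training and flag any that dominate. The sketch below is a heuristic only: the function `suspected_leaks`, the 0.8 threshold, and the contribution figures are assumptions for illustration, not values defined by the application. A flagged variable still needs a manual review.

```python
def suspected_leaks(contributions, threshold=0.8):
    """Flag influencers whose share of the total contribution is very high."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive value")
    return sorted(
        name for name, value in contributions.items()
        if value / total >= threshold
    )


# Hypothetical contributions, as might be read from a model report.
contributions = {"billing_amount": 90, "age": 6, "region": 4}
print(suspected_leaks(contributions))  # prints: ['billing_amount']
```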

Excluding influencers that have no influence on the targets (for example <account number>) can help speed up the training process.

Example

Your company is marketing two products A and B.

You have a database, which contains references to:
  • 1,500 of your customers. You know which product, A or B, each customer has purchased.
  • 10,000 prospects. You want to know which product each prospect is likely to purchase.
The variables <name>, <age>, <address>, and <socio-occupational class> are your influencers: they allow you to generate a predictive model capable of explaining and predicting the value of the target <product purchased>.
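
In this example, the customers with a known purchase form the training data, and the prospects form the data the predictive model is later applied to. A minimal sketch of that split follows, assuming both groups sit in one table and that a missing target value marks a prospect. The function name, column names, and rows are hypothetical.

```python
def split_training_and_application(rows, target):
    """Separate rows with a known target from rows still to be predicted."""
    training = [row for row in rows if row.get(target) is not None]
    application = [row for row in rows if row.get(target) is None]
    return training, application


# Hypothetical rows: customers have a known product, prospects do not.
people = [
    {"name": "Customer 1", "age": 41, "product_purchased": "A"},
    {"name": "Customer 2", "age": 29, "product_purchased": "B"},
    {"name": "Prospect 1", "age": 35, "product_purchased": None},
]

training, application = split_training_and_application(people, "product_purchased")
print(len(training), len(application))  # prints: 2 1
```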