About Preparing Data

This topic contains background information for data preparation during various workflow scenarios.

When you initially acquire raw data, it may need to be cleansed, and it may include a large number of rows and columns, many of which might not be relevant to your analysis or planning. Data preparation involves pruning, transforming, and formatting your raw data before it's used in charts and tables.

Data preparation involves tasks such as these:

  • Fixing anomalies in the data (such as dates with inconsistent formatting), filling in empty cells, and so on.
  • Filtering data, hiding columns, or splitting columns into two.
  • Converting data to number, text, or date format.
  • Assigning dimension types to the data columns, or setting columns to be dimension attributes.
  • Mapping incoming data to existing dimensions, or mapping incoming numerical data to measures.
Note
  • To improve performance, when an imported dataset is large, a random sample of the data is loaded first. You then apply any desired data transforms, and when you're satisfied with the data sample, you apply the transforms to the entire dataset.
  • Depending on whether you're importing data into a story, a new model, or an existing model, some of the data preparation issues mentioned in this topic may not apply.
Choosing Dimension Types

When importing data to a new model, columns containing text are typically identified as Dimensions, and numeric data may be identified as Measures. If the assigned dimension type isn't appropriate, you can change it to another type; for example, to a Time or Organization dimension. For details, see About Dimensions.

For stories, you can select Enable date drill down in the Details panel to enhance a Date dimension, so that it can contain a drillable hierarchy (for example, years, months, and days). Note: you can have multiple Date dimensions, but only one of them can be enhanced.

Location Dimensions

If your data contains columns of latitude and longitude data, or state and county data (for the USA only), you can create a Location dimension.

In a column header, select Start of the navigation path Next navigation step Create LocationEnd of the navigation path, and then select By Coordinate (if you want to use latitude and longitude data to create the location dimension) or By Region (if you want to use state and county data to create the location dimension). Then choose the appropriate columns for the location information.

For more information, see Creating a Model with Coordinate or Area Data for Geospatial Analysis.

Maximum Number of Dimension Members

When importing data into a new model, there is an upper limit applied to the number of unique members in a dimension, to maintain good performance. See System Requirements and Technical Prerequisites.

However, for non-planning-enabled models only, you can import dimensions with more than the maximum number of members, with these restrictions:

  • In the Data Integration view, these dimensions cannot have any dimension attributes added to them, such as description or property.
  • Once the dimensions are imported into Modeler, they will have only one ID column, and will be read-only.
  • The dimensions can’t be used as exception aggregation dimensions or required dimensions.
  • The dimensions can’t be referenced in formulas.
Dimension Attributes

Dimension attributes are information that is not suitable to be standalone dimensions.

All dimensions have, at minimum, both Description and ID attributes. Typically, for each dimension, the imported data column is used as the dimension's Description attribute. An ID column is then automatically generated for each dimension, to contain the unique IDs of the rows. However, if your imported data contains a column that you want to use as the unique IDs of the rows, you specify one of the other columns to be the Description attribute for that dimension.

In addition to the mandatory Description and ID attributes, other columns in your imported data can be defined as dimension attributes.

For example, if you have a Customers column, and a Phone Numbers column, you could set Phone Numbers to be an attribute of the Customers dimension.

You can use dimension attributes to create charts, filters, calculations, input controls, linked analyses, and tables (with “import data” models only).

These are the available dimension attributes:

  • ID: The ID dimension attribute represents the unique row identifiers for the dimension. The ID attribute is typically generated automatically, but you can select a column to be the ID attribute.

    Note that the ID attribute is either generated automatically, or defined when you add a Description attribute. You can't select the ID attribute from the attribute list in the Details panel.

  • Description: The column represents the main descriptions of the dimension members.

    For example, if your imported data contains a pair of related columns Product_Description and Product_ID, containing data descriptions and data identifiers, you can set Product_Description to be the Description attribute for the Product_ID dimension. Note that the Product_ID column would then need to contain unique row identifiers for the dimension.

    ID

    Description

    X123

    Computer and equipment

    Y456

    Books

  • Property: The column represents information that is related to the dimension; for example, phone numbers.
  • Parent-Child Hierarchy (Parent): The column is the parent of the parent/child hierarchy pair.

    For example, if your imported data contains the two columns Country and City, you can set the Country column to be the parent of the Country-City hierarchy.

    The hierarchy column is a free-format text attribute where you can enter the ID value of the parent member. By maintaining parent-child relationships in this way, you can build up a data hierarchy that is used when viewing the data to accumulate high-level values that can be analyzed at lower levels of detail.

    For more information about hierarchies, see Creating a Hierarchy.

  • Currency: If you set a column to be an Organization dimension, the Currency attribute is available. The Organization dimension offers an organizational analysis of the account data, based, for example, on geographic entities. You can add the Currency attribute to provide currency information for the geographic entities.
Data Quality Considerations

During data import, anomalies in your data can prevent the data from being imported properly, or prevent the model from being created. If any of the following issues are found in your data, those data cells are highlighted, and messages in the Details panel explain the issues. You'll need to resolve these issues before the data can be fully imported or the model can be created.

  • Dimension member IDs cannot contain these restricted characters , ; : ' [ ] =.
  • Numeric data cells in measures cannot contain non-numeric characters.
  • When importing data to an existing model, all dimensions of the existing model need to be mapped.
  • When importing data to an existing model, cells in a column that is mapped to an existing dimension must match the existing dimension members. Unmatched values will result in those rows being omitted from the model.
  • For stories, if any member IDs are empty, you can type values in those cells, or select Delete empty rows in the Details panel to remove those rows.
  • When creating a new model, if member IDs are empty, they are automatically filled with default values if you select the Fill applicable empty ID cells with a default value option. Otherwise, those rows are omitted from the model.
  • In dimensions and properties, a single member ID cannot correspond to different Descriptions in multiple rows (but a single Description can correspond to multiple member IDs).

    For example, if member IDs are employee numbers, and Descriptions are employee names, it's allowable to have more than one employee with the same name, but not allowable to have more than one employee with the same member ID.

  • In a Time dimension column, cell values must match the time format specified in the Details panel. The following time formats are supported:
    • dd-mm-yy
    • dd.mm.yy
    • dd/mm/yy
    • dd-mm-yyyy
    • dd.mm.yyyy
    • dd/mm/yyyy
    • dd-mmm-yyyy
    • dd-mmmm-yyyy
    • mm-dd-yy
    • mm.dd.yy
    • mm/dd/yy
    • mm-dd-yyyy
    • mm.dd.yyyy
    • mm/dd/yyyy
    • mm.yyyy
    • mmm yyyy
    • yy-mm-dd
    • yy.mm.dd
    • yy/mm/dd
    • yyyq
    • yyyy
    • yyyymm
    • yyyy-mm
    • yyyy.mm
    • yyyy/mm
    • yyyymmdd
    • yyyy-mm-dd
    • yyyy-mm/dd
    • yyyy.mm.dd
    • yyyy/mm-dd
    • yyyy/mm/dd
    • yyyy.mmm
    • yyyyqq

    Examples:

    • mmm: JAN/Jan/jan
    • mmmm: JANUARY/January/january
    • q: 1/2/3/4
    • qq: 01/02/03/04
  • Latitude and longitude columns, from which location dimensions are created, must contain values within the valid latitude and longitude ranges.
  • For planning-enabled models, in a hierarchy measure, non-leaf-node members are not allowed.
    Note
    For analytic (non-planning-enabled) models only, non-leaf node members are allowed, but be aware of the effects of this behavior. Example: in an organizational chart that includes employee salaries, the Manager has her individual salary, and her staff members have their own salaries as well. In a visualization, do you expect the Manager’s data point to reflect her individual salary, or the sum of her staff members' salaries?
  • Nodes in a hierarchy cannot be circular. For example, this hierarchy arrangement is not allowed: USA > California > San Francisco > California.
Mapping Considerations
  • At least one measure must be defined for the model.

    If the source data doesn't contain a column that can be used directly as a measure, you can create a measure column based on a count of an existing column that contains 100% unique values. The generated measure column will contain the value 1 for all rows, representing the count of the unique values.

    If you proceed to create the model without having created a measure column, a measure column will be generated automatically, based on the first column that contains 100% unique values.

    To create the measure column yourself, select a column with 100% unique values, and then select Create Count in the Details panel.

    If there isn't an existing column that contains 100% unique values, a measure column will be generated anyway when the model is created, containing the value 0 for all rows.

  • There can be only one of each of these dimension types:
    • Time
    • Account
    • Organization
    • Version