About Preparing Data
This topic contains background information for data preparation during various workflow scenarios.
When you initially acquire raw data, it may need to be cleansed, and it may include a large number of rows and columns, many of which might not be relevant to your analysis or planning. Data preparation involves pruning, transforming, and formatting your raw data before it's used in charts and tables.
Data preparation involves tasks such as these:
- Fixing anomalies in the data (such as dates with inconsistent formatting), filling in empty cells, and so on.
- Filtering data, hiding columns, or splitting columns into two.
- Assigning dimension types to the data columns, or setting columns to be dimension attributes.
- Mapping incoming data to existing dimensions, or mapping incoming numerical data to measures.
- To improve performance, when an imported dataset is large, a random sample of the data is loaded first. You then apply any desired data transforms, and when you're satisfied with the data sample, you apply the transforms to the entire dataset.
- Depending on whether you're importing data into a story, a new model, or an existing model, some of the data preparation issues mentioned in this topic may not apply.
When importing data to a new model, columns containing text are typically identified as Dimensions, and numeric data may be identified as Measures. If the assigned dimension type isn't appropriate, you can change it to another type; for example, to a Date or Organization dimension. For details, see About Dimensions.
If your data contains columns of latitude and longitude coordinate data, or state and county data (for the USA only), you can create a Location dimension. Select the (Geo Enrichment) icon under Transformations, and choose either By Area Name or By Coordinates.
When creating locations using coordinate data, location dimensions can be based on a combination of latitude and longitude pairs. You also have the option to specify a column as an identifier for the location dimension, and another column as the location description.
For more information, see Creating a Model with Coordinate or Area Data for Geospatial Analysis.
When importing data into a new model, there is an upper limit applied to the number of unique members in a dimension, to maintain good performance. See System Requirements and Technical Prerequisites.
However, for non-planning-enabled models only, you can import dimensions with more than the maximum number of members, with these restrictions:
- In the Data Integration view, these dimensions cannot have any dimension attributes added to them, such as description or property.
- Once the dimensions are imported into Modeler, they will have only one ID column, and will be read-only.
- The dimensions can’t be used as exception aggregation dimensions or required dimensions.
- The dimensions can’t be referenced in formulas.
While you are preparing data in the Data Integration view you can create a calculated column based on input from another column and the application of a formula expression.
For listing of supported functions see Supported Functions for Calculated Columns.
Dimension attributes are information that is not suitable to be standalone dimensions.
For example, if you have a Customers column, and a Phone Numbers column, you could set Phone Numbers to be an attribute of the Customers dimension.
You can use dimension attributes to create charts, filters, calculations, input controls, linked analyses, and tables (with “import data” models only).
These are the available dimension attributes:
- Description: The column can be used for descriptive labels of the
dimension members when the member IDs (the unique identifiers for the dimension
members) are technical and not easily understandable.
For example, if your imported data contains a pair of related columns Product_Description and Product_ID, containing data descriptions and data identifiers, you can set Product_Description to be the Description attribute for the Product_ID dimension. Note that the Product_ID column would then need to contain unique identifiers for the dimension members.
Computer and equipment
- Property: The column represents information that is related to the dimension; for example, phone numbers.
- Parent-Child Hierarchy (Parent): The column is the parent of the
parent/child hierarchy pair.
For example, if your imported data contains the two columns Country and City, you can set the Country column to be the parent of the Country-City hierarchy.
The hierarchy column is a free-format text attribute where you can enter the ID value of the parent member. By maintaining parent-child relationships in this way, you can build up a data hierarchy that is used when viewing the data to accumulate high-level values that can be analyzed at lower levels of detail.
You can also create level-based hierarchies when your data is organized into levels, such as Product Category, Product Group, and Product. When the data is displayed in a story, hierarchies can be expanded or collapsed. In the toolbar, select (Level Based Hierarchy). For more information, see Creating a Level-Based Hierarchy.
For more information about hierarchies, see Hierarchies.
- Currency: If you set a column to be an Organization dimension, the Currency attribute is available. The Organization dimension offers an organizational analysis of the account data, based, for example, on geographic entities. You can add the Currency attribute to provide currency information for the geographic entities.
During data import, anomalies in your data can prevent the data from being imported properly, or prevent the model from being created. If any of the following issues are found in your data, those data cells are highlighted, and messages in the Details panel explain the issues. You'll need to resolve these issues before the data can be fully imported or the model can be created.
- In account dimensions only, dimension member IDs cannot contain these restricted characters , ; : ' [ ] =.
- Numeric data cells in measures cannot contain non-numeric characters.
- When importing data to an existing model, all dimensions of the existing model need to be mapped.
- When importing data to an existing model, cells in a column that is mapped to an existing dimension must match the existing dimension members. Unmatched values will result in those rows being omitted from the model.
- For stories, if any member IDs are empty, you can type values in those cells, or select Delete empty rows in the Details panel to remove those rows.
- When creating a new model, if member IDs are empty, they are automatically filled with the “#” value if you select the Fill applicable empty ID cells with the "#" value option. Otherwise, those rows are omitted from the model.
- In dimensions and properties, a single member ID cannot correspond to different
Descriptions in multiple rows (but a single Description can correspond to
multiple member IDs).
For example, if member IDs are employee numbers, and Descriptions are employee names, it's allowable to have more than one employee with the same name, but not allowable to have more than one employee with the same member ID.
- In a Date dimension column, cell values must match the format specified in the
Details panel. The following date formats are
- mmm yyyy
- mmm: JAN/Jan/jan
- mmmm: JANUARY/January/january
- q: 1/2/3/4
- qq: 01/02/03/04
- Latitude and longitude columns, from which location dimensions are created, must contain values within the valid latitude and longitude ranges.
- For planning-enabled models, in a hierarchy measure, non-leaf-node members are
not allowed.NoteFor analytic (non-planning-enabled) models only, non-leaf node members are allowed, but be aware of the effects of this behavior. Example: in an organizational chart that includes employee salaries, the Manager has her individual salary, and her staff members have their own salaries as well. In a visualization, do you expect the Manager’s data point to reflect her individual salary, or the sum of her staff members' salaries?
- Nodes in a hierarchy cannot be circular. For example, this hierarchy arrangement is not allowed: USA > California > San Francisco > California.
At least one measure must be defined for the model.
If the source data doesn't contain a column that can be used directly as a measure, you can create a measure column based on a count of an existing column that contains 100% unique values. The generated measure column will contain the value 1 for all rows, representing the count of the unique values.
If you proceed to create the model without having created a measure column, a measure column will be generated automatically, based on the first column that contains 100% unique values.
To create the measure column yourself, select a column with 100% unique values, and then select Create Count in the Details panel.
If there isn't an existing column that contains 100% unique values, a measure column will be generated anyway when the model is created, containing the value 0 for all rows.
- There can be only one of each of these dimension types: