Create a Data Set
A data set is a generic but logical data reference consisting of a name, type, URL, and a logical destination name. It is an abstraction of the actual data or structures that reside in distributed stores.
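As a rough mental model, the four attributes named above can be pictured as a small record. The sketch below is illustrative only: the class `DataSetReference`, its field names, and the sample values are assumptions made for this example and are not part of the SAP Data Hub tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetReference:
    """Hypothetical record mirroring the four attributes of a data set."""

    name: str         # name of the data set
    object_type: str  # kind of referenced object, for example "file" or "table"
    url: str          # location of the referenced data inside the store
    destination: str  # logical destination name, not a physical system address


# Sample values are invented for illustration.
sales = DataSetReference(
    name="SALES_2018",
    object_type="file",
    url="/data/sales/2018.csv",
    destination="MY_HDFS_DESTINATION",
)
print(sales)
```

The record holds only a reference; the actual data stays in the distributed store that the destination points to.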
Prerequisites
- You have created a project.
- You have created and activated at least one destination in the same project in which you are creating the data set.
- The destination has a connection to an SAP Vora or SAP HANA system (system type: SAP VORA or SAP HANA). The connection type is one of the following: Hadoop Distributed File System (HDFS), Amazon S3, SAP VORA Catalog, Azure Data Lake (ADL), Google Cloud Storage, Azure Storage Blobs (WASB), or SAP HANA SQL.
Context
A data set can reference the following objects:
- Files and part files in HDFS, ADL, and WASB
- Files in Amazon S3 and Google Cloud Storage
- Directories or folders in HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB
- Tables in SAP Vora
- Tables and Views in SAP HANA
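The list above can also be read as a lookup from connection type to the kinds of objects a data set can reference. The following sketch transcribes it into a dictionary; the names `REFERENCEABLE` and `can_reference` are hypothetical, and pairing SAP Vora tables with the SAP VORA Catalog connection type and SAP HANA objects with SAP HANA SQL is an assumption based on the connection type names used in this document.

```python
# Object kinds per connection type, transcribed from the list above.
REFERENCEABLE = {
    "HDFS": {"file", "part file", "directory"},
    "ADL": {"file", "part file", "directory"},
    "WASB": {"file", "part file", "directory"},
    "Amazon S3": {"file", "directory"},
    "Google Cloud Storage": {"file", "directory"},
    "SAP VORA Catalog": {"table"},
    "SAP HANA SQL": {"table", "view"},
}


def can_reference(connection_type: str, object_kind: str) -> bool:
    """Return True if the list above names this object kind for the connection type."""
    return object_kind in REFERENCEABLE.get(connection_type, set())


print(can_reference("HDFS", "part file"))
print(can_reference("Amazon S3", "part file"))
```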
Procedure
- Start the SAP Data Hub cockpit in a Web browser.
- In the System Status section, choose the Modeling tile.
  The cockpit opens the SAP Data Hub Modeling tool in a new tab in the same browser window.
- Create a data set.
  - In the navigation pane, right-click the project in which you want to create the data set and choose New > Data Set.
  - In the Create Data Set dialog box, provide a name for the data set.
  - If you want to import the data set from the Metadata Catalog, choose Import Data Set.
    You can browse the folders in the Metadata Catalog and select the required data set. If your project already contains destinations that reference a connection necessary for previewing the selected data set, the tool automatically populates the Destination dropdown list with those destinations. You can select the required destination or enter a new name to create a new destination.
  - In the Destination dropdown list, select the required destination.
    The tool populates the dropdown list with destinations that have connections to an SAP Vora or SAP HANA system. The connection type is HDFS, Amazon S3, SAP VORA Catalog, ADL, Google Cloud Storage, WASB, or SAP HANA SQL.
  - Choose Create.
    The tool opens a new editor where you can define your data set.
- Define the data set.
  Depending on the connection type of the selected destination, define the data set in the Properties tab of the data set editor.
  Connection Type: HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB
  Next Steps:
    1. In the File Path field, browse to the required file in HDFS, Amazon S3, ADL, Google Cloud Storage, or WASB.
    2. In the File Format dropdown list, select the file format.
       The tool supports Parquet, CSV, and ORC file formats. For CSV files, you can define additional CSV-specific properties such as the character set, column delimiter, and text delimiter. The tool encodes and parses the CSV files based on the values that you provide.
    3. Select a value for Includes Header.
       This value helps the tool identify whether the selected file contains a header row.
  Connection Type: SAP VORA Catalog
  Next Steps:
    1. In the Table Path field, browse to the required SAP Vora table.
    2. In the Schema Name field, the tool displays the schema name of the selected table or view.
  Connection Type: SAP HANA SQL
  Next Steps:
    1. In the Table Name field, browse to the required SAP HANA table or SAP HANA view.
    2. In the Schema Name field, the tool displays the schema name of the selected table or view.
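To see why the CSV-specific properties and the Includes Header value matter, the following sketch parses a small CSV sample with Python's standard `csv` module. It is not how the Modeling tool works internally; the property keys, the `parse_csv` helper, and the sample content are assumptions chosen to mirror the field names in the data set editor.

```python
import csv
import io

# Hypothetical property values, named after the fields in the data set editor.
properties = {
    "character_set": "utf-8",
    "column_delimiter": ";",
    "text_delimiter": '"',
    "includes_header": True,
}

# Sample file content, invented for illustration.
raw_bytes = 'id;city\n1;"Walldorf; Germany"\n2;Paris\n'.encode("utf-8")


def parse_csv(data, props):
    """Decode and split CSV bytes according to the given properties."""
    text = data.decode(props["character_set"])
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=props["column_delimiter"],
        quotechar=props["text_delimiter"],
    )
    rows = list(reader)
    # With a header row, the first line names the columns instead of holding data.
    if props["includes_header"] and rows:
        return rows[0], rows[1:]
    return None, rows


header, rows = parse_csv(raw_bytes, properties)
print(header)
print(rows)
```

In this sample, the text delimiter keeps the semicolon inside `"Walldorf; Germany"` from being treated as a column boundary, and setting `includes_header` to `False` would return the `id;city` line as a data row.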
- (Optional) Preview data.
  Preview the data of the SAP Vora table, Amazon S3 file, HDFS file, ADL file, Google Cloud Storage file, or WASB file that you have used in the data set definition.
  - In the data set editor, choose the Data Preview tab.
- (Optional) Show structure.
  - If you want the tool to show or hide the structure of the selected file or table, select a value for the Show Structure toggle button.
    The tool displays the structure definition of the selected file or table in the bottom pane.
  - If you want to sort columns by column name or column type, in the Structure Definition section, choose (Sort), and select the sort order and sort type.
- (Optional) Modify structure definition.
  The tool supports multiple ways to modify the structure definition of the file or table that you have used in the data set definition. You can use the structure that the tool proposes, add more columns to the structure definition, or import the structure definition from other files or data sets.
  Type: Add more columns
  Description: This operation is supported only for files in HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB. If you want to modify the structure definition of the selected file by adding more columns, choose + and provide the required column name and column data type.
  Type: Auto propose
  Description: If you want to use the structure definition that the tool proposes for the selected file or table, in the Structure Definition pane, choose Auto Propose.
  Type: Import structure from a file
  Description: This operation is supported only for files in HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB. If you want to modify the structure definition of the selected file by importing a structure from another file, in the Structure Definition pane, choose Import Structure > From File and browse to the required file.
  Type: Import structure from a data set
  Description: This operation is supported only for files in HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB. If you want to modify the structure definition of the selected file by importing a structure from another data set in the same project, in the Structure Definition pane, choose Import Structure > From Data Set and browse to the required data set in the same project.
  Type: Delete a column from the structure
  Description: This operation is supported only for files in HDFS, Amazon S3, ADL, Google Cloud Storage, and WASB. If you want to delete a column from the structure, select the column and choose (Delete Column).
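The structure operations described above can be pictured as edits to an ordered list of column definitions. The sketch below is a conceptual model only, not the tool's implementation: the `Column` alias, the helper functions, and the sample data types are hypothetical.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, column data type)


def add_column(structure: List[Column], name: str, data_type: str) -> List[Column]:
    """Return a copy of the structure with one more column at the end."""
    if any(existing == name for existing, _ in structure):
        raise ValueError(f"column {name!r} already exists")
    return structure + [(name, data_type)]


def delete_column(structure: List[Column], name: str) -> List[Column]:
    """Return a copy of the structure without the named column."""
    return [column for column in structure if column[0] != name]


def import_structure(source: List[Column]) -> List[Column]:
    """Return a copy of the structure taken from another file or data set."""
    return list(source)


# Sample columns and type names are invented for illustration.
proposed = [("ID", "INTEGER"), ("CITY", "VARCHAR")]
extended = add_column(proposed, "COUNTRY", "VARCHAR")
trimmed = delete_column(extended, "CITY")
print(extended)
print(trimmed)
```

Each helper returns a new list, so the proposed structure stays available if a change needs to be discarded.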
- (Optional) Provide values for parameters.
  If you have defined the data set with an SAP HANA object, you can provide values for the input parameters and variables defined for the SAP HANA object. The tool also displays the default values, if any, defined for the variables or parameters.
  - In the Structure Definition pane, choose Auto Propose.
  - Select the Parameters tab.
    In the Parameters tab, the tool lists all input parameters and variables defined for the selected SAP HANA object.
  - For each parameter, select the required operator and provide values.
  - If the selected input parameter was configured to accept multiple values, then in the Parameters tab, choose + to define multiple values for the input parameter.
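One way to picture the parameter settings described above is as a small record per parameter that holds an operator, the provided values, and an optional default. The sketch below is an assumption-laden model: the class `ParameterBinding`, the `"="` operator token, the sample parameter name, and the fallback to the default value when nothing is provided are all hypothetical and not taken from the tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParameterBinding:
    """Hypothetical model of one input parameter or variable."""

    name: str
    multi_value: bool = False      # whether the parameter accepts multiple values
    default: Optional[str] = None  # default value defined for the parameter, if any
    operator: str = "="            # assumed operator token, for illustration only
    values: List[str] = field(default_factory=list)

    def add_value(self, value: str) -> None:
        """Add a value; a second value is allowed only for multi-value parameters."""
        if self.values and not self.multi_value:
            raise ValueError(f"{self.name} accepts a single value only")
        self.values.append(value)

    def effective_values(self) -> List[str]:
        """Return provided values, or the default (assumed fallback) if none were given."""
        if self.values:
            return list(self.values)
        return [self.default] if self.default is not None else []


# Sample parameter, invented for illustration.
year = ParameterBinding(name="P_YEAR", multi_value=True, default="2018")
year.add_value("2019")
year.add_value("2020")
print(year.effective_values())
```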
- Save changes.
  In the global toolbar, choose Save to save the data set.
- Activate the data set.
  After creating a data set object, activate the data set. Activation is necessary to convert the design-time object to its equivalent runtime object in the database.
  - In the global toolbar, choose (Activate) to activate the data set.
