Modeling Guide for SAP Data Hub

Groups, Tags, and Dockerfiles

This section describes working with Groups, Tags, and Dockerfiles in SAP Data Hub.

Groups

When you execute a graph with groups, each group's subgraph runs in a separate Docker container, possibly with a different Docker image. The Docker image used by a group is selected automatically based on the tags associated with that group. Operators inside the same group, on the other hand, are guaranteed to run on the same node. Each group can also be configured with its own restart policy, tags, or multiplicity.

The most common use case for groups is to distribute work among many compute nodes, either by partitioning the graph into many groups, by setting a multiplicity larger than 1 for a group, or both. This can lead to better graph throughput and cluster utilization. A user may also want different restart policies for different groups. For example, in one group the user may want the container to be redeployed when it fails, while in a second group the user may want the whole graph to terminate if that group fails. Finally, groups are also useful when no single Dockerfile currently satisfies the requirements of the whole graph. In this case, the user needs to partition the graph into groups in such a way that, for each group, at least one Dockerfile exists that satisfies its requirements.

A graph with no explicitly defined group has only one group (called the default group), which contains all of the graph's operators. The user can further partition the graph by assigning subsets of operators to explicit groups. For example, suppose a graph has the following topology: A->B->C->D->E. If no explicit groups are defined, then all those operators run in the default group. Now suppose we create two explicit groups: group-1 containing A and B, and group-2 containing only E. The topology would look like this: (A->B)->C->D->(E). Now the graph has three groups: group-1 (A, B), the default group (C, D), and group-2 (E).
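
The partitioning described above can be illustrated with a minimal Python sketch. This is not SAP Data Hub code: the function name partition_into_groups, its parameters, and the name "default" for the default group are hypothetical and only model the rule that every operator not assigned to an explicit group falls into the default group.

```python
def partition_into_groups(operators, explicit_groups, default_name="default"):
    """Return a mapping of group name -> operators, keeping graph order.

    Hypothetical helper for illustration only; it is not part of SAP Data Hub.
    """
    # Reverse lookup: operator -> explicit group it was assigned to.
    owner = {
        op: name
        for name, members in explicit_groups.items()
        for op in members
    }
    groups = {name: [] for name in explicit_groups}
    for op in operators:
        # Operators without an explicit group land in the default group.
        groups.setdefault(owner.get(op, default_name), []).append(op)
    return groups


groups = partition_into_groups(
    ["A", "B", "C", "D", "E"],
    {"group-1": ["A", "B"], "group-2": ["E"]},
)
print(groups)
# {'group-1': ['A', 'B'], 'group-2': ['E'], 'default': ['C', 'D']}
```

With no explicit groups, the same call returns a single default group holding every operator.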

Tags and Dockerfiles

When executing a graph, a Docker image is selected for each group based on its tags. If more than one Dockerfile satisfying those tags is found, the application selects the one with the fewest tags (if multiple satisfying Dockerfiles have the same fewest number of tags, ties are broken arbitrarily). The selected Dockerfile is built upon graph execution if the corresponding image has not already been cached.

Each tag represents a runtime requirement for the group (for example, packages and libraries) and is specified by a pair (<resource_id>: <resource_version>). Some examples are: ("python36": ""), ("opencv": "") and ("tornado": "5.0.2"). An empty <resource_version> implies that a Docker image containing any version of <resource_id> is enough.

The tags associated with a group are the union of each operator's tags and the ones specified in the group configuration. Once the resulting tags are calculated, a Dockerfile satisfying all the group's tags is searched for in the repository directory.

If two operators in the same group have a tag with the same <resource_id> but different <resource_version>, then, when calculating the resulting tags for the group, those two tags are merged into one and the final <resource_version> takes the value of the more specific one, provided both versions are compatible. If they are incompatible, an error is thrown. For example, if one operator has the tag ("foo": "1.1") and another has ("foo": "1.1.2"), then the result of the merge is ("foo": "1.1.2") because "1.1.2" is more specific than "1.1". On the other hand, if the second operator required version "2.1.1", an error would occur because the versions are not compatible: they do not share a common prefix.
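
The merge rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the product's implementation: the helper names (split_version, merge_versions, merge_tags) are hypothetical, and versions are assumed to be compared component by component after splitting on ".", which the text above does not spell out.

```python
def split_version(version):
    """Split "1.1.2" into ["1", "1", "2"]; an empty version has no components."""
    return version.split(".") if version else []


def merge_versions(a, b):
    """Return the more specific of two compatible versions.

    Assumption: two versions are compatible when one is a component-wise
    prefix of the other. Incompatible versions raise ValueError.
    """
    parts_a, parts_b = split_version(a), split_version(b)
    shorter, longer = sorted((parts_a, parts_b), key=len)
    if longer[:len(shorter)] != shorter:
        raise ValueError(f"incompatible versions: {a!r} and {b!r}")
    return ".".join(longer)


def merge_tags(*tag_sets):
    """Union several tag dictionaries, merging versions of repeated ids."""
    merged = {}
    for tags in tag_sets:
        for resource_id, version in tags.items():
            if resource_id in merged:
                version = merge_versions(merged[resource_id], version)
            merged[resource_id] = version
    return merged


print(merge_tags({"foo": "1.1"}, {"foo": "1.1.2"}))  # {'foo': '1.1.2'}
```

Calling merge_tags({"foo": "1.1.2"}, {"foo": "2.1.1"}) raises ValueError in this sketch, mirroring the error case described above.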

When searching for Dockerfiles that satisfy the resulting tags of a group, it is necessary to determine whether a particular group tag is satisfied by some tag of a Dockerfile. A group tag G (<resourceG_id>: <resourceG_version>) is satisfied by a Dockerfile tag D (<resourceD_id>: <resourceD_version>) if and only if <resourceG_id> equals <resourceD_id>, <resourceG_version> shares a common prefix with <resourceD_version>, and the former is not more specific than the latter. For example, the group tag ("foo": "1.1") is satisfied by the Dockerfile tag ("foo": "1.1") or ("foo": "1.1.2"), but not by ("foo": "1"), ("foo": ""), or ("bar": "1.1"). If more than one Dockerfile satisfies the tags of a group, the specific <resource_version> values defined in each Dockerfile are not used as a tie-breaking criterion.
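
The satisfaction rule can be checked with a small Python sketch. The names split_version and tag_satisfied are hypothetical, and the sketch assumes that "shares a common prefix and is not more specific" means the group version is a component-wise prefix of the Dockerfile version.

```python
def split_version(version):
    """Split "1.1.2" into ["1", "1", "2"]; an empty version has no components."""
    return version.split(".") if version else []


def tag_satisfied(group_tag, dockerfile_tag):
    """Return True if the group tag is satisfied by the Dockerfile tag.

    Each tag is a (resource_id, resource_version) pair. Assumption: the
    group version must be a component-wise prefix of the Dockerfile version.
    """
    (group_id, group_version) = group_tag
    (docker_id, docker_version) = dockerfile_tag
    if group_id != docker_id:
        return False
    wanted = split_version(group_version)
    offered = split_version(docker_version)
    return offered[:len(wanted)] == wanted


group_tag = ("foo", "1.1")
for dockerfile_tag in [("foo", "1.1"), ("foo", "1.1.2"), ("foo", "1"),
                       ("foo", ""), ("bar", "1.1")]:
    print(dockerfile_tag, tag_satisfied(group_tag, dockerfile_tag))
```

The loop prints True for the first two Dockerfile tags and False for the remaining three, matching the examples in the paragraph above.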

If no Dockerfile can satisfy the requirements of one of the groups in your graph, an error message is shown when you try to run it. In this case, there are two options. You can either split the problematic group into smaller groups so that each of them matches some existing Dockerfile, or you can create a new Dockerfile that meets all the group's requirements.

Example

Suppose we have 3 Dockerfiles defined in our repository with the following associated tags:

  • com.sap.d1: {"foo": "", "bar": "1.2.3", "baz": "", "qux": "1.1.1", "abc": ""}
  • com.sap.d2: {"foo": "", "corge": "2.2.2"}
  • com.sap.d3: {"xyz": ""}

Also, suppose we have a graph with the following topology: (A->B->C)->D->E. It has one explicit group (group-1) containing (A, B, C) and the default group containing (D, E). Assume the groups' configurations and operators have the following tags:

  • group-1 configuration: {"foo": "", "bar": ""}
  • default group configuration: {}
  • operator A: {"baz": ""}
  • operator B: {"bar": "1.2", "qux": "1.1.1"}
  • operator C: {}
  • operator D: {}
  • operator E: {"foo": ""}

By taking the union of the tags associated with each group, we get the following result:

  • group-1: {"foo": "", "bar": "1.2", "baz": "", "qux": "1.1.1"}
  • default group: {"foo": ""}

Now, we need to find a Dockerfile to be used by each group. Group-1's aggregated tags can only be satisfied by the com.sap.d1 Dockerfile, because it is the only one that provides "bar", "baz", and "qux". This Dockerfile has one more tag ("abc": "") than group-1 requires, which is allowed. Also, the "1.2.3" version for "bar" in com.sap.d1 satisfies group-1, which requires version "1.2", a more generic version than "1.2.3". On the other hand, the default group has 2 Dockerfiles that satisfy its aggregated tags: com.sap.d1 and com.sap.d2. In such scenarios, the one with fewer tags is chosen, which in this case is com.sap.d2.
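
The selection in this example can be reproduced with the following Python sketch. It is an illustration only: select_dockerfile and the other helper names are hypothetical, versions are assumed to be compared component by component, and ties between Dockerfiles with the same number of tags are resolved here by taking the first candidate, which stands in for the arbitrary tie-breaking described earlier.

```python
def split_version(version):
    """Split "1.2.3" into ["1", "2", "3"]; an empty version has no components."""
    return version.split(".") if version else []


def version_satisfied(group_version, dockerfile_version):
    """Assumption: the group version must be a prefix of the Dockerfile version."""
    wanted = split_version(group_version)
    return split_version(dockerfile_version)[:len(wanted)] == wanted


def select_dockerfile(group_tags, dockerfiles):
    """Pick the satisfying Dockerfile with the fewest tags (hypothetical helper)."""
    candidates = [
        name
        for name, tags in dockerfiles.items()
        if all(
            resource_id in tags and version_satisfied(version, tags[resource_id])
            for resource_id, version in group_tags.items()
        )
    ]
    if not candidates:
        raise LookupError("no Dockerfile satisfies the group's tags")
    # Fewest tags wins; min() keeps the first candidate on a tie.
    return min(candidates, key=lambda name: len(dockerfiles[name]))


dockerfiles = {
    "com.sap.d1": {"foo": "", "bar": "1.2.3", "baz": "", "qux": "1.1.1", "abc": ""},
    "com.sap.d2": {"foo": "", "corge": "2.2.2"},
    "com.sap.d3": {"xyz": ""},
}
group_1 = {"foo": "", "bar": "1.2", "baz": "", "qux": "1.1.1"}
default_group = {"foo": ""}

print(select_dockerfile(group_1, dockerfiles))        # com.sap.d1
print(select_dockerfile(default_group, dockerfiles))  # com.sap.d2
```

A group whose tags no Dockerfile can satisfy, such as {"missing": ""}, raises LookupError in this sketch, which corresponds to the error shown when a graph cannot be run.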

Operator Tags

You can view or edit an operator's tags in the operator editor. Right-click the operator and choose Edit. Operators that run on subengines may have implicit tags. Those tags do not appear on the operator editor screen, but they are included in the requirements of any group that uses the operator. You can find the implicit tags for each subengine in the subengine documentation.