Working with Python2.7 and Python3.6 Subengines to Create Operators
Introduction
SAP Data Hub Modeler subengines are a way to allow customers to code their own operators in a particular programming language and make them available for use in pipelines. The Python subengines can execute operators that are written and stored in the way specified by this guide.
When a user runs a graph (pipeline) on SAP Data Hub Modeler, the main engine (which coordinates all subengines) breaks it into large subgraphs so that every operator in the same subgraph can be run by the same subengine. The main engine then fires a subengine process for each subgraph.
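The partitioning idea described above can be sketched roughly as follows. This is only an illustration and not the Modeler's actual algorithm, which this guide does not specify. The function name `split_into_subgraphs`, the input format, and the operator names are all hypothetical; the sketch simply merges directly connected operators that share a subengine.

```python
from collections import defaultdict


def split_into_subgraphs(operators, connections):
    """Group connected operators that run on the same subengine.

    operators:   dict mapping operator name -> subengine name (hypothetical format)
    connections: list of (source, target) operator-name pairs
    """
    # Union-find: every operator starts in its own group.
    parent = {name: name for name in operators}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    # Merge two operators only when they are connected AND share a subengine.
    for src, dst in connections:
        if operators[src] == operators[dst]:
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for name in operators:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical four-operator pipeline: two Python 3.6 operators in the middle.
operators = {"reader": "main", "clean": "python36", "score": "python36", "writer": "main"}
connections = [("reader", "clean"), ("clean", "score"), ("score", "writer")]
subgraphs = split_into_subgraphs(operators, connections)
```

In this sketch the two adjacent `python36` operators end up in one subgraph, which would correspond to a single subengine process, while `reader` and `writer` stay separate because they are not directly connected to each other.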
This guide will show how you can create new operators for both the Python2.7 and Python3.6 subengines. Both subengines are very similar, with the exception that one accepts only Python 2.7 and the other only Python 3.6 compatible code. Another important difference is the mapping between Python types and SAP Data Hub Modeler types. The table below shows the type mapping for both subengines and SAP Data Hub Modeler's types:
| SAP Data Hub Modeler | Python2.7 | Python3.6 |
|---|---|---|
| string | unicode | str |
| blob | str | bytes |
| int64 | int, long | int |
| uint64 | int, long | int |
| float64 | float | float |
| byte | int | int |
| message | Message | Message |
| []x | list | list |
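To make the mapping concrete, the Python3.6 column can be written as a small lookup. This is a minimal sketch for illustration and is not part of the subengine API: `PYTHON36_TYPES` and `fits_port_type` are hypothetical names, and the `message` and `[]x` rows are left out to keep the example self-contained.

```python
# Python3.6 column of the type table, written as a lookup (illustrative only).
PYTHON36_TYPES = {
    "string": str,
    "blob": bytes,
    "int64": int,
    "uint64": int,
    "float64": float,
    "byte": int,
}


def fits_port_type(value, modeler_type):
    """Return True if value has the Python 3.6 type listed for modeler_type."""
    return isinstance(value, PYTHON36_TYPES[modeler_type])


text_ok = fits_port_type("hello", "string")    # str maps to string
bytes_ok = fits_port_type(b"hello", "blob")    # bytes maps to blob
mismatch = fits_port_type(b"hello", "string")  # bytes is not a string in Python 3.6
```

The last line shows the difference that most often matters when moving code between the two subengines: in Python2.7 a plain `str` corresponds to blob, whereas in Python3.6 `str` corresponds to string and `bytes` corresponds to blob.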
If you create operators that receive or send data types with no mapping to SAP Data Hub Modeler types, you must create ports of type python27 or python36. For example, you can connect many Python2Operators whose inports and outports are all of type python27; those operators can then pass any Python object among themselves (including objects that have no corresponding SAP Data Hub Modeler type, such as sets or numpy arrays). There is one caveat: subengine-specific types cannot cross the boundary between two SAP Data Hub Modeler groups (see step 9). However, if you place your Python-specific object inside the body of a Message, the body is serialized and deserialized with pickle (a Python module) when it crosses the boundary between two groups. This pickle serialization of the body is specific to the Python subengines and should not be expected in other subengines.
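The effect of the pickle round trip on a message body can be sketched with the standard library alone. The variable names below are hypothetical, and the snippet only imitates what is described above; it does not show how the subengine invokes pickle internally.

```python
import pickle

# A body holding a Python-specific object (a set) that has no Modeler type.
body = {"seen_ids": {1, 2, 3}}

# Roughly what happens to the body when it crosses a group boundary:
wire_bytes = pickle.dumps(body)           # serialized on the sending side
restored_body = pickle.loads(wire_bytes)  # deserialized on the receiving side
```

After the round trip, `restored_body["seen_ids"]` is a set again, which is why wrapping such objects in a message body lets them survive the boundary. Objects that pickle cannot serialize (for example, open file handles) would not survive this step.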
A Message is created with Message(body, attributes), where body can be any object and attributes is a dictionary mapping strings to any object. If m is an object of type Message, you can use m.body and m.attributes to access the fields initialized in the constructor. The attributes argument is optional, so if you want empty attributes you can construct your message with just Message(body).
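Inside an operator the subengine provides the Message class. To try the constructor outside SAP Data Hub, a stand-in like the one below can be used. It is a hypothetical sketch that mirrors only what this guide describes; in particular, defaulting attributes to an empty dictionary is an assumption about how "empty attributes" is represented.

```python
class Message(object):
    """Hypothetical stand-in for the Message class supplied by the subengine."""

    def __init__(self, body, attributes=None):
        self.body = body
        # Assumption: omitted attributes are represented as an empty dictionary.
        self.attributes = {} if attributes is None else attributes


# Message with both body and attributes.
m = Message("some payload", {"source": "sensor-1"})
payload = m.body
source = m.attributes["source"]

# Message with body only; attributes are left empty.
m_plain = Message([1, 2, 3])
```

A stand-in of this kind is mainly useful for unit tests written on your own machine, where the real class is not importable.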
We recommend not using a Message inside the body of another Message. Instead, use a dictionary inside the body of a Message. This dictionary should then have the keys "Body" and "Attributes". If you use a Message object as the body of another Message, you may get unexpected behavior when the Message is transferred across the boundary between different subengines. For example, the inner Message may be automatically converted to a dictionary when it crosses that boundary, but it is not converted back to a Message when it returns to your operator's subengine. A dictionary, by contrast, does not change type during communication, so we recommend always using dictionaries rather than messages inside the body of a message.
    # DO NOT DO THE FOLLOWING:
    inner_msg = Message("body_of_inner_msg")
    outer_msg = Message(body=inner_msg, attributes={})

    # If you want to have a message as the body of another message, do this instead:
    inner_msg = {"body": "body_of_inner_msg"}
    outer_msg = Message(inner_msg)
This section is divided into two parts: normal usage and advanced usage. The first part explains how to use the Python subengines through the SAP Data Hub Modeler UI alone. The second part shows how to create new Python operators on your own machine and then upload them to SAP Data Hub. This advanced way of creating operators is useful when you want to code in your own IDE and write unit tests for your operators.