Sizing of Replications
Replication Management Service (RMS) allows you to replicate a number of tables (or other data structures) from a source system to a target system.
Sizing of Replication Flows in Data Intelligence Cloud
Like all other workloads, replications are executed in the form of pipelines, which are called worker graphs in this case. The worker graphs run in the background when replication flows are started via the replication flow user interface. Worker graphs are internal and cannot be modified like the regular pipelines that a user creates in the Modeler application. A replication flow allows one tenant to replicate an arbitrary number (N) of data sets (for example, tables or CDS views) from one source system to one target system. Each data set constitutes a task of the replication flow. All data is replicated in relatively small chunks (typically between 10 MB and 30 MB during the initial load), so that the memory usage of the worker graphs remains constant.
Number of worker graphs per replication flow = Number of connections per replication flow / 5
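As a rough illustration of the formula above, the following Python sketch computes the number of worker graphs for one replication flow. The function and constant names are hypothetical, and rounding up to a whole worker graph is an assumption, because the formula itself does not state how fractional results are handled.

```python
import math

# Taken from the formula above: one worker graph per 5 connections.
CONNECTIONS_PER_WORKER_GRAPH = 5


def worker_graphs_per_flow(connections: int) -> int:
    """Return the number of worker graphs for one replication flow.

    Assumption: a partially used worker graph still counts as a whole
    one, so the result is rounded up.
    """
    if connections < 1:
        raise ValueError("connections must be a positive integer")
    return math.ceil(connections / CONNECTIONS_PER_WORKER_GRAPH)


print(worker_graphs_per_flow(10))  # 2
print(worker_graphs_per_flow(12))  # 3 (rounded up, by assumption)
```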
To scale out replications in order to achieve better throughput, there are two options:
| Option | Result |
|---|---|
| Increase the number of replication flows. For example, don't replicate all tables as tasks within one replication flow, but distribute the tables (tasks) over several replication flows that can run in parallel. | Each additional replication flow creates its own background processes in the source and target systems. Scale the number of replication flows carefully so as not to overload the source or target system with background processes. |
| Increase the number of connections within a replication flow to get more threads and, equivalently, more worker graphs. | With the start of multiple connections to the source and target systems, background processes get blocked for the handling of the data packages. Scale the number of connections carefully to not block too many resources in the system. |
Scale-out works only up to a certain point: eventually throughput reaches a threshold imposed by the constraints of the source and target systems or by other limitations. Up to this threshold (which depends on the details of the connected systems), the trade-off between higher throughput with higher resource consumption (higher TCO) and lower throughput with lower resource consumption (lower TCO) is a matter of configuration.
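In the following formulas, A is the number of replication flows and B is the number of connections per replication flow, so that A * [B/5] is the total number of worker graphs:
RAM_RMS = A * [B/5] * 2 GB
CPU_RMS = A * [B/5] * 1.5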
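The two resource formulas can be evaluated with a short Python sketch such as the one below. It assumes that A is the number of replication flows, that B is the number of connections per replication flow, and that the square brackets mean rounding up to a whole number of worker graphs. All names in the code are hypothetical.

```python
import math

RAM_GB_PER_WORKER_GRAPH = 2.0  # factor from RAM_RMS = A * [B/5] * 2 GB
CPU_PER_WORKER_GRAPH = 1.5     # factor from CPU_RMS = A * [B/5] * 1.5


def rms_resources(flows: int, connections_per_flow: int) -> tuple[float, float]:
    """Return (RAM in GB, CPU) for the given configuration.

    Assumptions: flows is A, connections_per_flow is B, and [B/5] is
    rounded up to a whole number of worker graphs.
    """
    worker_graphs = flows * math.ceil(connections_per_flow / 5)
    ram_gb = worker_graphs * RAM_GB_PER_WORKER_GRAPH
    cpu = worker_graphs * CPU_PER_WORKER_GRAPH
    return ram_gb, cpu


# Example: 3 replication flows with 10 connections each (6 worker graphs).
ram_gb, cpu = rms_resources(flows=3, connections_per_flow=10)
print(ram_gb, cpu)  # 12.0 9.0
```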
If there is a requirement for total throughput (in MB/s or GB/h), and if this required throughput is still in the realm of linear scalability for the given constellation, it can be achieved by configuring A and B such that:
A * [B/5] >= (required throughput) / (throughput per worker graph)
The throughput that is achieved per worker graph depends on the specific scenario. However, various measurements show that around 5 MB/s (uncompressed data volume) is often a good estimate.
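To make the inequality concrete, the following Python sketch derives a minimal configuration from a throughput requirement. It is only an estimate under stated assumptions: the 5 MB/s per worker graph is the rule of thumb quoted above and varies by scenario, the requirement is assumed to lie within the range of linear scalability, [B/5] is assumed to round up, and the function names are hypothetical.

```python
import math


def min_worker_graphs(required_mb_s: float, per_graph_mb_s: float = 5.0) -> int:
    """Smallest total number of worker graphs, A * [B/5], that satisfies
    A * [B/5] >= required throughput / throughput per worker graph."""
    if required_mb_s <= 0 or per_graph_mb_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(required_mb_s / per_graph_mb_s)


def min_connections_per_flow(
    required_mb_s: float, flows: int, per_graph_mb_s: float = 5.0
) -> int:
    """Smallest B (as a multiple of 5) for a fixed number of flows A."""
    if flows < 1:
        raise ValueError("flows must be a positive integer")
    total_graphs = min_worker_graphs(required_mb_s, per_graph_mb_s)
    graphs_per_flow = math.ceil(total_graphs / flows)
    return graphs_per_flow * 5


# Example: a requirement of 28 MB/s at an estimated 5 MB/s per worker graph.
print(min_worker_graphs(28.0))            # 6
print(min_connections_per_flow(28.0, 2))  # 15
print(min_connections_per_flow(28.0, 4))  # 10
```

With four replication flows the result is 8 worker graphs rather than 6, because each flow needs a whole number of worker graphs. Fewer flows with more connections and more flows with fewer connections can therefore consume different amounts of RAM and CPU for the same throughput target.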