hana_ml.graph.algorithms package

This package contains various algorithms you can use to explore and work on a graph.

The general pattern is: Create an algorithm object instance (which expects a Graph object in the constructor) and then call execute(<parameters>) on that algorithm instance. This can be combined in one statement.

>>> import hana_ml.graph.algorithms as hga
>>> sp = hga.ShortestPath(graph=g).execute(source="1", target="3")

The execute statement always returns the algorithm instance itself, so that it can used to access the result properties.

>>> print("Vertices", sp.vertices)
>>> print("Edges", sp.edges)
>>> print("Weight:", sp.weight)

If you want to create a new algorithm, have a closer look at algorithm_base.AlgorithmBase.

The following algorithms are available:

KShortestPaths

Neighbors

NeighborsSubgraph

ShortestPath

ShortestPathsOneToAll

StronglyConnectedComponents

WeaklyConnectedComponents

TopologicalSort

CommunitiesLouvain

class hana_ml.graph.algorithms.ShortestPath(graph: Graph)

Bases: AlgorithmBase

Given a source and target vertex_key with optional weight and direction, get the shortest path between them.

The procedure may fail for HANA versions prior to SP05 therefore this is checked at execution time.

The user can take the results and visualize them with libraries such as networkX using the edges() property.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> sp = hga.ShortestPath(graph=g).execute(source="1", target="3")
>>> print("Vertices", sp.vertices)
>>> print("Edges", sp.edges)
>>> print("Weight:", sp.weight)

execute(source: str, target: str, weight: str = None, direction: str = 'OUTGOING') → ShortestPath

Executes the calculation of the shortest path.

Parameters:

sourcestr

Vertex key from which the shortest path will start.

targetstr

Vertex key from which the shortest path will end.

weightstr, optional

Variable for column name to which to apply the weight.

Defaults to None.

directionstr, optional

OUTGOING, INCOMING, or ANY which determines the algorithm results.

Defaults to OUTGOING.

Returns:

ShortestPath: ShortestPath object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the vertices of the shortest path

property edges: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the edges of the shortest path

property weight: float

Weight of the shortest path. Returns 1.0 if no weight column was provided to the execute() call. Returns -1.0 as initial value.

Returns:

float: Weight of the shortest path.

class hana_ml.graph.algorithms.Neighbors(graph: Graph)

Bases: _NeighborsBase

Get a virtual subset of the graph based on a start_vertex and all vertices within a lower_bound->upper_bound count of degrees of separation.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> nb = hga.Neighbors(graph=g).execute(start_vertex="1")
>>> print("Vertices", nb.vertices)

execute(start_vertex: str, direction: str = 'OUTGOING', lower_bound: int = 1, upper_bound: int = 1) → Neighbors

Executes the calculation of the neighbors.

Parameters:

start_vertexstr

Source from which the subset is based.

directionstr, optional

OUTGOING, INCOMING, or ANY which determines the algorithm results.

Defaults to OUTGOING.

lower_boundint, optional

The count of degrees of separation from which to start considering neighbors. If you want to include the start node into consideration, set lower_bound=0.

Defaults to 1.

upper_boundint, optional

The count of degrees of separation at which to end considering neighbors.

Defaults to 1.

Returns:

Neighbors: Neighbors object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the vertices

class hana_ml.graph.algorithms.NeighborsSubgraph(graph: Graph)

Bases: _NeighborsBase

Get a virtual subset of the graph based on a start_vertex and all vertices within a lower_bound->upper_bound count of degrees of separation. The result is similar to neighbors() but includes edges which could be useful for visualization.

Note: The edges table also contains edges between neighbors, if there are any (not only edges from the start vertex).

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> nb = hga.NeighborsSubgraph(graph=g).execute(start_vertex="1")
>>> print("Vertices", nb.vertices)
>>> print("Edges", nb.edges)

execute(start_vertex: str, direction: str = 'OUTGOING', lower_bound: int = 1, upper_bound: int = 1) → NeighborsSubgraph

Executes the calculation of the neighbors with edges.

Parameters:

start_vertexstr

Source from which the subset is based.

directionstr, optional

OUTGOING, INCOMING, or ANY which determines the algorithm results.

Defaults to OUTGOING.

lower_boundint, optional

The count of degrees of separation from which to start considering neighbors. If you want to include the start node into consideration, set lower_bound=0.

Defaults to 1.

upper_boundint, optional

The count of degrees of separation at which to end considering neighbors.

Defaults to 1.

Returns:

NeighborsSubgraph: NeighborsSubgraph object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the vertices

property edges: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the edges between neighbors

class hana_ml.graph.algorithms.KShortestPaths(graph: Graph)

Bases: AlgorithmBase

Given a source and target vertex_key with optional weight, get the the Top-k shortest paths between them.

The procedure may fail for HANA versions prior to SP05 therefore this is checked at execution time.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> topk = hga.KShortestPaths(graph=g).execute(source="1", target="3", k=3)

>>> print("Paths", topk.paths)

execute(source: str, target: str, k: int, weight: str = None) → KShortestPaths

Executes the calculation of the top-k shortest paths.

Parameters:

sourcestr

Vertex key from which the shortest path will start.

targetstr

Vertex key from which the shortest path will end.

kint

Number of paths that will be calculated

weightstr, optional

Variable for column name to which to apply the weight.

Defaults to None.

Returns:

KShortestPaths: KShortestPaths object instance

property paths: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the paths

class hana_ml.graph.algorithms.TopologicalSort(graph: Graph)

Bases: AlgorithmBase

Calculates the topological sort if possible.

A topological ordering of a directed graph is a linear ordering such that for each edge the source vertex comes before the target vertex in the row.

The topological order is not necessarily unique. A directed graph is topological sortable if and only if it does not contain any directed cycles.

There are some common used algorithms for finding a topological order in the input directed graph. Our implementation is based on the depth-first search.

In case the directed graph contains a directed cycle, the Boolean property is_sortable() returns with the value 'False'. Otherwise, the algorithm returns a topological order.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> ts = hga.TopologicalSort(graph=g).execute()
>>> print("Vertices", ts.vertices)
>>> print("Sortable", ts.is_sortable)

execute() → TopologicalSort

Executes the topological sort.

Returns:

TopologicalSort: TopologicalSort object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the topologically sorted vertices

property is_sortable: bool

Flag if the graph is topologically sortable or not. (e.g. false for cyclic graphs)

Returns:

bool: Weight of the shortest path.

class hana_ml.graph.algorithms.ShortestPathsOneToAll(graph: Graph)

Bases: AlgorithmBase

Calculates the shortest paths from a start vertex to all other vertices in the graph.

The procedure may fail for HANA versions prior to SP05 therefore this is checked at execution time.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> spoa = hga.ShortestPathsOneToAll(graph=g).execute(
        source=2257, direction='OUTGOING', weight='DIST_KM'
    )
>>> print("Vertices", spoa.vertices)
>>> print("Edges", spoa.edges)

execute(source: str, weight: str = None, direction: str = 'OUTGOING') → ShortestPathsOneToAll

Executes the calculation of the shortest paths one to all.

Parameters:

sourcestr

Vertex key from which the shortest paths one to all will start.

weightstr, optional

Variable for column name to which to apply the weight.

Defaults to None.

directionstr, optional

OUTGOING, INCOMING, or ANY which determines the algorithm results.

Defaults to OUTGOING.

Returns:

ShortestPathsOneToAll: ShortestPathOneToAll object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the vertices and the distance to the start vertex

property edges: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the edges which are on one of the shortest paths

class hana_ml.graph.algorithms.StronglyConnectedComponents(graph: Graph)

Bases: AlgorithmBase

Identifies the strongly connected components of a graph.

A directed graph is called strongly connected if each of its vertices is reachable of every other ones. Being strongly connected is an equivalence relation and therefore the strongly connected components (scc) of the graph form a partition on the vertex set.

The induced subgraphs on these subsets are the strongly connected components. Note, that each vertex of the graph is part of exactly one scc, but not every edge is part of any scc (if yes, then in only one scc).

In case each scc contains only one vertex, the graph is a directed acyclic graph, as in all strongly connected graphs there should exist a cycle on all of its vertices.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> scc = hga.StronglyConnectedComponents(graph=g).execute()
>>> print("Vertices", scc.vertices)
>>> print("Components", scc.components)
>>> print("Number of Components", scc.components_count)

execute() → StronglyConnectedComponents

Executes Strongly Connected Components.

Returns:

StronglyConnectedComponents: StronglyConnectedComponents object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains an assignment of each vertex to a strongly connected component.

property components: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains strongly connected components and number of vertices in each component.

property components_count: int

Returns:

Int: The number of strongly connected components in the graph.

class hana_ml.graph.algorithms.WeaklyConnectedComponents(graph: Graph)

Bases: AlgorithmBase

Identifies (weakly) connected components.

An undirected graph is called connected if each of its vertices is reachable of every other ones. A directed graph is called weakly connected if the undirected graph, naturally derived from that, is connected. Being weakly connected is an equivalence relation between vertices and therefore the weakly connected components (wcc) of the graph form a partition on the vertex set.

The induced subgraphs on these subsets are the weakly connected components. Note, that each vertex end each edge of the graph is part of exactly one wcc.

The calculation is started by calling execute().

Examples

>>> import hana_ml.graph.algorithms as hga
>>> cc = hga.WeaklyConnectedComponents(graph=g).execute()
>>> print("Vertices", cc.vertices)
>>> print("Components", cc.components)
>>> print("Number of Components", cc.components_count)

execute() → WeaklyConnectedComponents

Executes the connected component.

Returns:

WeaklyConnectedComponents: WeaklyConnectedComponents object instance

property vertices: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains the [wie in strongly connected components]

property components: DataFrame

Returns:

pandas.Dataframe: A Pandas DataFrame that contains connected components and number of vertices in each component.

property components_count: int

Returns:

int: The number of weakly connected components in the graph.

class hana_ml.graph.algorithms.algorithm_base.AlgorithmBase(graph: Graph)

Bases: object

Algorithm base class, every algorithm should derive from.

To implement a new algorithm you have to do the following:

Create a new class, which derives from AlgorithmBase
Implement the constructor
- Set self._graph_script. It can contain {key} templates witch are processed by self._graph_script.format() at runtime. Here is an example from ShortestPath implementation:

>>> self._graph_script = """
                    DO (
                        IN i_startVertex {vertex_dtype} => '{start_vertex}',
                        IN i_endVertex {vertex_dtype} => '{end_vertex}',
                        IN i_direction NVARCHAR(10) => '{direction}',
                        OUT o_vertices TABLE ({vertex_columns}, "VERTEX_ORDER" BIGINT) => ?,
                        OUT o_edges TABLE ({edge_columns}, "EDGE_ORDER" BIGINT) => ?,
                        OUT o_scalars TABLE ("WEIGHT" DOUBLE) => ?
                    )
                    LANGUAGE GRAPH
                    BEGIN
                        GRAPH g = Graph("{schema}", "{workspace}");
                        VERTEX v_start = Vertex(:g, :i_startVertex);
                        VERTEX v_end = Vertex(:g, :i_endVertex);
                        {weighted_definition}
                        o_vertices = SELECT {vertex_select}, :VERTEX_ORDER FOREACH v
                                     IN Vertices(:p) WITH ORDINALITY AS VERTEX_ORDER;
                        o_edges = SELECT {edge_select}, :EDGE_ORDER FOREACH e
                                  IN Edges(:p) WITH ORDINALITY AS EDGE_ORDER;
                        DOUBLE p_weight= DOUBLE(WEIGHT(:p));
                        o_scalars."WEIGHT"[1L] = :p_weight;
                    END;
                    """

Set self._graph_script_vars. This is a dictionary with tuples which define the parameters being used and replaced in the self._graph_script template. The key of the dictionary is the name of the placeholder in the graph script.

You can either map it to parameters from execute()'s signature or assign default values. If you have placeholders in the script, which need to be calculated, you can fo that by overwriting _process_parameters() and set them there. The Each tuple in the list is expected to have the following format:

>>> {
        "name_in_graph_script": (
            "parameter name|None",  : Parameter name expected in
                                      execute() signature. If the placeholder
                                      is only needed in the script, this
                                      can be set to None. Then it's not expected
                                      as a signiture parameter.
            mandatory: True|False   : Needs to be passed to execute()
            type|None,              : Expected Type
            default value|None      : Default value for optional
                                      parameters or parameters not
                                      part of the execute() signature
        )
    }

The default value do not need to be static, but they can be dynamic as well. There are also some convenient functions available to provide the most common replacement strings in scripts. Here is a more complex example from the ShortestPath implementation:

>>> self._graph_script_vars = {
        "start_vertex": ("source", True, None, None),
        "end_vertex": ("target", True, None, None),
        "weight": ("weight", False, str, None),
        "direction": ("direction", False, str, DEFAULT_DIRECTION),
        "schema": (None, False, str, self._graph.schema),
        "workspace": (None, False, str, self._graph.workspace_name),
        "vertex_dtype": (None, False, str, self._graph.vertex_key_col_dtype),
        "vertex_columns": (None, False, str, self._default_vertex_cols()),
        "edge_columns": (None, False, str, self._default_edge_cols()),
        "vertex_select": (None, False, str, self._default_vertex_select("v")),
        "edge_select": (None, False, str, self._default_edge_select("e")),
    }

If necessary, overwrite _process_parameters()
If necessary, overwrite _validate_parameters()

Parameters:

graph: Graph: Graph object, the algorithm is executed on

Methods

`execute`(**kwargs)	Execute the algorithm
`projection_expr_from_cols`(source, variable)	Turn columns into a string for projection expression
`signature_from_cols`(source[, column_filter])	Turn columns into a string for script parameters

static signature_from_cols(source: DataFrame, column_filter: list = None) → str

Turn columns into a string for script parameters

A common pattern in graph scripts is the definition of OUT parameters based on tables: OUT o_edges TABLE (<edge_columns>) => ?. Where the edge_columns are dynamically depending on the graph definition. Therefore they need to be derived at runtime from the give graph's edges or vertices.

This helper method turns the edge or graph columns (or a subset) into a string that can be used in the graph script replacing a placeholder.

Parameters:

source: DataFrame: The DataFrame to read the columns from
column_filter: list, optional: Subset of columns to be considered. If None, all columns are used

Returns:

strString in the form of "<column name>" <data_type>: Example: "edge_id" INT, "from" INT, "to" INT

static projection_expr_from_cols(source: DataFrame, variable: str, column_filter: list = None) → str

Turn columns into a string for projection expression

A common pattern in graph script is the assignment of projection expressions to an OUT parameter. These expressions define a SELECT statement from a container element and therefore need a list of columns that should be selected. Example: SELECT <columns> FOREACH <variable> IN Edges(...)

This helper method turns the edge or graph columns (or a subset) into a string that can be used in the select statement, replacing a placeholder.

Parameters:

source:: The DataFrame to read the columns from
variable:: Name of the iterator variable used in the script. Will prefix the column names.
column_filter:: Subset of columns to be considered. If None,all columns are used

Returns:

strString in the form of ':<variable>."<column name>"': Example: :e."edge_id", :e."from", :e."to"

_default_vertex_cols() → str

Convenient method, that calls signature_from_cols() with just the vertex key column from the graph

Returns:

str'"<key_column_name>" <key_column_datatype>': Example: '"guid" INT'

_default_edge_cols() → str

Convenient method, that calls signature_from_cols() with just the following edge columns from the graph:

edge_key_column
edge_target_column
edge_source_column

Returns:

str'"<key_col>" <col_dtyp>, "<source_col>" <source_dtyp>, "<target_col>" <target_dtyp>': Example: '"edge_id" INT, "from" INT, "to" INT'

_default_vertex_select(variable: str) → str

Convenient method, that calls projection_expr_from_cols() with just the vertex key column from the graph:

Parameters:

variable:: Name of the iterator variable used in the script. Will prefix the column names.

Returns:

str':<variable>."<key_col>"': Example: ':v."guid"'

_default_edge_select(variable: str) → str

Convenient method, that calls projection_expr_from_cols() with just the following edge columns from the graph:

edge_key_column
edge_target_column
edge_source_column

Parameters:

variable:: Name of the iterator variable used in the script. Will prefix the column names.

Returns:

str':<variable>."<key_col>", :<variable>."<source_col>", :<variable>."<target_col>"': Example: ':e."edge_id", :e."from", :e."to"'

_process_parameters(arguments)

Validates and processes the parameters provided when calling execute(). The results are stored in the _template_vals dictionary. Every placeholder key value is mapped to the corresponding value according to the _graph_script_vars definition. The _template_vals dictionary is passed to the _graph_script.format() placeholder replacement.

If you need additional parameter to be added to the dictionary, simply overwrite this method in your algorithm implementation class. Make sure, you still call super()._process_parameters(arguments). You can then add additional placeholder-keys to the dictionary or modify existing values, before they get replaced in the script template.

Parameters:

arguments: kwargs: Arguments provided to execute() by the caller

Examples

>>> super()._process_parameters(arguments)
>>> self._templ_vals["my_value_in_script"] = "Replacement Text"

_validate_parameters()

This method is called after the input parameters are processed and mapped. It can be overwritten to implement specific validity checks.

Examples

>>> # Version check
>>> if int(self._graph.connection_context.hana_major_version()) < 4:
        raise EnvironmentError(
            "SAP HANA version is not compatible with this method"
        )

execute(**kwargs)

Execute the algorithm

Parameters:

kwargs:: List of keyword parameters as specified by the implementing class

Returns:

selfAlgorithmBase for command chaining