hana_ml.graph package

SAP HANA Graph Package

The following classes and functions are available:

class hana_ml.graph.Graph(connection_context: ConnectionContext, workspace_name: str, schema: str = None)

Bases: object

Represents a graph consisting of a vertices table and an edges table. It can be created from a set of pandas dataframes, from existing tables that are turned into a graph workspace, or from an existing graph workspace.

At runtime you can access the following attributes:

  • connection_context

  • workspace_schema

  • workspace_name

  • vertex_tbl_schema

  • vertex_tbl_name

  • vertex_key_column

  • vertex_key_col_dtype: DB datatype of the vertex key column

  • vertices_hdf: hana_ml.DataFrame of the vertices

  • edge_tbl_name

  • edge_tbl_schema

  • edge_key_column

  • edge_source_column

  • edge_target_column

  • edge_key_col_dtype: DB datatype of the edge key column

  • edges_hdf: hana_ml.DataFrame of the edges

Parameters:
connection_context : ConnectionContext

The connection to the SAP HANA system.

schema : str

Name of the schema.

workspace_name : str

Name that references the HANA Graph workspace.
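
Examples

A minimal usage sketch (the ConnectionContext cc, the workspace name "MY_GRAPH_WS", and the schema "MY_SCHEMA" are illustrative placeholders for an already existing graph workspace):

>>> from hana_ml.graph import Graph
>>> g = Graph(
        connection_context=cc,
        workspace_name="MY_GRAPH_WS",
        schema="MY_SCHEMA")
>>> print(g.vertex_key_column, g.edge_source_column, g.edge_target_column)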

describe() → Series

Generate descriptive statistics.

Descriptive statistics include degree, density, counts (edges, vertices, self loops, triangles), whether the graph has unconnected nodes...

The triangle count and the is-connected information are only available in the cloud edition. This information is not available on an on-premise installation.

Returns:
pandas.Series

Statistics
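
Example (illustrative sketch; g is assumed to be an existing Graph instance):

>>> stats = g.describe()
>>> print(stats)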

degree_distribution() → DataFrame

Generate the degree distribution of the graph.

Returns:
pandas.DataFrame

Degree distribution
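
Example (illustrative sketch; g is assumed to be an existing Graph instance):

>>> dist = g.degree_distribution()
>>> print(dist)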

drop(include_vertices=False, include_edges=False)

Drops the current graph workspace and all the associated procedures.

You can also choose to drop the vertices and edges tables if required.

Note: The Graph instance is no longer usable afterwards.

Parameters:
include_vertices : bool, optional

Also drop the vertices table.

Defaults to False.

include_edges : bool, optional

Also drop the edges table.

Defaults to False.
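
Example (illustrative sketch; g is assumed to be an existing Graph instance that is no longer needed):

>>> g.drop(include_vertices=True, include_edges=True)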

has_vertices(vertices) → bool

Check if the given list of vertices is in the graph.

An edge case is possible where the source tables are not up to date with the workspace.

Parameters:
vertices : list

Vertex keys expected to be in the graph.

Returns:
bool

True if the vertices exist, otherwise False.
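
Example (illustrative sketch; g is an existing Graph instance and "V1", "V2" are placeholder vertex keys):

>>> g.has_vertices(["V1", "V2"])
True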

vertices(vertex_key=None) → DataFrame

Get the table representing vertices within a graph. If a vertex_key is provided, only the matching vertex is returned.

Parameters:
vertex_key : optional

Vertex key expected to be in the graph.

Returns:
pandas.DataFrame

The DataFrame is empty if no vertices are found.
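
Example (illustrative sketch; g is an existing Graph instance and "V1" is a placeholder vertex key):

>>> all_vertices = g.vertices()
>>> single_vertex = g.vertices(vertex_key="V1")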

edges(vertex_key=None, edge_key=None, direction='OUTGOING') → DataFrame

Get the table representing edges within a graph. If a vertex_key is provided, only the edges of that vertex are returned.

Parameters:
vertex_key : optional

Vertex key from which to get edges.

Defaults to None.

edge_key : optional

Edge key from which to get edges.

Defaults to None.

direction : str, optional

OUTGOING, INCOMING, or ANY which determines the algorithm results. Only applicable if vertex_key is not None.

Defaults to OUTGOING.

Returns:
pandas.DataFrame
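
Example (illustrative sketch; g is an existing Graph instance and "V1" is a placeholder vertex key):

>>> all_edges = g.edges()
>>> incoming = g.edges(vertex_key="V1", direction="INCOMING")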

in_edges(vertex_key) → DataFrame

Get the table representing edges within a graph filtered on a vertex_key and its incoming edges.

Parameters:
vertex_key : str

Vertex key from which to get edges.

Returns:
pandas.DataFrame

out_edges(vertex_key)

Get the table representing edges within a graph filtered on a vertex_key and its outgoing edges.

Parameters:
vertex_key : str

Vertex key from which to get edges.

Returns:
pandas.DataFrame
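
Example (illustrative sketch; g is an existing Graph instance and "V1" is a placeholder vertex key):

>>> incoming_edges = g.in_edges("V1")
>>> outgoing_edges = g.out_edges("V1")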

source(edge_key) → DataFrame

Get the vertex that is the source/from/origin/start point of an edge.

Parameters:
edge_key

Edge key from which to get source vertex.

Returns:
pandas.DataFrame

target(edge_key) → DataFrame

Get the vertex that is the target/to/destination/end point of an edge.

Parameters:
edge_key

Edge key from which to get the target vertex.

Returns:
pandas.DataFrame
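
Example (illustrative sketch; g is an existing Graph instance and "E1" is a placeholder edge key):

>>> source_vertex = g.source("E1")
>>> target_vertex = g.target("E1")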

subgraph(workspace_name, schema: str = None, vertices_filter: str = None, edges_filter: str = None, force: bool = False) → Graph

Creates a vertex- or edge-induced subgraph based on an SQL filter applied to the respective dataframe. The SQL filter has to be valid for the dataframe that will be filtered, otherwise you get a runtime exception.

You can provide either a filter to the vertices dataframe or to the edges dataframe (not both). Based on the provided filter, a new consistent graph workspace is created based on HANA DB views.

If, for example, you create an edge filter, a DB view for the edges based on this filter is created. In addition, a DB view for the vertices is created, which filters the original vertices table so that it only contains the vertices included in the filtered edges view.

Note: The view names are generated based on <workspace name>_SGE_VIEW and <workspace name>_SGV_VIEW

Parameters:
workspace_name : str

Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.

schema : str

Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

vertices_filter : str, optional

SQL filter clause that will be applied to the vertices dataframe.

edges_filter : str, optional

SQL filter clause that will be applied to the edges dataframe.

force : bool, optional

If force is True, then an existing workspace is overwritten during the creation process.

Defaults to False.

Returns:
Graph

A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.

Examples

>>> sg = my_graph.subgraph(
        "sg_geo_filtered",
        vertices_filter="\"lon_lat_GEO\".ST_Distance(ST_GeomFromWKT('POINT(-93.09230195104271 27.810864761841017)', 4326)) < 40000",
    )
>>> print(sg)
>>> sg = my_graph.subgraph(
        "sg_test", vertices_filter='"value" BETWEEN 300 AND 400'
    )
>>> print(sg)
>>> sg = my_graph.subgraph("sg_test", edges_filter='"rating" > 4')
>>> print(sg)
hana_ml.graph.create_graph_from_dataframes(connection_context: ConnectionContext, vertices_df, vertex_key_column: str, edges_df, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', edge_key_column: str = None, object_type_as_bin: bool = False, drop_exist_tab: bool = True, allow_bigint: bool = False, force_tables: bool = True, force_workspace: bool = True, replace: bool = False, geo_cols: list = None, srid: int = 4326) → Graph

Create a HANA Graph workspace based on a vertices and an edges dataframe.

Expects either HANA dataframes or pandas dataframes as input. If they are pandas dataframes, they are transformed into hana_ml.DataFrame objects.

Parameters:
connection_context : ConnectionContext

The connection to the SAP HANA system.

vertices_df : pandas.DataFrame or hana_ml.DataFrame

Table of data containing vertices and their keys that correspond with the edge frame.

edges_df : pandas.DataFrame or hana_ml.DataFrame

Table of data containing edges that link keys within the vertex frame.

workspace_name : str

Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.

schema : str

Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

edge_source_column : str

Column name in edges_df containing the source vertex keys, which exist within the vertex_key_column of vertices_df.

Defaults to 'from'.

edge_target_column : str

Column name in edges_df containing the target vertex keys, which exist within the vertex_key_column of vertices_df.

Defaults to 'to'.

edge_key_column : str

Column name in edges_df containing the unique identifier of each edge.

Defaults to None.

vertex_key_column : str

Column name in vertices_df containing the vertex key which uniquely identifies each vertex.

object_type_as_bin : bool, optional

If True, the object type will be considered CLOB in SAP HANA.

Defaults to False.

drop_exist_tab : bool, optional

If force_tables is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.

Defaults to True.

allow_bigint : bool, optional

allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.

Defaults to False.

force_tables : bool, optional

If force_tables is True, then the SAP HANA tables for vertices and edges are truncated or dropped.

Defaults to True.

force_workspace : bool, optional

If force_workspace is True, then an existing workspace is overwritten during the creation process.

Defaults to True.

replace : bool, optional

If replace is True, then the SAP HANA table performs missing value handling.

Defaults to False.

geo_cols : list, optional but required for spatial functions with Pandas dataframes

Specifies the columns of the Pandas dataframe which are treated as geometries. List elements can be either strings or tuples.

The geo_cols are checked against the columns of the vertices and edges dataframes and assigned to the corresponding table. geo_cols that don't exist in either dataframe are ignored. The srid applies to both dataframes.

If you need more deliberate control, consider transforming the Pandas dataframes to HANA dataframes first with create_dataframe_from_pandas(), where you can control the transformation in detail.

Strings represent columns which contain geometries in (E)WKT format. If the provided DataFrame is a GeoPandas DataFrame, you do not need to add the geometry column to the geo_cols; it is detected and added automatically.

The column name in the HANA table will be <column_name>_GEO.

Tuples must consist of two strings: (<longitude column>, <latitude column>)

longitude column: Dataframe column that contains the longitude values

latitude column: Dataframe column that contains the latitude values

They will be combined into a POINT(<longitude> <latitude>) geometry.

The column name in the HANA table will be <longitude>_<latitude>_GEO.

Defaults to None.

srid : int, optional but required for spatial functions with Pandas dataframes

Spatial reference system id.

Defaults to 4326.

Returns:
Graph

A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.

Examples

>>> v_pdf = pd.read_csv("nodes.csv")
>>> e_pdf = pd.read_csv("edges.csv")
>>> hg = create_graph_from_dataframes(
        connection_context,
        vertices_df=v_pdf,
        edges_df=e_pdf,
        workspace_name="test_factory_ws",
        vertex_key_column="guid",
        geo_cols=[("lon", "lat")],
        force_tables=True,
        force_workspace=True)
>>> print(hg)
hana_ml.graph.create_graph_from_edges_dataframe(connection_context: ConnectionContext, edges_df, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', edge_key_column: str = None, object_type_as_bin: bool = False, drop_exist_tab: bool = True, allow_bigint: bool = False, force_tables: bool = True, force_workspace: bool = True, replace: bool = False, geo_cols: list = None, srid: int = 4326) → Graph

Create a HANA Graph workspace based on an edge dataframe. The respective vertices table is created implicitly based on the from and to columns of the edges.

Expects either a HANA dataframe or a pandas dataframe as input for the edges table. If it is a pandas dataframe, it is transformed into a hana_ml.DataFrame.

Parameters:
connection_context : ConnectionContext

The connection to the SAP HANA system.

edges_df : pandas.DataFrame or hana_ml.DataFrame

Table of data containing edges that link keys within the vertex frame.

workspace_name : str

Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.

schema : str

Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

edge_source_column : str

Column name in edges_df containing the source vertex keys.

Defaults to 'from'.

edge_target_column : str

Column name in edges_df containing the target vertex keys.

Defaults to 'to'.

edge_key_column : str

Column name in edges_df containing the unique identifier of each edge.

Defaults to None.

object_type_as_bin : bool, optional

If True, the object type will be considered CLOB in SAP HANA.

Defaults to False.

drop_exist_tab : bool, optional

If force_tables is True, drop the existing table when drop_exist_tab is True and truncate the existing table when it is False.

Defaults to True.

allow_bigint : bool, optional

allow_bigint decides whether int64 is mapped into INT or BIGINT in HANA.

Defaults to False.

force_tables : bool, optional

If force_tables is True, then the SAP HANA tables for vertices and edges are truncated or dropped.

Defaults to True.

force_workspace : bool, optional

If force_workspace is True, then an existing workspace is overwritten during the creation process.

Defaults to True.

replace : bool, optional

If replace is True, then the SAP HANA table performs missing value handling.

Defaults to False.

geo_cols : list, optional but required for spatial functions with Pandas dataframes

Specifies the columns of the Pandas dataframe which are treated as geometries. List elements can be either strings or tuples.

The geo_cols are checked against the columns of the vertices and edges dataframes and assigned to the corresponding table. geo_cols that don't exist in either dataframe are ignored. The srid applies to both dataframes.

If you need more deliberate control, consider transforming the Pandas dataframes to HANA dataframes first with create_dataframe_from_pandas(), where you can control the transformation in detail.

Strings represent columns which contain geometries in (E)WKT format. If the provided DataFrame is a GeoPandas DataFrame, you do not need to add the geometry column to the geo_cols; it is detected and added automatically.

The column name in the HANA table will be <column_name>_GEO.

Tuples must consist of two strings: (<longitude column>, <latitude column>)

longitude column: Dataframe column that contains the longitude values

latitude column: Dataframe column that contains the latitude values

They will be combined into a POINT(<longitude> <latitude>) geometry.

The column name in the HANA table will be <longitude>_<latitude>_GEO.

Defaults to None.

srid : int, optional but required for spatial functions with Pandas dataframes

Spatial reference system id.

Defaults to 4326.

Returns:
Graph

A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.

Examples

>>> e_pdf = pd.read_csv("edges.csv")
>>> hg = create_graph_from_edges_dataframe(
        connection_context=connection_context,
        edges_df=e_pdf,
        workspace_name="factory_ws",
        edge_source_column="from",
        edge_target_column="to",
        edge_key_column="edge_id",
        drop_exist_tab=True,
        force_tables=True,
        force_workspace=True)
>>> print(hg)
hana_ml.graph.create_graph_from_hana_dataframes(connection_context: ConnectionContext, vertices_df: DataFrame, vertex_key_column: str, edges_df: DataFrame, edge_key_column: str, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', force: bool = False) → Graph

Creates a graph workspace based on HANA DataFrames. This method can be used if features are required that are not provided by create_graph_from_dataframes(), e.g. setting a chunk_size when transferring the Pandas DataFrame to a HANA DataFrame, which is not offered in create_graph_from_dataframes().

Based on the input dataframes the following logic applies for creating/selecting the source catalog objects from HANA for the graph workspace:

  • If both dataframes are based on database tables, both have a valid key column, and the source and target columns in the edges table are not nullable, then the graph workspace is based on the tables directly

  • If one of the tables does not fulfill the above criteria, or if at least one of the dataframes is based on a view or an SQL statement, respective views (on a table or an SQL view) are generated, which are used as the basis for the graph workspace

Parameters:
connection_context : ConnectionContext

The connection to the SAP HANA system.

vertices_df : hana_ml.DataFrame

Table of data containing vertices and their keys that correspond with the edge frame.

vertex_key_column : str

Column name in vertices_df containing the vertex key which uniquely identifies each vertex.

edges_df : hana_ml.DataFrame

Table of data containing edges that link keys within the vertex frame.

edge_key_column : str

Column name in edges_df containing the unique identifier of each edge.

workspace_name : str

Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.

schema : str

Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.

Defaults to the current schema.

edge_source_column : str

Column name in edges_df containing the source vertex keys, which exist within the vertex_key_column of vertices_df.

Defaults to 'from'.

edge_target_column : str

Column name in edges_df containing the target vertex keys, which exist within the vertex_key_column of vertices_df.

Defaults to 'to'.

force : bool, optional

If force is True, then an existing workspace is overwritten during the creation process.

Defaults to False.

Returns:
Graph

A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.

Examples

>>> v_df = create_dataframe_from_pandas(
        connection_context=connection_context,
        pandas_df=pd.read_csv('nodes.csv'),
        table_name="factory_test_table_vertices",
        force=True,
        primary_key="guid")
>>> e_df = create_dataframe_from_pandas(
        connection_context=connection_context,
        pandas_df=pd.read_csv('edges.csv'),
        table_name="factory_test_table_edges",
        force=True,
        primary_key="edge_id",
        not_nulls=["from", "to"])
>>> hg = create_graph_from_hana_dataframes(
        connection_context=connection_context,
        vertices_df=v_df,
        vertex_key_column="guid",
        edges_df=e_df,
        edge_key_column="edge_id",
        workspace_name="test_factory_ws",
        force=True)
>>> print(hg)
hana_ml.graph.discover_graph_workspace(connection_context: ConnectionContext, workspace_name: str, schema: str = None)

Provide detailed information about a specific Graph Workspace. The function returns the following per GWS:

SCHEMA_NAME, WORKSPACE_NAME, EDGE_SCHEMA_NAME, EDGE_TABLE_NAME, EDGE_SOURCE_COLUMN_NAME, EDGE_TARGET_COLUMN_NAME, EDGE_KEY_COLUMN_NAME, VERTEX_SCHEMA_NAME, VERTEX_TABLE_NAME, VERTEX_KEY_COLUMN_NAME.

Parameters:
connection_context : ConnectionContext

Connection to the given SAP HANA Database and implied Graph Workspace.

workspace_name : str

Workspace name to be discovered.

schema : str

Schema of the workspace. If none is provided, the schema of the connection_context is used.

Returns:
dict

Dictionary with the workspace attributes
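
Example (illustrative sketch; cc is a ConnectionContext, "MY_GRAPH_WS" is a placeholder workspace name, and the dictionary keys are assumed to follow the attribute names listed above):

>>> ws = discover_graph_workspace(cc, workspace_name="MY_GRAPH_WS")
>>> print(ws["EDGE_TABLE_NAME"], ws["VERTEX_TABLE_NAME"])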

hana_ml.graph.discover_graph_workspaces(connection_context: ConnectionContext)

Provide a view of the Graph Workspaces (GWS) on a given connection to SAP HANA. This provides the basis for creating a HANA graph from existing GWS instead of only creating them from vertex and edge tables. The function uses the SYS SQL provided for Graph Workspaces, so a user can create a HANA graph from one of them. The SQL returns the following per GWS:

SCHEMA_NAME, WORKSPACE_NAME, CREATE_TIMESTAMP, USER_NAME, EDGE_SCHEMA_NAME, EDGE_TABLE_NAME, EDGE_SOURCE_COLUMN_NAME, EDGE_TARGET_COLUMN_NAME, EDGE_KEY_COLUMN_NAME, VERTEX_SCHEMA_NAME, VERTEX_TABLE_NAME, VERTEX_KEY_COLUMN_NAME, IS_VALID.

Due to the differences between Cloud and On-Prem Graph workspaces, different SQL statements are required to derive the same summary pattern for GWS as defined above. For this reason, two internal functions return the summary.

Parameters:
connection_contextConnectionContext

Connection to the given SAP HANA Database and implied Graph Workspace.

Returns:
list

The list of tuples returned by fetchall, but with headers included, i.e. as a list of dicts.
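
Example (illustrative sketch; cc is a ConnectionContext, and each list entry is assumed to expose the column headers listed above as dictionary keys):

>>> for workspace in discover_graph_workspaces(cc):
        print(workspace["SCHEMA_NAME"], workspace["WORKSPACE_NAME"])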