hana_ml.graph package
HANA Graph Package
The following classes and functions are available:
- class hana_ml.graph.Graph(connection_context: ConnectionContext, workspace_name: str, schema: str = None)
Bases: object
Represents a graph consisting of a vertices table and an edges table. It can be created from a set of pandas dataframes, from existing tables that are turned into a graph workspace, or from an existing graph workspace.
At runtime you can access the following attributes:
connection_context
workspace_schema
workspace_name
vertex_tbl_schema
vertex_tbl_name
vertex_key_column
vertex_key_col_dtype: DB datatype of the vertex key column
vertices_hdf: hana_ml.DataFrame of the vertices
edge_tbl_name
edge_tbl_schema
edge_key_column
edge_source_column
edge_target_column
edge_key_col_dtype: DB datatype of the edge key column
edges_hdf: hana_ml.DataFrame of the edges
- Parameters:
- connection_context : ConnectionContext
The connection to the SAP HANA system.
- schema : str
Name of the schema.
- workspace_name : str
Name that references the HANA Graph workspace.
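Example (a minimal sketch of opening an existing workspace; the host, credentials, workspace name, and schema are placeholders, not part of the API):
>>> from hana_ml import ConnectionContext
>>> from hana_ml.graph import Graph
>>> # Placeholder connection details - replace with your own system
>>> cc = ConnectionContext(address="<host>", port=443, user="<user>", password="<password>")
>>> # Open an already existing graph workspace
>>> g = Graph(connection_context=cc, workspace_name="MY_WORKSPACE", schema="MY_SCHEMA")
>>> print(g.workspace_name, g.vertex_key_column, g.edge_key_column)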
- describe() Series
Generate descriptive statistics.
Descriptive statistics include degree, density, counts (edges, vertices, self loops, triangles), whether the graph has unconnected nodes, and more.
The triangles count and the is-connected information are only available in the cloud edition; they are not available on an on-premise installation.
- Returns:
- pandas.Series
Statistics
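For example, assuming g is a Graph instance opened as in the sketch above:
>>> stats = g.describe()   # pandas.Series of graph statistics
>>> print(stats)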
- degree_distribution() DataFrame
Generate the degree distribution of the graph.
- Returns:
- pandas.DataFrame
Degree distribution
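For example, assuming g is a Graph instance as above:
>>> dist = g.degree_distribution()   # pandas.DataFrame with one row per degree
>>> print(dist.head())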
- drop(include_vertices=False, include_edges=False)
Drops the current graph workspace and all the associated procedures.
You can also specify to drop the vertices and edges tables if required.
Note: The graph instance is no longer usable afterwards.
- Parameters:
- include_vertices : bool, optional, default: False
Also drop the vertices table.
- include_edges : bool, optional, default: False
Also drop the edges table.
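A minimal sketch, assuming g is a Graph instance whose source tables you also want to remove:
>>> # Drops the workspace, its procedures, and both source tables
>>> g.drop(include_vertices=True, include_edges=True)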
- has_vertices(vertices) bool
Check whether the given list of vertices exists in the graph.
An edge case is possible where the source tables are not in sync with the workspace.
- Parameters:
- vertices : list
Vertex keys expected to be in the graph.
- Returns:
- bool
True if the vertices exist otherwise False.
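For example, assuming g is a Graph instance and "V1" and "V2" are hypothetical vertex keys:
>>> exists = g.has_vertices(["V1", "V2"])
>>> print(exists)   # True only if all keys are found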
- vertices(vertex_key=None) DataFrame
Get the table representing vertices within a graph. If a vertex_key is provided, only the matching vertex is returned.
- Parameters:
- vertex_key : optional
Vertex key expected to be in the graph.
- Returns:
- pandas.DataFrame
The dataframe is empty if no vertices are found.
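For example, assuming g is a Graph instance and "V1" is a hypothetical vertex key:
>>> all_vertices = g.vertices()               # every vertex as a pandas.DataFrame
>>> one_vertex = g.vertices(vertex_key="V1")  # empty dataframe if the key does not exist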
- edges(vertex_key=None, edge_key=None, direction='OUTGOING') DataFrame
Get the table representing edges within a graph. If a vertex_key is provided, only the edges of that vertex are returned.
- Parameters:
- vertex_key : optional
Vertex key from which to get edges.
Defaults to None.
- edge_key : optional
Edge key from which to get edges.
Defaults to None.
- direction : str, optional
OUTGOING, INCOMING, or ANY, which determines the direction of the edges returned relative to the vertex. Only applicable if vertex_key is not None.
Defaults to OUTGOING.
- Returns:
- pandas.DataFrame
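For example, assuming g is a Graph instance and "V1" is a hypothetical vertex key:
>>> all_edges = g.edges()                                      # all edges
>>> incoming = g.edges(vertex_key="V1", direction="INCOMING")  # only edges pointing to V1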
- in_edges(vertex_key) DataFrame
Get the table representing edges within a graph filtered on a vertex_key and its incoming edges.
- Parameters:
- vertex_key : str
Vertex key from which to get edges.
- Returns:
- pandas.DataFrame
- out_edges(vertex_key)
Get the table representing edges within a graph filtered on a vertex_key and its outgoing edges.
- Parameters:
- vertex_key : str
Vertex key from which to get edges.
- Returns:
- pandas.DataFrame
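For example, assuming g is a Graph instance and "V1" is a hypothetical vertex key:
>>> inbound = g.in_edges(vertex_key="V1")    # edges ending at V1
>>> outbound = g.out_edges(vertex_key="V1")  # edges starting at V1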
- source(edge_key) DataFrame
Get the vertex that is the source/from/origin/start point of an edge.
- Parameters:
- edge_key
Edge key from which to get source vertex.
- Returns:
- pandas.DataFrame
- target(edge_key) DataFrame
Get the vertex that is the target/to/destination/end point of an edge.
- Parameters:
- edge_key
Edge key from which to get the target vertex.
- Returns:
- pandas.DataFrame
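For example, assuming g is a Graph instance and "E1" is a hypothetical edge key:
>>> src = g.source(edge_key="E1")   # vertex the edge starts from
>>> tgt = g.target(edge_key="E1")   # vertex the edge points to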
- subgraph(workspace_name, schema: str = None, vertices_filter: str = None, edges_filter: str = None, force: bool = False) Graph
Creates a vertices- or edges-induced subgraph based on an SQL filter applied to the respective dataframe. The SQL filter has to be valid for the dataframe that will be filtered; otherwise you'll get a runtime exception.
You can provide either a filter for the vertices dataframe or for the edges dataframe (not both). Based on the provided filter, a new consistent graph workspace is created based on HANA DB views.
If, for example, you create an edge filter, a DB view for the edges based on this filter is created. In addition, a DB view for the vertices is created, which filters the original vertices table so that it only contains the vertices included in the filtered edges view.
Note: The view names are generated based on <workspace name>_SGE_VIEW and <workspace name>_SGV_VIEW
- Parameters:
- workspace_name : str
Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.
- schema : str
Schema name of the workspace. If this value is not provided or set to None, the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- vertices_filter : str
SQL filter clause that will be applied to the vertices dataframe.
- edges_filter : str
SQL filter clause that will be applied to the edges dataframe.
- force : bool, optional
If force is True, an existing workspace is overwritten during the creation process.
Defaults to False.
- Returns:
- Graph
A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.
Examples
>>> sg = my_graph.subgraph(
...     "sg_geo_filtered",
...     vertices_filter="\"lon_lat_GEO\".ST_Distance(ST_GeomFromWKT('POINT(-93.09230195104271 27.810864761841017)', 4326)) < 40000",
... )
>>> print(sg)
>>> sg = my_graph.subgraph(
...     "sg_test",
...     vertices_filter='"value" BETWEEN 300 AND 400'
... )
>>> print(sg)
>>> sg = my_graph.subgraph("sg_test", edges_filter='"rating" > 4')
>>> print(sg)
- hana_ml.graph.create_graph_from_dataframes(connection_context: ConnectionContext, vertices_df, vertex_key_column: str, edges_df, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', edge_key_column: str = None, object_type_as_bin: bool = False, drop_exist_tab: bool = True, allow_bigint: bool = False, force_tables: bool = True, force_workspace: bool = True, replace: bool = False, geo_cols: list = None, srid: int = 4326) Graph
Create a HANA Graph workspace based on an edges and a vertices dataframe.
Expects either HANA dataframes or pandas dataframes as input. If they are pandas dataframes, they will be transformed into hana_ml.DataFrame objects.
- Parameters:
- connection_context : ConnectionContext
The connection to the SAP HANA system.
- vertices_df : pandas.DataFrame or hana_ml.DataFrame
Table of data containing vertices and their keys that correspond with the edge frame.
- edges_df : pandas.DataFrame or hana_ml.DataFrame
Table of data containing edges that link keys within the vertex frame.
- workspace_name : str
Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.
- schema : str
Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- edge_source_column : str
Column name in the edges dataframe containing the source vertex keys, which must exist within the vertex_key_column of the vertices dataframe.
Defaults to 'from'.
- edge_target_column : str
Column name in the edges dataframe containing the target vertex keys, which must exist within the vertex_key_column of the vertices dataframe.
Defaults to 'to'.
- edge_key_column : str
Column name in the edges dataframe containing the unique identifier of each edge.
Defaults to None.
- vertex_key_column : str
Column name in the vertices dataframe containing the vertex key which uniquely identifies each vertex.
- object_type_as_bin : bool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- drop_exist_tab : bool, optional
Determines how an existing table is handled when force_tables is True: the table is dropped when drop_exist_tab is True and truncated when it is False.
Defaults to True.
- allow_bigint : bool, optional
Decides whether int64 is mapped into INT or BIGINT in SAP HANA.
Defaults to False.
- force_tables : bool, optional
If True, existing SAP HANA tables for vertices and edges are truncated or dropped (see drop_exist_tab).
Defaults to True.
- force_workspace : bool, optional
If True, an existing workspace is overwritten during the creation process.
Defaults to True.
- replace : bool, optional
If True, the SAP HANA table performs missing value handling.
Defaults to False.
- geo_cols : list, optional but required for spatial functions with pandas dataframes
Specifies the columns of the pandas dataframe which are treated as geometries. List elements can be either strings or tuples.
The geo_cols are tested against the columns in the vertices and edges dataframes and, depending on where they exist, are assigned to the according table. geo_cols that don't exist in either dataframe are ignored. The srid applies to both dataframes.
If you need more deliberate control, consider transforming the pandas dataframes to HANA dataframes first with
create_dataframe_from_pandas()
where you can control the transformation in detail.
Strings represent columns which contain geometries in (E)WKT format. If the provided dataframe is a GeoPandas dataframe, you do not need to add the geometry column to the geo_cols; it will be detected and added automatically.
The column name in the HANA table will be <column_name>_GEO.
Tuples must consist of two strings: (<longitude column>, <latitude column>)
longitude column: dataframe column that contains the longitude values
latitude column: dataframe column that contains the latitude values
They will be combined into a POINT(<longitude> <latitude>) geometry.
The column name in the HANA table will be <longitude>_<latitude>_GEO.
Defaults to None.
- srid : int, optional but required for spatial functions with pandas dataframes
Spatial reference system id.
Defaults to 4326.
- Returns:
- Graph
A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.
Examples
>>> v_pdf = pd.read_csv("nodes.csv")
>>> e_pdf = pd.read_csv("edges.csv")
>>> hg = create_graph_from_dataframes(
...     self._connection_context,
...     vertices_df=v_pdf,
...     edges_df=e_pdf,
...     workspace_name="test_factory_ws",
...     vertex_key_column="guid",
...     geo_cols=[("lon", "lat")],
...     force_tables=True,
...     force_workspace=True)
>>> print(hg)
- hana_ml.graph.create_graph_from_edges_dataframe(connection_context: ConnectionContext, edges_df, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', edge_key_column: str = None, object_type_as_bin: bool = False, drop_exist_tab: bool = True, allow_bigint: bool = False, force_tables: bool = True, force_workspace: bool = True, replace: bool = False, geo_cols: list = None, srid: int = 4326) Graph
Create a HANA Graph workspace based on an edges dataframe. The respective vertices table is created implicitly based on the from and to columns of the edges.
Expects either a HANA dataframe or a pandas dataframe as input for the edges table. If it is a pandas dataframe, it will be transformed into a hana_ml.DataFrame.
- Parameters:
- connection_context : ConnectionContext
The connection to the SAP HANA system.
- edges_df : pandas.DataFrame or hana_ml.DataFrame
Table of data containing edges that link keys within the vertex frame.
- workspace_name : str
Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.
- schema : str
Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- edge_source_column : str
Column name in the edges dataframe containing the source vertex keys.
Defaults to 'from'.
- edge_target_column : str
Column name in the edges dataframe containing the target vertex keys.
Defaults to 'to'.
- edge_key_column : str
Column name in the edges dataframe containing the unique identifier of each edge.
Defaults to None.
- object_type_as_bin : bool, optional
If True, the object type will be considered CLOB in SAP HANA.
Defaults to False.
- drop_exist_tab : bool, optional
Determines how an existing table is handled when force_tables is True: the table is dropped when drop_exist_tab is True and truncated when it is False.
Defaults to True.
- allow_bigint : bool, optional
Decides whether int64 is mapped into INT or BIGINT in SAP HANA.
Defaults to False.
- force_tables : bool, optional
If True, existing SAP HANA tables for vertices and edges are truncated or dropped (see drop_exist_tab).
Defaults to True.
- force_workspace : bool, optional
If True, an existing workspace is overwritten during the creation process.
Defaults to True.
- replace : bool, optional
If True, the SAP HANA table performs missing value handling.
Defaults to False.
- geo_cols : list, optional but required for spatial functions with pandas dataframes
Specifies the columns of the pandas dataframe which are treated as geometries. List elements can be either strings or tuples.
The geo_cols are tested against the columns in the edges dataframe; geo_cols that don't exist in the dataframe are ignored. The srid applies to all geometry columns.
If you need more deliberate control, consider transforming the pandas dataframe to a HANA dataframe first with
create_dataframe_from_pandas()
where you can control the transformation in detail.
Strings represent columns which contain geometries in (E)WKT format. If the provided dataframe is a GeoPandas dataframe, you do not need to add the geometry column to the geo_cols; it will be detected and added automatically.
The column name in the HANA table will be <column_name>_GEO.
Tuples must consist of two strings: (<longitude column>, <latitude column>)
longitude column: dataframe column that contains the longitude values
latitude column: dataframe column that contains the latitude values
They will be combined into a POINT(<longitude> <latitude>) geometry.
The column name in the HANA table will be <longitude>_<latitude>_GEO.
Defaults to None.
- srid : int, optional but required for spatial functions with pandas dataframes
Spatial reference system id.
Defaults to 4326.
- Returns:
- Graph
A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.
Examples
>>> e_pdf = pd.read_csv(self.e_path)
>>> hg = create_graph_from_edges_dataframe(
...     connection_context=self._connection_context,
...     edges_df=e_pdf,
...     workspace_name="factory_ws",
...     edge_source_column="from",
...     edge_target_column="to",
...     edge_key_column="edge_id",
...     drop_exist_tab=True,
...     force_tables=True,
...     force_workspace=True)
>>> print(hg)
- hana_ml.graph.create_graph_from_hana_dataframes(connection_context: ConnectionContext, vertices_df: DataFrame, vertex_key_column: str, edges_df: DataFrame, edge_key_column: str, workspace_name: str, schema: str = None, edge_source_column: str = 'from', edge_target_column: str = 'to', force: bool = False) Graph
Creates a graph workspace based on HANA DataFrames. This method can be used if you need features that are not provided by
create_graph_from_dataframes()
(e.g. setting a chunk_size when transferring a pandas DataFrame to a HANA DataFrame, which is not offered in create_graph_from_dataframes()).
Based on the input dataframes the following logic applies for creating/selecting the source catalog objects from HANA for the graph workspace:
- If both dataframes are based on database tables, both have a valid key column, and the source and target columns in the edges table are not nullable, then the graph workspace is based on the tables directly.
- If one of the tables does not fulfill the above criteria, or if at least one of the dataframes is based on a view or a SQL statement, respective views (on a table or an SQL view) are generated, which are used as a base for the graph workspace.
- Parameters:
- connection_context : ConnectionContext
The connection to the SAP HANA system.
- vertices_df : hana_ml.DataFrame
Table of data containing vertices and their keys that correspond with the edge frame.
- vertex_key_column : str
Column name in the vertices dataframe containing the vertex key which uniquely identifies each vertex.
- edges_df : hana_ml.DataFrame
Table of data containing edges that link keys within the vertex frame.
- edge_key_column : str
Column name in the edges dataframe containing the unique identifier of each edge.
- workspace_name : str
Name of the workspace expected in the SAP HANA Graph workspaces of the ConnectionContext.
- schema : str
Schema name of the workspace. If this value is not provided or set to None, then the value defaults to the ConnectionContext's current schema.
Defaults to the current schema.
- edge_source_column : str
Column name in the edges dataframe containing the source vertex keys, which must exist within the vertex_key_column of the vertices dataframe.
Defaults to 'from'.
- edge_target_column : str
Column name in the edges dataframe containing the target vertex keys, which must exist within the vertex_key_column of the vertices dataframe.
Defaults to 'to'.
- force : bool, optional
If force is True, an existing workspace is overwritten during the creation process.
Defaults to False.
- Returns:
- Graph
A virtual HANA Graph with functions inherited from the individual vertex and edge HANA Dataframes.
Examples
>>> v_df = create_dataframe_from_pandas(
...     connection_context=connection_context,
...     pandas_df=pd.read_csv('nodes.csv'),
...     table_name="factory_test_table_vertices",
...     force=True,
...     primary_key="guid")
>>> e_df = create_dataframe_from_pandas(
...     connection_context=connection_context,
...     pandas_df=pd.read_csv('edges.csv'),
...     table_name="factory_test_table_edges",
...     force=True,
...     primary_key="edge_id",
...     not_nulls=["from", "to"])
>>> hg = create_graph_from_hana_dataframes(
...     connection_context=connection_context,
...     vertices_df=v_df,
...     vertex_key_column="guid",
...     edges_df=e_df,
...     edge_key_column="edge_id",
...     workspace_name="test_factory_ws",
...     force=True)
>>> print(hg)
- hana_ml.graph.discover_graph_workspace(connection_context: ConnectionContext, workspace_name: str, schema: str = None)
Provides detailed information about a specific Graph Workspace. The function returns the following per GWS:
SCHEMA_NAME, WORKSPACE_NAME, EDGE_SCHEMA_NAME, EDGE_TABLE_NAME, EDGE_SOURCE_COLUMN_NAME, EDGE_TARGET_COLUMN_NAME, EDGE_KEY_COLUMN_NAME, VERTEX_SCHEMA_NAME, VERTEX_TABLE_NAME, VERTEX_KEY_COLUMN_NAME.
- Parameters:
- connection_context : ConnectionContext
Connection to the given SAP HANA Database and implied Graph Workspace.
- workspace_name : str
Workspace name to be discovered.
- schema : str
Schema of the workspace. If none is provided, the schema of the connection_context is used.
- Returns:
- dict
Dictionary with the workspace attributes
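Example (a minimal sketch; cc is a ConnectionContext as in the earlier sketch, "MY_WORKSPACE" is a placeholder, and the dictionary keys are assumed to match the attribute names listed above):
>>> from hana_ml.graph import discover_graph_workspace
>>> info = discover_graph_workspace(cc, workspace_name="MY_WORKSPACE")
>>> print(info["EDGE_TABLE_NAME"], info["VERTEX_TABLE_NAME"])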
- hana_ml.graph.discover_graph_workspaces(connection_context: ConnectionContext)
Provides a view of the Graph Workspaces (GWS) on a given connection to SAP HANA. This provides the basis for creating a HANA graph from existing GWS instead of only creating them from vertex and edge tables. It uses the SYS SQL provided for Graph Workspaces, so that a user can create a HANA graph from one of them. The SQL returns the following per GWS:
SCHEMA_NAME, WORKSPACE_NAME, CREATE_TIMESTAMP, USER_NAME, EDGE_SCHEMA_NAME, EDGE_TABLE_NAME, EDGE_SOURCE_COLUMN_NAME, EDGE_TARGET_COLUMN_NAME, EDGE_KEY_COLUMN_NAME, VERTEX_SCHEMA_NAME, VERTEX_TABLE_NAME, VERTEX_KEY_COLUMN_NAME, IS_VALID.
Due to the differences between Cloud and On-Premise Graph workspaces, the SQL creation requires different methods to derive the same summary pattern for GWS as defined above. For this reason, two internal functions return the summary.
- Parameters:
- connection_context : ConnectionContext
Connection to the given SAP HANA Database and implied Graph Workspace.
- Returns:
- list
The rows returned by fetchall, converted to dictionaries with the column headers as keys, one per workspace.
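Example (a minimal sketch; cc is a ConnectionContext as in the earlier sketch, and the dictionary keys are assumed to match the column names listed above):
>>> from hana_ml.graph import discover_graph_workspaces
>>> for ws in discover_graph_workspaces(cc):
...     print(ws["SCHEMA_NAME"], ws["WORKSPACE_NAME"])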