gen_ai_hub.evaluations.models package

Submodules

gen_ai_hub.evaluations.models.artifact_source module

class ArtifactSource

Bases: object

Extends an artifact object with a relative path inside the artifact, for use in an EvaluationConfig.

Example usage:

>>> ArtifactSource(
...     artifact={
...         "id": "xyfz-rtyu-2456-ojns-yu6s",
...         "name": "dataset-artifact",
...         "url": "ai://default/eval_dataset"
...     },
...     path="rootfolder/data.csv",
...     file_type="csv"
... )
>>> ArtifactSource(
...     artifact="xyfz-rtyu-2456-ojns-yu6s",
...     path="rootfolder/data.json",
...     file_type="json"
... )
__init__(file_type, artifact, path=None)
Parameters:
  • file_type (Literal['csv', 'json', 'jsonl']) -- One of the supported file types

  • artifact (str | Artifact) -- Either the artifact ID as a string or an Artifact object from the AI_API_Client SDK

  • path (str | None) -- Relative path within the artifact; must point to a single file

gen_ai_hub.evaluations.models.dataset_config module

class Dataset

Bases: object

Dataset object for the evaluations flow.

The Dataset class accepts various source types for evaluation datasets including local file paths (as strings or Path objects) or AI Core artifacts.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

Examples:

Using a Path object:

>>> Dataset(Path("data/sample.json"))

Using a string path:

>>> Dataset("data/sample.json")

Using an ArtifactSource with artifact dictionary:

>>> Dataset(
...     ArtifactSource(
...         artifact={
...             "id": "xyfz-rtyu-2456-ojns-yu6s",
...             "name": "dataset-artifact",
...             "url": "ai://default/eval_dataset"
...         },
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )

Using an ArtifactSource with artifact ID:

>>> Dataset(
...     ArtifactSource(
...         artifact="xyfz-rtyu-2456-ojns-yu6s",
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )
__init__(source)

Initialize a Dataset instance.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

property file_type: str | None

Infer the file type from the source.

For ArtifactSource, returns the explicitly set file_type. For file paths, infers the type from the file extension.

Returns:

File type (e.g., "json", "jsonl", "csv") or None if it cannot be determined

Return type:

Optional[str]
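
For example, assuming the extension-based inference described above:

>>> Dataset("data/sample.jsonl").file_type
'jsonl'
>>> Dataset(Path("data/sample.csv")).file_type
'csv'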

gen_ai_hub.evaluations.models.evaluation_config module

class EvaluationConfig

Bases: object

Defines the evaluation configuration object for the Evaluations flow.

This class encapsulates all configuration parameters needed to run an evaluation job, including the model/template configuration, dataset, metrics, and execution settings.

At least one of the following must be provided:

  • llm and template combination (using orchestration_v2 models)

  • orchestration_registry_reference (UUID of a registered orchestration configuration)

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object specifying the evaluation dataset

  • metrics (List[MetricConfig]) -- List of metric configurations for evaluation

  • llm (Optional[LLM]) -- LLM configuration from orchestration_v2 (LLMModelDetails)

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template as string, PromptTemplateSpec, or TemplateRef

  • orchestration_registry_reference (Optional[str]) -- UUID of registered orchestration configuration

  • template_variable_mapping (Optional[dict]) -- Variable mapping for the prompt template

  • test_row_count (Optional[int]) -- Number of rows to sample from dataset (-1 for all rows), defaults to -1

  • repetitions (Optional[int]) -- Number of times to repeat evaluation over the dataset, defaults to 1

  • tags (Optional[dict]) -- User-defined metadata as key-value pairs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logs in hyperscaler output path, defaults to False

Note

This module uses orchestration_v2 models directly.

Example using TemplateRef with ID:

>>> from gen_ai_hub.evaluations.models import EvaluationConfig, Dataset, MetricConfig
>>> from gen_ai_hub.orchestration_v2.models.llm_model_details import LLMModelDetails as LLM
>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRef, TemplateRefByID
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest"),
...     template=TemplateRef(template_ref=TemplateRefByID(id="template-id-here")),
...     test_row_count=100
... )

Example using TemplateRef with scenario/name/version:

>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRefByScenarioNameVersion
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest", params={"temperature": 0.7}),
...     template=TemplateRef(template_ref=TemplateRefByScenarioNameVersion(
...         scenario="foundation-models", name="prompt1", version="1.0"
...     )),
...     test_row_count=100
... )
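
Example using an orchestration_registry_reference instead of an llm/template pair (a sketch only; the UUID below is a placeholder and the imports from the examples above are assumed):

>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     orchestration_registry_reference="00000000-0000-0000-0000-000000000000",
...     test_row_count=100
... )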
__init__(dataset_config, metrics, llm=None, template=None, orchestration_registry_reference=None, template_variable_mapping=None, test_row_count=-1, repetitions=1, tags='{}', debug_mode=False)

Initialize an EvaluationConfig instance.

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object

  • metrics (List[MetricConfig]) -- List of metric configurations

  • llm (Optional[LLM]) -- LLM object from orchestration_v2 (LLMModelDetails), defaults to None

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template (string, PromptTemplateSpec, or TemplateRef), defaults to None

  • orchestration_registry_reference (Optional[str]) -- UUID of orchestration config, defaults to None

  • template_variable_mapping (Optional[dict]) -- Variable mapping for prompt template, defaults to None

  • test_row_count (Optional[int]) -- Number of dataset rows to sample (-1 for all), defaults to -1

  • repetitions (Optional[int]) -- Number of evaluation repetitions (minimum: 1), defaults to 1

  • tags (Optional[dict]) -- Key-value metadata pairs applied to all runs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logging, defaults to False

Raises:

ValueError -- If neither (llm, template) nor orchestration_registry_reference is provided

gen_ai_hub.evaluations.models.evaluation_run module

class EvaluationRun

Bases: object

Represents an individual EvaluationRun object and its associated context.

Parameters:
  • run_id (str) -- Unique identifier for the evaluation run

  • execution_id (str) -- ID of the AI Core execution

  • ai_core_client (AICoreV2Client) -- AI Core client instance

  • configuration_id (str) -- ID of the configuration, defaults to None

  • artifact_id (str) -- ID of the artifact, defaults to None

  • resource_group (str) -- Resource group name, defaults to None

  • object_store_credentials (_AWSObjectStoreData) -- Object store credentials, defaults to None

  • metrics_list (List[str]) -- List of metrics to evaluate, defaults to None

__init__(run_id, execution_id, ai_core_client, configuration_id=None, artifact_id=None, resource_group=None, object_store_credentials=None, metrics_list=None)
Parameters:
  • run_id (str)

  • execution_id (str)

  • ai_core_client (AICoreV2Client)

  • configuration_id (str)

  • artifact_id (str)

  • resource_group (str)

  • object_store_credentials (_AWSObjectStoreData)

  • metrics_list (List[str])

get_current_status()

Get the current status of the evaluation run.

Returns:

Current status of the run

Return type:

Status

Raises:

ValueError -- If the current status cannot be retrieved

get_debug_info()

Provide debug information when execution status is FAILED or DEAD.

Returns:

Execution status details including failed pod information

Return type:

ExecutionStatusDetails

get_debug_logs()

Get the complete trace of execution logs.

Returns:

List of log entries as dictionaries

Return type:

list
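
A sketch of a typical debugging flow, assuming run is an EvaluationRun whose execution ended in a FAILED or DEAD state:

>>> run.get_current_status()            # confirm the run has failed
>>> details = run.get_debug_info()      # ExecutionStatusDetails with failed pod information
>>> for entry in run.get_debug_logs():  # full trace of execution log entries
...     print(entry)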

load_results_tables()

Download results from S3 and load the required table data.

Returns:

Dictionary containing completions and metrics table data

Return type:

dict

Raises:

RuntimeError -- If the results cannot be downloaded

results()

Get the results of the evaluation run.

Returns:

Results object for accessing completion and metric results

Return type:

Results

Raises:

ValueError -- If execution is not completed

set_cached_results_data(data)

Set the cached results data from the child results class.

Parameters:

data (Any) -- Results data to cache

wait_for_completion(timeout=None)

Wait for the evaluation run to complete by polling status.

Parameters:

timeout (Optional[int]) -- Maximum time to wait in seconds; if None, defaults to 3600 (1 hour)
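
A minimal usage sketch, assuming run is an existing EvaluationRun instance:

>>> run.wait_for_completion(timeout=1800)  # poll until the run finishes, waiting at most 30 minutes
>>> run.get_current_status()               # current Status of the run
>>> res = run.results()                    # Results handler; raises ValueError if the run is not completed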

class ExecutionStatusDetails

Bases: object

Dataclass for execution status details.

Parameters:
  • details (Any) -- Detailed information about the execution status

  • status (Any) -- Current status of the execution

__init__(details, status)
Parameters:
  • details (Any)

  • status (Any)

Return type:

None

details: Any
status: Any
class Results

Bases: object

Represents the Results handler for an EvaluationRun object.

This class provides methods to access completion results, metric results, and aggregated results for a specific evaluation run.

Parameters:

run (EvaluationRun) -- The parent EvaluationRun object

__init__(run)
Parameters:

run (EvaluationRun)

aggregations()

Get the aggregated results for the run from the tracking service.

Returns:

JSON response containing aggregated metric results

Return type:

dict

Raises:

ValueError -- If an error occurs while fetching aggregation results

completions()

Get the completion results for the run.

Returns:

DataFrame containing completion results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If an error occurs while fetching completions

metrics()

Get the metric-level results for the run.

Returns:

DataFrame containing metric results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If an error occurs while fetching metric results
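
A sketch of accessing results, assuming res is the Results object returned by EvaluationRun.results():

>>> completions_df = res.completions()  # pandas DataFrame of completion results
>>> metrics_df = res.metrics()          # pandas DataFrame of metric-level results
>>> aggregated = res.aggregations()     # dict of aggregated metric results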

configure_pandas_display()

gen_ai_hub.evaluations.models.metric_config module

class MetricConfig

Bases: object

Defines the metric configuration for the evaluations flow.

Parameters:
  • reference (MetricRef) -- Reference to the metric to be evaluated; the metric can be identified by name, UUID (id), or scenario/name/version

  • variable_mapping (Optional[dict]) -- Any variable mapping associated with the metric

__init__(reference, variable_mapping=None)
Parameters:
  • reference (MetricRef)

  • variable_mapping (dict)

class MetricRef

Bases: object

Represents a reference to a specific metric definition.

A metric can be identified in any of the following ways:

  • By its UUID from the metric management service (id)

  • By name (name)

  • By a combination of scenario, name, and version (scenario, name, version)

__init__(scenario=None, name=None, version=None, id=None)
Parameters:
  • scenario (str)

  • name (str)

  • version (str)

  • id (str)
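
A sketch of constructing metric references and configurations, assuming the metric names, IDs, and variable names below are placeholders:

>>> from gen_ai_hub.evaluations.models.metric_config import MetricConfig, MetricRef
>>> # Reference a metric by name
>>> by_name = MetricConfig(reference=MetricRef(name="accuracy"))
>>> # Reference a metric by UUID from the metric management service
>>> by_id = MetricConfig(reference=MetricRef(id="xyfz-rtyu-2456-ojns-yu6s"))
>>> # Reference a metric by scenario/name/version, with a variable mapping
>>> by_scenario = MetricConfig(
...     reference=MetricRef(scenario="foundation-models", name="faithfulness", version="1.0"),
...     variable_mapping={"context": "reference_text"}
... )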