gen_ai_hub.evaluations.models package

Submodules

gen_ai_hub.evaluations.models.artifact_source module

class ArtifactSource

Bases: object

Extends an artifact object with a relative path inside the artifact, for use in an EvaluationConfig.

Example usage:

>>> ArtifactSource(
...     artifact={
...         "id": "xyfz-rtyu-2456-ojns-yu6s",
...         "name": "dataset-artifact",
...         "url": "ai://default/eval_dataset"
...     },
...     path="rootfolder/data.csv",
...     file_type="csv"
... )
>>> ArtifactSource(
...     artifact="xyfz-rtyu-2456-ojns-yu6s",
...     path="rootfolder/data.json",
...     file_type="json"
... )
__init__(file_type, artifact, path=None)
Parameters:
  • file_type (Literal['csv', 'json', 'jsonl']) -- One of the supported file types

  • artifact (str | Artifact) -- Either the artifact ID as a string or an Artifact object from the AI_API_Client SDK

  • path (str | None) -- Relative path within the artifact; must point to a single file

gen_ai_hub.evaluations.models.dataset_config module

class Dataset

Bases: object

Dataset object for the evaluations flow.

The Dataset class accepts various source types for evaluation datasets including local file paths (as strings or Path objects) or AI Core artifacts.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

Examples:

Using a Path object:

>>> Dataset(Path("data/sample.json"))

Using a string path:

>>> Dataset("data/sample.json")

Using an ArtifactSource with artifact dictionary:

>>> Dataset(
...     ArtifactSource(
...         artifact={
...             "id": "xyfz-rtyu-2456-ojns-yu6s",
...             "name": "dataset-artifact",
...             "url": "ai://default/eval_dataset"
...         },
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )

Using an ArtifactSource with artifact ID:

>>> Dataset(
...     ArtifactSource(
...         artifact="xyfz-rtyu-2456-ojns-yu6s",
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )
__init__(source)

Initialize a Dataset instance.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

property file_type: str | None

Infer the file type from the source.

For ArtifactSource, returns the explicitly set file_type. For file paths, infers the type from the file extension.

Returns:

File type (e.g., "json", "jsonl", "csv") or None if it cannot be determined

Return type:

Optional[str]
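
For example, assuming the extension-based inference described above:

>>> Dataset("data/sample.jsonl").file_type
'jsonl'
>>> Dataset(Path("data/sample.csv")).file_type
'csv'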

gen_ai_hub.evaluations.models.evaluation_config module

class EvaluationConfig

Bases: object

Defines the evaluation configuration object for the Evaluations flow.

This class encapsulates all configuration parameters needed to run an evaluation job, including the model/template configuration, dataset, metrics, and execution settings.

At least one of the following must be provided:

  • llm and template combination (using orchestration_v2 models)

  • orchestration_registry_reference (UUID of a registered orchestration configuration)

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object specifying the evaluation dataset

  • metrics (List[MetricConfig]) -- List of metric configurations for evaluation

  • llm (Optional[LLM]) -- LLM configuration from orchestration_v2 (LLMModelDetails)

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template as string, PromptTemplateSpec, or TemplateRef

  • orchestration_registry_reference (Optional[str]) -- UUID of registered orchestration configuration

  • template_variable_mapping (Optional[dict]) -- Variable mapping for the prompt template

  • test_row_count (Optional[int]) -- Number of rows to sample from dataset (-1 for all rows), defaults to -1

  • repetitions (Optional[int]) -- Number of times to repeat evaluation over the dataset, defaults to 1

  • tags (Optional[dict]) -- User-defined metadata as key-value pairs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logs in hyperscaler output path, defaults to False

Note

This module uses orchestration_v2 models directly.

Example using TemplateRef with ID:

>>> from gen_ai_hub.evaluations.models import EvaluationConfig, Dataset, MetricConfig
>>> from gen_ai_hub.orchestration_v2.models.llm_model_details import LLMModelDetails as LLM
>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRef, TemplateRefByID
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest"),
...     template=TemplateRef(template_ref=TemplateRefByID(id="template-id-here")),
...     test_row_count=100
... )

Example using TemplateRef with scenario/name/version:

>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRefByScenarioNameVersion
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest", params={"temperature": 0.7}),
...     template=TemplateRef(template_ref=TemplateRefByScenarioNameVersion(
...         scenario="foundation-models", name="prompt1", version="1.0"
...     )),
...     test_row_count=100
... )
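
Example using an orchestration_registry_reference instead of an llm/template pair (a sketch only; the UUID below is a placeholder and the imports from the examples above are assumed):

>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     orchestration_registry_reference="00000000-0000-0000-0000-000000000000",
...     test_row_count=100
... )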
__init__(dataset_config, metrics, llm=None, template=None, orchestration_registry_reference=None, template_variable_mapping=None, test_row_count=-1, repetitions=1, tags='{}', debug_mode=False)

Initialize an EvaluationConfig instance.

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object

  • metrics (List[MetricConfig]) -- List of metric configurations

  • llm (Optional[LLM]) -- LLM object from orchestration_v2 (LLMModelDetails), defaults to None

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template (string, PromptTemplateSpec, or TemplateRef), defaults to None

  • orchestration_registry_reference (Optional[str]) -- UUID of orchestration config, defaults to None

  • template_variable_mapping (Optional[dict]) -- Variable mapping for prompt template, defaults to None

  • test_row_count (Optional[int]) -- Number of dataset rows to sample (-1 for all), defaults to -1

  • repetitions (Optional[int]) -- Number of evaluation repetitions (minimum: 1), defaults to 1

  • tags (Optional[dict]) -- Key-value metadata pairs applied to all runs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logging, defaults to False

Raises:

ValueError -- If neither (llm, template) nor orchestration_registry_reference is provided

gen_ai_hub.evaluations.models.evaluation_run module

class EvaluationRun

Bases: object

Represents an individual EvaluationRun object and its associated context.

Parameters:
  • run_id (str) -- Unique identifier for the evaluation run

  • execution_id (str) -- ID of the AI Core execution

  • ai_core_client (AICoreV2Client) -- AI Core client instance

  • configuration_id (str) -- ID of the configuration, defaults to None

  • artifact_id (str) -- ID of the artifact, defaults to None

  • resource_group (str) -- Resource group name, defaults to None

  • object_store_credentials (_AWSObjectStoreData) -- Object store credentials, defaults to None

  • metrics_list (List[str]) -- List of metrics to evaluate, defaults to None

__init__(run_id, execution_id, ai_core_client, configuration_id=None, artifact_id=None, resource_group=None, object_store_credentials=None, metrics_list=None)
Parameters:
  • run_id (str)

  • execution_id (str)

  • ai_core_client (AICoreV2Client)

  • configuration_id (str)

  • artifact_id (str)

  • resource_group (str)

  • object_store_credentials (_AWSObjectStoreData)

  • metrics_list (List[str])

get_current_status()

Get the current status of the evaluation run.

Returns:

Current status of the run

Return type:

Status

Raises:

ValueError -- If the current status cannot be retrieved

get_debug_info()

Provide debug information when execution status is FAILED or DEAD.

Returns:

Execution status details including failed pod information

Return type:

ExecutionStatusDetails

get_debug_logs()

Get the complete trace of execution logs.

Returns:

List of log entries as dictionaries

Return type:

list
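
A sketch of a typical debugging flow, assuming run is an EvaluationRun whose execution ended in a FAILED or DEAD state:

>>> run.get_current_status()            # confirm the run has failed
>>> details = run.get_debug_info()      # ExecutionStatusDetails with failed pod information
>>> for entry in run.get_debug_logs():  # full trace of execution log entries
...     print(entry)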

load_results_tables()

Download results from S3 and load the required table data.

Returns:

Dictionary containing completions and metrics table data

Return type:

dict

Raises:

RuntimeError -- If the results cannot be downloaded

results()

Get the results of the evaluation run.

Returns:

Results object for accessing completion and metric results

Return type:

Results

Raises:

ValueError -- If execution is not completed

set_cached_results_data(data)

Set the cached results data from the child results class.

Parameters:

data (Any) -- Results data to cache

wait_for_completion(timeout=None)

Wait for the evaluation run to complete by polling status.

Parameters:

timeout (Optional[int]) -- Maximum time to wait in seconds; if None, defaults to 3600 (1 hour)
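
A minimal usage sketch, assuming run is an existing EvaluationRun instance:

>>> run.wait_for_completion(timeout=1800)  # poll until the run finishes, waiting at most 30 minutes
>>> run.get_current_status()               # current Status of the run
>>> res = run.results()                    # Results handler; raises ValueError if the run is not completed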

class ExecutionStatusDetails

Bases: object

Dataclass for execution status details.

Parameters:
  • details (Any) -- Detailed information about the execution status

  • status (Any) -- Current status of the execution

__init__(details, status)
Parameters:
  • details (Any)

  • status (Any)

Return type:

None

details: Any
status: Any
class Results

Bases: object

Represents the Results handler for an EvaluationRun object.

This class provides methods to access completion results, metric results, and aggregated results for a specific evaluation run.

Parameters:

run (EvaluationRun) -- The parent EvaluationRun object

__init__(run)
Parameters:

run (EvaluationRun)

aggregations()

Get the aggregated results for the run from the tracking service.

Returns:

JSON response containing aggregated metric results

Return type:

dict

Raises:

ValueError -- If an error occurs while fetching aggregation results

completions()

Get the completion results for the run.

Returns:

DataFrame containing completion results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If an error occurs while fetching completions

metrics()

Get the metric-level results for the run.

Returns:

DataFrame containing metric results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If an error occurs while fetching metric results
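
A sketch of accessing results, assuming res is the Results object returned by EvaluationRun.results():

>>> completions_df = res.completions()  # pandas DataFrame of completion results
>>> metrics_df = res.metrics()          # pandas DataFrame of metric-level results
>>> aggregated = res.aggregations()     # dict of aggregated metric results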

configure_pandas_display()

gen_ai_hub.evaluations.models.metric_config module

class MetricConfig

Bases: object

Defines the metric configuration for the evaluations flow.

Parameters:
  • reference (MetricRef) -- Reference to the metric to be evaluated; the metric can be identified by name, UUID (id), or scenario/name/version

  • variable_mapping (Optional[dict]) -- Any variable mapping associated with the metric

__init__(reference, variable_mapping=None)
Parameters:
  • reference (MetricRef)

  • variable_mapping (dict)

class MetricRef

Bases: object

Represents a reference to a specific metric definition.

A metric can be identified in any of the following ways:

  • By its UUID from the metric management service (id)

  • By name (name)

  • By a combination of scenario, name, and version (scenario, name, version)

__init__(scenario=None, name=None, version=None, id=None)
Parameters:
  • scenario (str)

  • name (str)

  • version (str)

  • id (str)
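
A sketch of constructing metric references and configurations, assuming the metric names, IDs, and variable names below are placeholders:

>>> from gen_ai_hub.evaluations.models.metric_config import MetricConfig, MetricRef
>>> # Reference a metric by name
>>> by_name = MetricConfig(reference=MetricRef(name="accuracy"))
>>> # Reference a metric by UUID from the metric management service
>>> by_id = MetricConfig(reference=MetricRef(id="xyfz-rtyu-2456-ojns-yu6s"))
>>> # Reference a metric by scenario/name/version, with a variable mapping
>>> by_scenario = MetricConfig(
...     reference=MetricRef(scenario="foundation-models", name="faithfulness", version="1.0"),
...     variable_mapping={"context": "reference_text"}
... )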