gen_ai_hub.evaluations package

class ArtifactSource

Bases: object

Extends the artifact object with a relative path that the user can provide, for use in an EvaluationConfig.

Example usage:

>>> ArtifactSource(
...     artifact={
...         "id": "xyfz-rtyu-2456-ojns-yu6s",
...         "name": "dataset-artifact",
...         "url": "ai://default/eval_dataset"
...         ...
...     },
...     path="rootfolder/data.csv",
...     file_type="csv"
... )
>>> ArtifactSource(
...     artifact="xyfz-rtyu-2456-ojns-yu6s",
...     path="rootfolder/data.json",
...     file_type="json"
... )
__init__(file_type, artifact, path=None)
Parameters:

  • artifact (Union[str, Artifact]) -- Either the artifact ID as a string, or an Artifact object from the AI API client SDK.

  • path (Optional[str]) -- Relative path within the artifact path provided; should point to a single file.

  • file_type (Literal["csv", "json", "jsonl"]) -- One of the supported file types.

Parameters:
  • file_type (Literal['csv', 'json', 'jsonl'])

  • artifact (str | Artifact)

  • path (str | None)

class Dataset

Bases: object

Dataset object for the evaluations flow.

The Dataset class accepts various source types for evaluation datasets including local file paths (as strings or Path objects) or AI Core artifacts.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

Examples:

Using a Path object:

>>> Dataset(Path("data/sample.json"))

Using a string path:

>>> Dataset("data/sample.json")

Using an ArtifactSource with artifact dictionary:

>>> Dataset(
...     ArtifactSource(
...         artifact={
...             "id": "xyfz-rtyu-2456-ojns-yu6s",
...             "name": "dataset-artifact",
...             "url": "ai://default/eval_dataset"
...         },
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )

Using an ArtifactSource with artifact ID:

>>> Dataset(
...     ArtifactSource(
...         artifact="xyfz-rtyu-2456-ojns-yu6s",
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )
__init__(source)

Initialize a Dataset instance.

Parameters:

source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource

property file_type: str | None

Infer the file type from the source.

For ArtifactSource, returns the explicitly set file_type. For file paths, infers the type from the file extension.

Returns:

File type (e.g., "json", "jsonl", "csv") or None if cannot be determined

Return type:

Optional[str]
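For file-path sources, the extension-based inference described above can be sketched as a simple lookup. This is an illustrative reimplementation, not the library's actual code:

```python
from pathlib import Path
from typing import Optional, Union

# The file types the evaluations flow supports, per ArtifactSource.
SUPPORTED_TYPES = {"csv", "json", "jsonl"}

def infer_file_type(source: Union[str, Path]) -> Optional[str]:
    """Infer the dataset file type from the file extension."""
    suffix = Path(source).suffix.lstrip(".").lower()
    # Return the extension only if it is one of the supported types.
    return suffix if suffix in SUPPORTED_TYPES else None
```

For an ArtifactSource, no inference is needed: the explicitly set file_type is returned directly.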

class EvaluationClient

Bases: object

Base Client for the Evaluations service

static from_env(profile_name=None, **kwargs)

Alternative way to create an EvaluationClient object.

Parameter resolution precedence:

  1. Explicit keyword arguments

  2. Environment variables

  3. Configuration file

  4. VCAP_SERVICES environment variable

Parameters:
  • profile_name (str, optional) -- Profile name defined in configuration.

  • kwargs -- Additional parameters passed to constructor.

Returns:

Configured EvaluationClient instance.

Return type:

EvaluationClient

__init__(base_url, auth_url=None, client_id=None, client_secret=None, cert_str=None, key_str=None, cert_file_path=None, key_file_path=None, resource_group=None, aws_access_key_id=None, aws_secret_access_key=None, ai_core_client=None, orchestration_url=None, input_object_store_secret_name=None, provider_name='aws')

EvaluationClient root object to be used for evaluations.

Parameters:
  • base_url (str) -- Base URL of the AI Core instance (must include /v2 suffix).

  • auth_url (str, optional) -- Authentication URL used to retrieve access tokens.

  • client_id (str, optional) -- OAuth client ID.

  • client_secret (str, optional) -- OAuth client secret.

  • cert_str (str, optional) -- X.509 certificate content as a string.

  • key_str (str, optional) -- X.509 private key content as a string.

  • cert_file_path (str, optional) -- File path to X.509 certificate.

  • key_file_path (str, optional) -- File path to X.509 private key.

  • resource_group (str, optional) -- Resource group name within the AI Core instance.

  • aws_access_key_id (str, optional) -- AWS access key ID.

  • aws_secret_access_key (str, optional) -- AWS secret access key.

  • ai_core_client (AICoreV2Client, optional) -- Pre-configured AI Core client instance.

  • orchestration_url (str, optional) -- Pre-existing orchestration deployment URL.

  • input_object_store_secret_name (str, optional) -- Name of input object store secret.

  • provider_name (str, optional) -- Hyperscaler provider name (e.g., "aws").

Raises:

ValueError -- If required hyperscaler provider parameters are missing.

create_or_update_object_store_secret(*, context, secret_body, is_default, result_key, attr_name, creator_mapping, replace_existing, result)
Parameters:
  • secret_body (dict)

  • is_default (bool)

  • result_key (str)

  • attr_name (str)

  • creator_mapping (dict)

  • replace_existing (bool)

  • result (dict)

evaluate(evaluation_configs)

Main evaluate function that creates the evaluation job.

Parameters:

evaluation_configs (List[EvaluationConfig]) -- A list of one or more EvaluationConfig objects.

Returns:

List[EvaluationRun]: A list of EvaluationRun objects, one for each EvaluationConfig provided.

Parameters:

evaluation_configs (List[EvaluationConfig])

Return type:

List[EvaluationRun]

get_system_supported_metrics()

Helper method to get the list of all supported metric IDs.

Return type:

List[str]

list_available_models()

Method to list all available LLM models.

resolve_orchestration_deployment_url()

Resolves the orchestration deployment URL.

For non-default resource groups, it creates a new deployment. For the default resource group, it attempts to discover an existing deployment with the default config name using the orchestration service, or creates one if not found.

Returns:

The orchestration deployment URL.

Return type:

str
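The resolution logic above can be sketched as a small decision function. The helper names (`discover`, `create`) are hypothetical stand-ins for the AI Core deployment calls the real client makes:

```python
from typing import Callable, Optional

def resolve_deployment_url(
    resource_group: str,
    discover: Callable[[], Optional[str]],
    create: Callable[[], str],
) -> str:
    """Return an orchestration deployment URL for the resource group.

    Non-default resource groups always get a fresh deployment; the
    default group reuses an existing deployment when one is found.
    """
    if resource_group != "default":
        return create()
    existing = discover()  # look up a deployment with the default config name
    return existing if existing is not None else create()
```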

setup(input_secret_body=None, default_secret_body=None, replace_existing=False)

One-time setup function that creates the object store secrets and, if not already provided, the orchestration deployment URL.

Parameters:
  • input_secret_body (dict | None)

  • default_secret_body (dict | None)

  • replace_existing (bool)

validate_secret_type(secret_type, creator_mapping)
Parameters:
  • secret_type (str)

  • creator_mapping (dict)

class EvaluationConfig

Bases: object

Defines the evaluation configuration object for the Evaluations flow.

This class encapsulates all configuration parameters needed to run an evaluation job, including the model/template configuration, dataset, metrics, and execution settings.

At least one of the following must be provided:

  • llm and template combination (using orchestration_v2 models)

  • orchestration_registry_reference (UUID of a registered orchestration configuration)

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object specifying the evaluation dataset

  • metrics (List[MetricConfig]) -- List of metric configurations for evaluation

  • llm (Optional[LLM]) -- LLM configuration from orchestration_v2 (LLMModelDetails)

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template as string, PromptTemplateSpec, or TemplateRef

  • orchestration_registry_reference (Optional[str]) -- UUID of registered orchestration configuration

  • template_variable_mapping (Optional[dict]) -- Variable mapping for the prompt template

  • test_row_count (Optional[int]) -- Number of rows to sample from dataset (-1 for all rows), defaults to -1

  • repetitions (Optional[int]) -- Number of times to repeat evaluation over the dataset, defaults to 1

  • tags (Optional[dict]) -- User-defined metadata as key-value pairs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logs in hyperscaler output path, defaults to False

Note

This module uses orchestration_v2 models directly.

Example using TemplateRef with ID:

>>> from gen_ai_hub.evaluations.models import EvaluationConfig, Dataset, MetricConfig
>>> from gen_ai_hub.orchestration_v2.models.llm_model_details import LLMModelDetails as LLM
>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRef, TemplateRefByID
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest"),
...     template=TemplateRef(template_ref=TemplateRefByID(id="template-id-here")),
...     test_row_count=100
... )

Example using TemplateRef with scenario/name/version:

>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRefByScenarioNameVersion
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest", params={"temperature": 0.7}),
...     template=TemplateRef(template_ref=TemplateRefByScenarioNameVersion(
...         scenario="foundation-models", name="prompt1", version="1.0"
...     )),
...     test_row_count=100
... )
__init__(dataset_config, metrics, llm=None, template=None, orchestration_registry_reference=None, template_variable_mapping=None, test_row_count=-1, repetitions=1, tags='{}', debug_mode=False)

Initialize an EvaluationConfig instance.

Parameters:
  • dataset_config (Dataset) -- Dataset configuration object

  • metrics (List[MetricConfig]) -- List of metric configurations

  • llm (Optional[LLM]) -- LLM object from orchestration_v2 (LLMModelDetails), defaults to None

  • template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template (string, PromptTemplateSpec, or TemplateRef), defaults to None

  • orchestration_registry_reference (Optional[str]) -- UUID of orchestration config, defaults to None

  • template_variable_mapping (Optional[dict]) -- Variable mapping for prompt template, defaults to None

  • test_row_count (Optional[int]) -- Number of dataset rows to sample (-1 for all), defaults to -1

  • repetitions (Optional[int]) -- Number of evaluation repetitions (minimum: 1), defaults to 1

  • tags (Optional[dict]) -- Key-value metadata pairs applied to all runs, defaults to "{}"

  • debug_mode (Optional[bool]) -- Enable debug logging, defaults to False

Raises:

ValueError -- If neither (llm, template) nor orchestration_registry_reference is provided
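The requirement that either the llm/template pair or a registry reference be supplied can be sketched as a standalone validation check (illustrative only; the actual check lives inside EvaluationConfig):

```python
from typing import Optional

def validate_config_inputs(
    llm: Optional[object],
    template: Optional[object],
    orchestration_registry_reference: Optional[str],
) -> None:
    """Raise ValueError unless a usable model configuration is present."""
    has_llm_template = llm is not None and template is not None
    has_registry_ref = orchestration_registry_reference is not None
    if not (has_llm_template or has_registry_ref):
        raise ValueError(
            "Provide either both llm and template, or an "
            "orchestration_registry_reference."
        )
```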

class EvaluationRun

Bases: object

Represents an individual EvaluationRun object and its associated context.

Parameters:
  • run_id (str) -- Unique identifier for the evaluation run

  • execution_id (str) -- ID of the AI Core execution

  • ai_core_client (AICoreV2Client) -- AI Core client instance

  • configuration_id (str) -- ID of the configuration, defaults to None

  • artifact_id (str) -- ID of the artifact, defaults to None

  • resource_group (str) -- Resource group name, defaults to None

  • object_store_credentials (_AWSObjectStoreData) -- Object store credentials, defaults to None

  • metrics_list (List[str]) -- List of metrics to evaluate, defaults to None

__init__(run_id, execution_id, ai_core_client, configuration_id=None, artifact_id=None, resource_group=None, object_store_credentials=None, metrics_list=None)
Parameters:
  • run_id (str)

  • execution_id (str)

  • ai_core_client (AICoreV2Client)

  • configuration_id (str)

  • artifact_id (str)

  • resource_group (str)

  • object_store_credentials (_AWSObjectStoreData)

  • metrics_list (List[str])

get_current_status()

Get the current status of the evaluation run.

Returns:

Current status of the run

Return type:

Status

Raises:

ValueError -- If failed to retrieve the current status

get_debug_info()

Provide debug information when execution status is FAILED or DEAD.

Returns:

Execution status details including failed pod information

Return type:

ExecutionStatusDetails

get_debug_logs()

Get the complete trace of execution logs.

Returns:

List of log entries as dictionaries

Return type:

list

load_results_tables()

Download results from S3 and load the required table data.

Returns:

Dictionary containing completions and metrics table data

Return type:

dict

Raises:

RuntimeError -- If failed to download results

results()

Get the results of the evaluation run.

Returns:

Results object for accessing completion and metric results

Return type:

Results

Raises:

ValueError -- If execution is not completed

set_cached_results_data(data)

Set the cached results data from the child results class.

Parameters:

data (Any) -- Results data to cache

wait_for_completion(timeout=None)

Wait for the evaluation run to complete by polling status.

Parameters:

timeout (Optional[int]) -- Maximum time to wait in seconds, defaults to 3600 (1 hour)
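The polling behaviour can be sketched as a generic wait loop. The terminal state names match those mentioned for get_debug_info; the poll interval is an illustrative assumption:

```python
import time
from typing import Callable, Optional

# States after which polling can stop (FAILED/DEAD per get_debug_info).
TERMINAL_STATES = {"COMPLETED", "FAILED", "DEAD"}

def wait_for_completion(
    get_status: Callable[[], str],
    timeout: Optional[int] = 3600,
    poll_interval: float = 5.0,
) -> str:
    """Poll get_status until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + (timeout if timeout is not None else 3600)
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still {status} after {timeout} seconds")
        time.sleep(poll_interval)
```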

class MetricConfig

Bases: object

Defines the metric config of the evaluation flow

Parameters:

  • reference (MetricRef) -- Reference to the metric to be evaluated; can be one of name, UUID (id), or scenario/name/version.

  • variable_mapping (Optional[dict]) -- Any variable mapping associated with the metric.

__init__(reference, variable_mapping=None)
Parameters:
  • reference (MetricRef)

  • variable_mapping (dict)

class MetricRef

Bases: object

Represents a reference to a specific metric definition.

A metric can be identified in multiple ways:

  • By its UUID from the metric management service (id)

  • By name (name)

  • By a combination of scenario, name, and version (scenario, name, version)

__init__(scenario=None, name=None, version=None, id=None)
Parameters:
  • scenario (str)

  • name (str)

  • version (str)

  • id (str)
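The three identification styles can be illustrated with a small classifier over a simplified MetricRef stand-in (the actual resolution happens in the metric management service):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricRef:
    """Simplified stand-in mirroring the constructor arguments above."""
    scenario: Optional[str] = None
    name: Optional[str] = None
    version: Optional[str] = None
    id: Optional[str] = None

def describe_ref(ref: MetricRef) -> str:
    """Classify which identification style a MetricRef uses."""
    if ref.id:
        return "by-id"
    if ref.scenario and ref.name and ref.version:
        return "by-scenario-name-version"
    if ref.name:
        return "by-name"
    raise ValueError("MetricRef needs an id, a name, or scenario/name/version")
```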

class Results

Bases: object

Represents the Results handler for an EvaluationRun object.

This class provides methods to access completion results, metric results, and aggregated results for a specific evaluation run.

Parameters:

run (EvaluationRun) -- The parent EvaluationRun object

__init__(run)
Parameters:

run (EvaluationRun)

aggregations()

Get the aggregated results for the run from the tracking service.

Returns:

JSON response containing aggregated metric results

Return type:

dict

Raises:

ValueError -- If error occurs while fetching aggregation results

completions()

Get the completion results for the run.

Returns:

DataFrame containing completion results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If error occurs while fetching completions

metrics()

Get the metric-level results for the run.

Returns:

DataFrame containing metric results for the run

Return type:

pd.DataFrame

Raises:

ValueError -- If error occurs while fetching metric results
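What aggregations() returns is precomputed by the tracking service, but the idea can be illustrated with a plain-Python sketch that averages per-metric scores; the row shape (`metric`/`score` keys) is an assumption for illustration:

```python
from collections import defaultdict
from typing import Dict, List

def aggregate_metric_rows(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Average the 'score' column per 'metric' over all result rows."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        scores[str(row["metric"])].append(float(row["score"]))  # type: ignore[arg-type]
    return {metric: sum(vals) / len(vals) for metric, vals in scores.items()}
```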

Subpackages

Submodules

gen_ai_hub.evaluations.client module


gen_ai_hub.evaluations.constants module

gen_ai_hub.evaluations.credentials module

class CredentialsValue

Bases: object

CredentialsValue(name: 'str', vcap_key: 'Optional[Tuple[str, ...]]' = None, transform_fn: 'Optional[Callable]' = None)

__init__(name, vcap_key=None, transform_fn=None)
Parameters:
  • name (str)

  • vcap_key (Tuple[str, ...] | None)

  • transform_fn (Callable | None)

Return type:

None

name: str
transform_fn: Callable | None = None
vcap_key: Tuple[str, ...] | None = None
class Service

Bases: object

__init__(env)
Parameters:

env (Dict[str, Any])

get(key, default=<object object>)
property label: str | None
property name: str | None
class Source

Bases: object

Source(name: 'str', get: 'Callable[[CredentialsValue], Optional[str]]')

__init__(name, get)
Parameters:
  • name (str)

  • get (Callable[[CredentialsValue], str | None])

Return type:

None

get: Callable[[CredentialsValue], str | None]
name: str
class VCAPEnvironment

Bases: object

VCAPEnvironment(services: 'List[Service]')

classmethod from_dict(env)
Parameters:

env (Dict[str, Any])

classmethod from_env(env_var=None)
Parameters:

env_var (str | None)

__init__(services)
Parameters:

services (List[Service])

Return type:

None

get_service(label, exactly_one=True)
Parameters:

exactly_one (bool)

Return type:

Service

get_service_by_name(name, exactly_one=True)
Parameters:

exactly_one (bool)

Return type:

Service

services: List[Service]
extract_credentials(source, exclude=None)

Extract all credentials from a source.

Parameters:
  • source (Source)

  • exclude (List[str])

Return type:

Dict[str, str]

fetch_credentials(profile=None, **kwargs)

Fetch credentials from a single source based on precedence.

Precedence order: kwargs > environment variables > config file > VCAP service

Once a source is selected (first one with any credential), all credentials come from that source only. Resource group is an exception and follows precedence independently.

Parameters:

profile (str)

Return type:

Dict[str, str]
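The single-source rule described above ("first source with any credential wins, then all credentials come from that source") can be sketched over plain dicts:

```python
from typing import Dict, List, Tuple

def pick_source(sources: List[Tuple[str, Dict[str, str]]]) -> Dict[str, str]:
    """Return all credentials from the first source that defines any.

    Sources are assumed to be ordered by precedence: kwargs,
    environment variables, config file, VCAP service.
    """
    for _name, creds in sources:
        if creds:  # first source with any credential is selected
            return dict(creds)
    return {}
```

Note that resource_group is handled separately in the real fetch_credentials and may come from a different source than the other credentials.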

get_home()
Return type:

str

get_nested_value(data_dict, keys)

Retrieve a nested value from a dictionary using a list of strings.

Parameters:
  • data_dict -- The dictionary to search.

  • keys (List[str]) -- A list of strings representing nested keys.

Returns:

The value associated with the nested keys, or None if not found.
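A minimal sketch of that lookup, with the None-on-miss behaviour taken from the docstring:

```python
from typing import Any, Dict, List, Optional

def get_nested_value(data_dict: Dict[str, Any], keys: List[str]) -> Optional[Any]:
    """Walk data_dict along keys, returning None if any key is missing."""
    current: Any = data_dict
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current
```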

init_conf(profile=None)
Parameters:

profile (str)

resolve_credentials(sources)

Extract credentials from the first source that has any defined.

Parameters:

sources (List[Source])

Return type:

Dict[str, str]

resolve_resource_group(sources)

Find resource_group from the first source that defines it.

Parameters:

sources (List[Source])

Return type:

str | None

validate_credentials(credentials)

Validate that we have a complete authentication method.

Parameters:

credentials (Dict[str, str])

Return type:

None