gen_ai_hub.evaluations.models package
Submodules
gen_ai_hub.evaluations.models.artifact_source module
- class ArtifactSource
Bases: object
Extends an artifact with a relative path inside it that the user can provide, to be used in an EvaluationConfig.
Example usage:
>>> ArtifactSource(
...     artifact={
...         "id": "xyfz-rtyu-2456-ojns-yu6s",
...         "name": "dataset-artifact",
...         "url": "ai://default/eval_dataset"
...     },
...     path="rootfolder/data.csv",
...     file_type="csv"
... )
>>> ArtifactSource(
...     artifact="xyfz-rtyu-2456-ojns-yu6s",
...     path="rootfolder/data.json",
...     file_type="json"
... )
- __init__(file_type, artifact, path=None)
- Parameters:
artifact (Union[str, Artifact]) -- Either the artifact ID as a string or an Artifact object from the AI API client SDK
path (Optional[str]) -- Relative path within the artifact; must point to a single file
file_type (Literal["csv", "json", "jsonl"]) -- One of the supported file types
- Parameters:
file_type (Literal['csv', 'json', 'jsonl'])
artifact (str | Artifact)
path (str | None)
gen_ai_hub.evaluations.models.dataset_config module
- class Dataset
Bases: object
Dataset object for the evaluations flow.
The Dataset class accepts various source types for evaluation datasets including local file paths (as strings or Path objects) or AI Core artifacts.
- Parameters:
source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource
Examples:
Using a Path object:
>>> Dataset(Path("data/sample.json"))
Using a string path:
>>> Dataset("data/sample.json")
Using an ArtifactSource with an artifact dictionary:
>>> Dataset(
...     ArtifactSource(
...         artifact={
...             "id": "xyfz-rtyu-2456-ojns-yu6s",
...             "name": "dataset-artifact",
...             "url": "ai://default/eval_dataset"
...         },
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )
Using an ArtifactSource with an artifact ID:
>>> Dataset(
...     ArtifactSource(
...         artifact="xyfz-rtyu-2456-ojns-yu6s",
...         path="rootfolder/data.csv",
...         file_type="csv"
...     )
... )
- __init__(source)
Initialize a Dataset instance.
- Parameters:
source (Union[str, Path, ArtifactSource]) -- Source of the dataset - can be a file path string, Path object, or ArtifactSource
- property file_type: str | None
Infer the file type from the source.
For ArtifactSource, returns the explicitly set file_type. For file paths, infers the type from the file extension.
- Returns:
File type (e.g., "json", "jsonl", "csv") or None if it cannot be determined
- Return type:
Optional[str]
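For illustration, a small sketch of the extension-based inference described above (the file names and return values shown are assumptions, not output from the library's test suite):
>>> Dataset("data/sample.jsonl").file_type
'jsonl'
>>> Dataset("data/sample.csv").file_type
'csv'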
gen_ai_hub.evaluations.models.evaluation_config module
- class EvaluationConfig
Bases: object
Defines the evaluation configuration object for the Evaluations flow.
This class encapsulates all configuration parameters needed to run an evaluation job, including the model/template configuration, dataset, metrics, and execution settings.
At least one of the following must be provided:
- llm and template combination (using orchestration_v2 models)
- orchestration_registry_reference (UUID of a registered orchestration configuration)
- Parameters:
dataset_config (Dataset) -- Dataset configuration object specifying the evaluation dataset
metrics (List[MetricConfig]) -- List of metric configurations for evaluation
llm (Optional[LLM]) -- LLM configuration from orchestration_v2 (LLMModelDetails)
template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template as string, PromptTemplateSpec, or TemplateRef
orchestration_registry_reference (Optional[str]) -- UUID of registered orchestration configuration
template_variable_mapping (Optional[dict]) -- Variable mapping for the prompt template
test_row_count (Optional[int]) -- Number of rows to sample from dataset (-1 for all rows), defaults to -1
repetitions (Optional[int]) -- Number of times to repeat evaluation over the dataset, defaults to 1
tags (Optional[dict]) -- User-defined metadata as key-value pairs, defaults to "{}"
debug_mode (Optional[bool]) -- Enable debug logs in hyperscaler output path, defaults to False
Note
This module uses orchestration_v2 models directly.
Example using TemplateRef with ID:
>>> from gen_ai_hub.evaluations.models import EvaluationConfig, Dataset, MetricConfig
>>> from gen_ai_hub.orchestration_v2.models.llm_model_details import LLMModelDetails as LLM
>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRef, TemplateRefByID
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest"),
...     template=TemplateRef(template_ref=TemplateRefByID(id="template-id-here")),
...     test_row_count=100
... )
Example using TemplateRef with scenario/name/version:
>>> from gen_ai_hub.orchestration_v2.models.template_ref import TemplateRefByScenarioNameVersion
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(name="accuracy")],
...     llm=LLM(name="gpt-4", version="latest", params={"temperature": 0.7}),
...     template=TemplateRef(template_ref=TemplateRefByScenarioNameVersion(
...         scenario="foundation-models", name="prompt1", version="1.0"
...     )),
...     test_row_count=100
... )
- __init__(dataset_config, metrics, llm=None, template=None, orchestration_registry_reference=None, template_variable_mapping=None, test_row_count=-1, repetitions=1, tags='{}', debug_mode=False)
Initialize an EvaluationConfig instance.
- Parameters:
dataset_config (Dataset) -- Dataset configuration object
metrics (List[MetricConfig]) -- List of metric configurations
llm (Optional[LLM]) -- LLM object from orchestration_v2 (LLMModelDetails), defaults to None
template (Optional[Union[str, PromptTemplateSpec, TemplateRef]]) -- Prompt template (string, PromptTemplateSpec, or TemplateRef), defaults to None
orchestration_registry_reference (Optional[str]) -- UUID of orchestration config, defaults to None
template_variable_mapping (Optional[dict]) -- Variable mapping for prompt template, defaults to None
test_row_count (Optional[int]) -- Number of dataset rows to sample (-1 for all), defaults to -1
repetitions (Optional[int]) -- Number of evaluation repetitions (minimum: 1), defaults to 1
tags (Optional[dict]) -- Key-value metadata pairs applied to all runs, defaults to "{}"
debug_mode (Optional[bool]) -- Enable debug logging, defaults to False
- Raises:
ValueError -- If neither (llm, template) nor orchestration_registry_reference is provided
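The alternative configuration path, via orchestration_registry_reference instead of an llm/template pair, could look roughly like the sketch below. The UUID and metric name are placeholders, the MetricConfig/MetricRef usage follows the metric_config module documented later in this page, and the import path for MetricRef is an assumption:
>>> from gen_ai_hub.evaluations.models import EvaluationConfig, Dataset, MetricConfig, MetricRef  # MetricRef import path assumed
>>> config = EvaluationConfig(
...     dataset_config=Dataset("data/test.jsonl"),
...     metrics=[MetricConfig(reference=MetricRef(name="accuracy"))],
...     orchestration_registry_reference="123e4567-e89b-12d3-a456-426614174000",  # placeholder UUID of a registered orchestration configuration
...     test_row_count=-1,  # evaluate all rows
... )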
gen_ai_hub.evaluations.models.evaluation_run module
- class EvaluationRun
Bases: object
Represents an individual EvaluationRun object and its associated context.
- Parameters:
run_id (str) -- Unique identifier for the evaluation run
execution_id (str) -- ID of the AI Core execution
ai_core_client (AICoreV2Client) -- AI Core client instance
configuration_id (str) -- ID of the configuration, defaults to None
artifact_id (str) -- ID of the artifact, defaults to None
resource_group (str) -- Resource group name, defaults to None
object_store_credentials (_AWSObjectStoreData) -- Object store credentials, defaults to None
metrics_list (List[str]) -- List of metrics to evaluate, defaults to None
- __init__(run_id, execution_id, ai_core_client, configuration_id=None, artifact_id=None, resource_group=None, object_store_credentials=None, metrics_list=None)
- Parameters:
run_id (str)
execution_id (str)
ai_core_client (AICoreV2Client)
configuration_id (str)
artifact_id (str)
resource_group (str)
object_store_credentials (_AWSObjectStoreData)
metrics_list (List[str])
- get_current_status()
Get the current status of the evaluation run.
- Returns:
Current status of the run
- Return type:
Status
- Raises:
ValueError -- If failed to retrieve the current status
- get_debug_info()
Provide debug information when execution status is FAILED or DEAD.
- Returns:
Execution status details including failed pod information
- Return type:
ExecutionStatusDetails
- get_debug_logs()
Get the complete trace of execution logs.
- Returns:
List of log entries as dictionaries
- Return type:
list
- load_results_tables()
Download results from S3 and load the required table data.
- Returns:
Dictionary containing completions and metrics table data
- Return type:
dict
- Raises:
RuntimeError -- If failed to download results
- results()
Get the results of the evaluation run.
- Returns:
Results object for accessing completion and metric results
- Return type:
Results
- Raises:
ValueError -- If execution is not completed
- set_cached_results_data(data)
Set the cached results data from the child results class.
- Parameters:
data (Any) -- Results data to cache
- wait_for_completion(timeout=None)
Wait for the evaluation run to complete by polling status.
- Parameters:
timeout (Optional[int]) -- Maximum time to wait in seconds, defaults to 3600 (1 hour)
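Putting the lifecycle methods together, a typical usage sketch might look like the following. The run object is assumed to have been obtained from the evaluations client, the timeout value is illustrative, and the string comparison on the status is an assumption to be adapted to the actual Status type:
>>> run.wait_for_completion(timeout=1800)   # poll until the run finishes or 30 minutes pass
>>> status = run.get_current_status()       # current Status of the underlying execution
>>> if str(status) in ("FAILED", "DEAD"):   # assumption: compare via string representation
...     print(run.get_debug_info())         # execution status details, including failed pod information
...     logs = run.get_debug_logs()         # full execution log trace as a list of dicts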
- class ExecutionStatusDetails
Bases: object
Dataclass for execution status details.
- Parameters:
details (Any) -- Detailed information about the execution status
status (Any) -- Current status of the execution
- __init__(details, status)
- Parameters:
details (Any)
status (Any)
- Return type:
None
- details: Any
- status: Any
- class Results
Bases: object
Represents the Results handler for an EvaluationRun object.
This class provides methods to access completion results, metric results, and aggregated results for a specific evaluation run.
- Parameters:
run (EvaluationRun) -- The parent EvaluationRun object
- __init__(run)
- Parameters:
run (EvaluationRun)
- aggregations()
Get the aggregated results for the run from the tracking service.
- Returns:
JSON response containing aggregated metric results
- Return type:
dict
- Raises:
ValueError -- If error occurs while fetching aggregation results
- completions()
Get the completion results for the run.
- Returns:
DataFrame containing completion results for the run
- Return type:
pd.DataFrame
- Raises:
ValueError -- If error occurs while fetching completions
- metrics()
Get the metric-level results for the run.
- Returns:
DataFrame containing metric results for the run
- Return type:
pd.DataFrame
- Raises:
ValueError -- If error occurs while fetching metric results
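Once a run has completed, the Results handler can be used roughly as sketched below; the columns of the returned DataFrames are not specified here and depend on the configured metrics:
>>> results = run.results()                  # raises ValueError if the execution is not completed
>>> completions_df = results.completions()   # pd.DataFrame of completion results per dataset row
>>> metrics_df = results.metrics()           # pd.DataFrame of metric-level results
>>> summary = results.aggregations()         # dict of aggregated metric results from the tracking service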
- configure_pandas_display()
gen_ai_hub.evaluations.models.metric_config module
- class MetricConfig
Bases: object
Defines the metric configuration for the evaluation flow.
- Parameters:
reference (MetricRef) -- Reference to the metric to be evaluated; the metric can be identified by name, UUID (id), or a scenario/name/version combination
variable_mapping (Optional[dict]) -- Any variable mapping associated with the metric
- __init__(reference, variable_mapping=None)
- Parameters:
reference (MetricRef)
variable_mapping (dict)
- class MetricRef
Bases: object
Represents a reference to a specific metric definition.
A metric can be identified in multiple ways:
- By its UUID from the metric management service (id)
- By name (name)
- By a combination of scenario, name, and version (scenario, name, version)
- __init__(scenario=None, name=None, version=None, id=None)
- Parameters:
scenario (str)
name (str)
version (str)
id (str)
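As an illustration of the identification options above, a sketch of constructing metric references and wrapping one in a MetricConfig; the metric names, UUID, variable mapping keys, and the import path are placeholders or assumptions:
>>> from gen_ai_hub.evaluations.models import MetricConfig, MetricRef  # import path assumed
>>> by_name = MetricRef(name="accuracy")
>>> by_id = MetricRef(id="a1b2c3d4-0000-1111-2222-333344445555")       # placeholder UUID from the metric management service
>>> by_scenario = MetricRef(scenario="foundation-models", name="accuracy", version="1.0")
>>> metric = MetricConfig(
...     reference=by_name,
...     variable_mapping={"prediction": "completion", "reference": "ground_truth"},  # keys are illustrative
... )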