Evaluation Results Wrapper Tutorial (python)¶
The Evaluation Results Wrapper provides a simple, pandas-like interface for querying and analyzing evaluation results stored in .npz files.
Overview¶
The EvaluationResults class wraps evaluation metrics stored in .npz files and provides a pandas DataFrame–backed API for:
Loading and indexing multi-dimensional evaluation results (persons, cameras, frames, labels, etc.)
Filtering by dataset, sequence, algorithm, component, person, camera, metric name, and label
Aggregating metrics with built-in functions (mean, std, min, max, median, etc.)
Exporting results to CSV or DataFrames for downstream analysis
Key Benefits:
No manual .npz parsing
Consistent query and aggregation interface
Seamless integration with pandas for custom analysis
Quick Start¶
from pathlib import Path
from nicetoolbox.evaluation.results_wrapper import EvaluationResults
# Load results from evaluation output folder
root = Path("/path/to/output_folder/20240315_eval")
results = EvaluationResults(root=root, memory_optimized=True)
# Inspect available data
print(results)
# Query specific algorithm and person
results.query(algorithm="hrnetw48", person="p1")
# Aggregate and export
agg_df = results.aggregate(group_by=["algorithm", "metric_name"]).to_dataframe()
print(agg_df)
Core Operations¶
Loading Results¶
Initialize from an evaluation output folder containing .npz metric files:
from pathlib import Path
from nicetoolbox.evaluation.results_wrapper import EvaluationResults
root = Path("/outputs/experiments/20251115_eval")
results = EvaluationResults(root=root)
Inspecting Data¶
Quick Overview:
print(results)
Preview DataFrame:
df = results.to_dataframe()
print(df.head(10))
Reset to Original View:¶
results.reset()
[!NOTE] Each query() or aggregate() call updates the internal state of the results. Use reset() to restore the original dataset when you want to start a new analysis.
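The stateful behaviour described in the note can be pictured with a small stand-in class. This is only an illustrative sketch under the assumption that successive filters accumulate on the current view; `ToyResults` and its rows are hypothetical and are not the nicetoolbox implementation, which works on a pandas DataFrame.

```python
class ToyResults:
    """Hypothetical stand-in that mimics the stateful view described above."""

    def __init__(self, rows):
        self._original = list(rows)  # full dataset, never modified
        self._view = list(rows)      # current (possibly filtered) view

    def query(self, **filters):
        # Each call narrows the *current* view, so filters accumulate.
        self._view = [
            row for row in self._view
            if all(row[key] == value for key, value in filters.items())
        ]
        return self

    def reset(self):
        # Restore the view to the full dataset.
        self._view = list(self._original)
        return self

    def __len__(self):
        return len(self._view)


rows = [
    {"algorithm": "hrnetw48", "person": "p1"},
    {"algorithm": "hrnetw48", "person": "p2"},
    {"algorithm": "vitpose", "person": "p1"},
]
toy = ToyResults(rows)

toy.query(algorithm="hrnetw48")
n_after_first = len(toy)  # rows matching hrnetw48

toy.query(algorithm="vitpose")
n_stale = len(toy)        # the earlier hrnetw48 filter is still active

toy.reset().query(algorithm="vitpose")
n_fresh = len(toy)        # same query, but starting from the full dataset
```

In this toy model the second query returns nothing because it runs on the already filtered view; calling reset() first gives the expected match.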
Querying / Filtering¶
Filter the internal DataFrame using keyword arguments. Both single values and lists are supported:
Single Value Queries:
# Filter by dataset
results.query(dataset="dataset_A")
# Filter by algorithm
results.query(algorithm="hrnetw48")
# Filter by metric name
results.query(metric_name="jpe")
Multi-Value Queries:
# Select multiple labels
results.query(label=["nose", "left_knee", "right_knee"])
# Select multiple algorithms
results.query(algorithm=["hrnetw48", "vitpose"])
Easily combine queries in a single call:
results.query(
    dataset="dataset_A",
    algorithm="vitpose",
    person=["p1", "p2"]
)
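One way to read a combined query is as a per-column membership test: a single value behaves like a one-element list, values inside a list are alternatives, and all columns must match. The sketch below illustrates that reading with plain dictionaries; `row_matches` is a hypothetical helper, and the AND/OR semantics are an assumption about the wrapper rather than a description of its code (in pandas the same idea is usually expressed with `Series.isin`).

```python
def row_matches(row, **filters):
    """Return True if `row` satisfies every filter (hypothetical helper)."""
    for column, wanted in filters.items():
        # Treat a single value as a one-element list.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row[column] not in allowed:
            return False
    return True


# Made-up rows for illustration only.
rows = [
    {"dataset": "dataset_A", "algorithm": "vitpose", "person": "p1"},
    {"dataset": "dataset_A", "algorithm": "vitpose", "person": "p3"},
    {"dataset": "dataset_A", "algorithm": "hrnetw48", "person": "p2"},
    {"dataset": "dataset_B", "algorithm": "vitpose", "person": "p2"},
]

selected = [
    row for row in rows
    if row_matches(row, dataset="dataset_A", algorithm="vitpose", person=["p1", "p2"])
]
print(selected)
```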
Aggregation¶
Compute summary statistics grouped by specified columns:
Basic Aggregation (Default: Mean):
agg_df = results.aggregate(group_by=["dataset", "algorithm", "metric_name"]).to_dataframe()
print(agg_df)
Output:
dataset algorithm metric_name mean_value
0 dataset_A hrnetw48 jpe 43.2
1 dataset_A vitpose jpe 38.7
2 dataset_B hrnetw48 bone_length 152.3
...
Multi-Statistic Aggregation:
agg_funcs = ["mean", "std", "min", "max", "median"]
summary = results.aggregate(
    group_by=["algorithm", "metric_name"],
    agg_funcs=agg_funcs
).to_dataframe()
print(summary)
Output:
algorithm metric_name mean std min max median
0 hrnetw48 jpe 43.2 12.1 15.4 98.3 41.5
1 vitpose jpe 38.7 10.5 12.1 87.6 37.2
Supported Aggregation Functions:
"mean","std","min","max","median","sum","count"
[!NOTE] Aggregation creates a new EvaluationResults instance with the aggregated DataFrame. Use .to_dataframe() to access it. Use .reset() to restore the original view of all available results.
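Conceptually, a grouped aggregation collects the metric values that share the same group_by key and reduces each group with the requested functions. The following standard-library sketch shows that idea on made-up numbers; the function `aggregate`, the `value` column name, and the rows are hypothetical and do not reflect how EvaluationResults is implemented. Here "std" is the sample standard deviation, which is also the pandas default.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical mapping from function name to reducer.
AGG_FUNCS = {"mean": mean, "std": stdev, "min": min, "max": max, "median": median}


def aggregate(rows, group_by, agg_funcs=("mean",), value_key="value"):
    """Group `rows` by the `group_by` columns and reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row[value_key])

    summary = []
    for key, values in sorted(groups.items()):
        record = dict(zip(group_by, key))
        for name in agg_funcs:
            record[name] = AGG_FUNCS[name](values)
        summary.append(record)
    return summary


# Made-up values for illustration only.
rows = [
    {"algorithm": "hrnetw48", "metric_name": "jpe", "value": 40.0},
    {"algorithm": "hrnetw48", "metric_name": "jpe", "value": 50.0},
    {"algorithm": "vitpose", "metric_name": "jpe", "value": 30.0},
    {"algorithm": "vitpose", "metric_name": "jpe", "value": 40.0},
]

summary = aggregate(rows, ["algorithm", "metric_name"], agg_funcs=["mean", "min", "max"])
for record in summary:
    print(record)
```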
Exporting Results¶
Export to pandas DataFrame:
df = results.to_dataframe()
Export to CSV:
output_path = results.to_csv(
    output_dir=Path("./exports"),
    base_name="my_results"
)
print(f"Exported to: {output_path}")
Chainable operations¶
The functions reset(), query() and aggregate() return the mutated instance of the EvaluationResults class. This allows multiple function calls to be chained. The common use cases below make use of this.
Common Use Cases¶
Use Case 1: Compare Algorithms Across Cameras¶
Goal: In the context of human pose estimation (HPE), compare JPE performance of two algorithms broken down by camera.
from pathlib import Path
from nicetoolbox.evaluation.results_wrapper import EvaluationResults
root = Path("/outputs/experiments/20240315_eval")
results = EvaluationResults(root=root)
camera_comparison = (
    results.query(metric_name="jpe")
    .aggregate(group_by=["dataset", "algorithm", "camera", "metric_name"])
    .to_dataframe()
)
print(camera_comparison)
Output:
dataset algorithm camera metric_name mean
0 dataset_A hrnetw48 c1 jpe 42.3
1 dataset_A hrnetw48 c2 jpe 45.1
3 dataset_A vitpose c1 jpe 38.2
4 dataset_A vitpose c2 jpe 39.5
6 dataset_B hrnetw48 c1 jpe 50.4
7 dataset_B hrnetw48 c2 jpe 52.6
9 dataset_B vitpose c1 jpe 45.7
10 dataset_B vitpose c2 jpe 47.3
...
Insight: Identify which cameras have higher errors and which algorithm performs better per viewpoint.
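As a possible follow-up, the comparison can be reduced to the best algorithm per camera. The sketch below uses the dataset_A rows from the sample output above, written as plain dictionaries (roughly what `camera_comparison.to_dict("records")` would give). It assumes the aggregated column is named `mean`, as in that output, and that a lower JPE is better.

```python
# Rows copied from the sample output above (dataset_A only).
rows = [
    {"dataset": "dataset_A", "algorithm": "hrnetw48", "camera": "c1", "mean": 42.3},
    {"dataset": "dataset_A", "algorithm": "hrnetw48", "camera": "c2", "mean": 45.1},
    {"dataset": "dataset_A", "algorithm": "vitpose", "camera": "c1", "mean": 38.2},
    {"dataset": "dataset_A", "algorithm": "vitpose", "camera": "c2", "mean": 39.5},
]

# Keep, for every (dataset, camera) pair, the row with the lowest mean error.
best_per_camera = {}
for row in rows:
    key = (row["dataset"], row["camera"])
    current = best_per_camera.get(key)
    if current is None or row["mean"] < current["mean"]:
        best_per_camera[key] = row

for (dataset, camera), row in sorted(best_per_camera.items()):
    print(f"{dataset} {camera}: {row['algorithm']} ({row['mean']})")
```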
Use Case 2: Multi-Statistic Summary for a Dataset¶
Goal: Generate a detailed summary table with multiple statistics for one dataset.
results.reset()
summary = (
    results.query(dataset="dataset_A")
    .aggregate(
        group_by=["algorithm", "metric_name"],
        agg_funcs=["mean", "std", "min", "max", "median"]
    )
    .to_dataframe()
)
print(summary)
Output:
algorithm metric_name mean std min max median
0 hrnetw48 jpe 43.2 12.1 15.4 98.3 41.5
1 hrnetw48 bone_length 152.3 8.7 130.2 175.4 151.8
2 vitpose jpe 38.7 10.5 12.1 87.6 37.2
3 vitpose bone_length 153.1 7.9 135.6 172.3 152.5
Insight: Include multiple statistics in reports or, for example, compare robustness (std) across algorithms.
Use Case 3: Export Subset for External Analysis¶
Goal: Export specific labels (left and right knee) aggregated by dataset, sequence, and algorithm.
output_folder = Path("./exports")
output_path = (
    results.reset()
    .query(label=["left_knee", "right_knee"])
    .aggregate(group_by=["dataset", "sequence", "algorithm", "label"])
    .to_csv(output_dir=output_folder, base_name="knees_summary")
)
print(f"Exported to: {output_path}")
Output:
Exported to: ./exports/knees_summary.csv
CSV Content:
dataset,sequence,algorithm,label,mean
dataset_A,seq_01,hrnetw48,left_knee,55.2
dataset_A,seq_01,hrnetw48,right_knee,54.8
dataset_A,seq_01,vitpose,left_knee,50.3
dataset_A,seq_01,vitpose,right_knee,51.1
...
Insight: Share with collaborators or import into Excel/R for further analysis.
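The exported file can also be read back without pandas. The sketch below parses the CSV content shown above with the standard-library csv module and averages the two knee values per algorithm. The text is embedded as a string so the snippet runs on its own; with a real export you would open the file at output_path instead, assuming to_csv returns the written path as the print statement above suggests.

```python
import csv
import io

# CSV content copied from the sample above.
csv_text = """dataset,sequence,algorithm,label,mean
dataset_A,seq_01,hrnetw48,left_knee,55.2
dataset_A,seq_01,hrnetw48,right_knee,54.8
dataset_A,seq_01,vitpose,left_knee,50.3
dataset_A,seq_01,vitpose,right_knee,51.1
"""

reader = csv.DictReader(io.StringIO(csv_text))
# csv yields strings, so convert the metric column to float.
rows = [{**row, "mean": float(row["mean"])} for row in reader]

# Average the left and right knee values for each algorithm.
knee_mean = {}
for algorithm in sorted({row["algorithm"] for row in rows}):
    values = [row["mean"] for row in rows if row["algorithm"] == algorithm]
    knee_mean[algorithm] = sum(values) / len(values)

for algorithm, value in knee_mean.items():
    print(f"{algorithm}: {value:.1f}")
```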
Use Case 4: Identify Problematic Labels¶
Goal: Find the joints with the highest mean JPE for a specific algorithm. This requires ground-truth (GT) annotations to be available.
results.reset()
vitpose_jpe = (
    results.query(dataset="your_dataset", algorithm="vitpose", metric_name="jpe")
    .aggregate(group_by=["label"])
    .to_dataframe()
)
# Sort by mean error, largest first, and keep the five worst joints
worst_joints = vitpose_jpe.sort_values("mean_value", ascending=False).head(5)
print(worst_joints)
Output:
label mean_value
17 left_ankle 78.2
18 right_ankle 76.5
12 left_wrist 65.3
13 right_wrist 63.8
9 left_knee 58.1
Best Practices¶
Chain operations for clarity: results.query(...).aggregate(...).to_dataframe()
Reset between analyses: Use reset() to avoid stale filters
Questions or Issues?¶
Refer to the Evaluation Tutorial, open an issue on our GitHub or contact us.