NICE Toolbox Evaluation Overview¶
This tutorial guides you through the evaluation pipeline of the NICE Toolbox.
Overview¶
The NICE Toolbox evaluation pipeline allows for comprehensive assessment of NICE toolbox detectors and algorithms using a variety of metrics. It supports evaluations with and without ground truth annotations, making it versatile for different experimental setups.
Key Features:
Quality metrics for evaluation without ground truth annotations, for cases where no labels are available or labeling is expensive.
Classic evaluation of detector prediction and ground truth annotation pairs for selected datasets.
Detailed per-frame scores and aggregated summary statistics.
API for advanced analysis of raw evaluation results.
[!NOTE] Currently, only a limited set of datasets with GT annotations is implemented. We are working on simplifying the process of adding user datasets with their own label sets.
Configuration Setup¶
Machine Specific Paths¶
Ensure ./machine_specific_paths.toml contains:
datasets_folder_path = "/absolute/path/to/datasets"
output_folder_path = "/absolute/path/to/outputs"
This is set up by default during the installation.
Evaluation Config¶
The main configuration file is located at ./configs/evaluation_config.toml.
(1) Global Settings:¶
git_hash: Automatically filled with the current git commit hash
device: Computing device ("cpu" or "cuda:0")
batchsize: Number of frames processed per batch (higher = faster but more memory)
verbose: Enable detailed logging and CSV exports, including automatic summaries
skip_evaluation: Skip the main loop and only regenerate summaries from existing results
Here is an example:
# Global Settings
git_hash = "<git_hash>"
device = "cuda:0"
batchsize = 10000
verbose = true
skip_evaluation = false
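To make the batchsize trade-off concrete, the sketch below splits a range of frames into batches. It is an illustration only: `frame_batches` is a hypothetical helper and the frame count is made up, so the toolbox's internal batching may differ.

```python
import math


def frame_batches(n_frames: int, batchsize: int):
    """Yield (start, stop) frame index ranges that cover n_frames in order."""
    if batchsize <= 0:
        raise ValueError("batchsize must be positive")
    for start in range(0, n_frames, batchsize):
        yield start, min(start + batchsize, n_frames)


# 25,000 frames with batchsize = 10000 give three batches, the last one partial.
batches = list(frame_batches(25_000, 10_000))
n_batches = math.ceil(25_000 / 10_000)
print(n_batches, batches)
```

A larger batchsize means fewer iterations but more frames held in memory at once.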
(2) Evaluation IO¶
experiment_name, experiment_folder: The evaluation pipeline needs access to the results of the NICE toolbox detectors. The first two fields of the IO config specify the path to the folder that contains these detector results.
output_folder: The outputs and results of the evaluation are exported to this folder.
[io]
# Evaluation input folders (NICE toolbox detector output folders)
experiment_name = "<yyyymmdd>" # Default placeholder for experiments run on the same day.
experiment_folder = "<output_folder_path>/experiments/<experiment_name>" # Default folder for all experiments
# Evaluation output folders
output_folder = "<experiment_folder>_eval" # Output folder of evaluation
eval_visualization_folder = "<output_folder>/visualization"
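The angle-bracket placeholders in these values refer to other config fields. The sketch below shows one way such placeholders could be expanded. It is not the toolbox's actual resolver: `resolve_placeholders` and the concrete experiment name are assumptions made for illustration.

```python
import re

PLACEHOLDER = re.compile(r"<([A-Za-z_]+)>")


def resolve_placeholders(fields: dict) -> dict:
    """Repeatedly substitute <name> with the value of a known field."""
    resolved = dict(fields)
    for _ in range(len(resolved) + 1):  # enough passes for chained references
        changed = False
        for key, value in resolved.items():
            # Unknown placeholders are left untouched.
            new_value = PLACEHOLDER.sub(
                lambda m: resolved.get(m.group(1), m.group(0)), value
            )
            if new_value != value:
                resolved[key] = new_value
                changed = True
        if not changed:
            break
    return resolved


fields = {
    "output_folder_path": "/absolute/path/to/outputs",
    "experiment_name": "20240101",  # hypothetical example value
    "experiment_folder": "<output_folder_path>/experiments/<experiment_name>",
    "output_folder": "<experiment_folder>_eval",
    "eval_visualization_folder": "<output_folder>/visualization",
}
resolved = resolve_placeholders(fields)
print(resolved["eval_visualization_folder"])
```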
(3) Metric selection and configuration¶
Metrics are grouped under categories called metric_type that are compatible with specific data types. Each metric_type has its own config and selection. For a detailed explanation and an overview of all available metrics, please refer to the evaluation metrics wiki page.
In this section, please select the metrics to be run. For each metric type, you can specify the list of metric names to be computed with the metric_names field.
[metrics.point_cloud_metrics] # Metric type (here: point cloud metrics)
metric_names = ["jpe"] # List of metric names to compute
gt_required = true
[metrics.keypoint_metrics]
metric_names = ["jump_detection", "bone_length"]
gt_required = false
gt_components = ["body_joints", "hand_joints", "face_landmarks"]
keypoint_mapping_file = "configs/predictions_mapping.toml"
[metrics.categorical_metrics]
metric_names = ["accuracy", "precision", "recall", "f1_score"]
gt_required = true
(4) Metric aggregation summaries¶
Here you can define multiple summaries with different aggregation settings that are automatically computed after evaluation when verbose is set to true. Please refer to the section below for more details.
[summaries.bone_length_report] # Name of the summary
metric_names = ["bone_length"] # List of metric names to include in the summary
aggr_functions = ["mean", "std", "min", "max"] # List of aggregation functions to apply
filter = {dataset = "communication_multiview"} # Filters to apply before aggregation
aggregate_dims = ["sequence", "person", "camera", "frame"] # Dimensions to aggregate over
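As a rough picture of what such a summary computes, the sketch below reduces a metric array over the chosen dimensions with each aggregation function. It assumes a (person, camera, frame, label) axis layout for a single sequence and uses NaN-aware reductions on the assumption that missing detections may be stored as NaN. The helper name `aggregate` and the random data are illustrative, not the toolbox's implementation.

```python
import numpy as np

AXES = ("person", "camera", "frame", "label")
AGGR = {"mean": np.nanmean, "std": np.nanstd, "min": np.nanmin, "max": np.nanmax}


def aggregate(values, aggregate_dims, aggr_functions):
    """Reduce `values` over the named dimensions with each aggregation function."""
    axes = tuple(AXES.index(dim) for dim in aggregate_dims)
    return {name: AGGR[name](values, axis=axes) for name in aggr_functions}


# Synthetic data: 2 persons, 3 cameras, 100 frames, 4 bone labels.
rng = np.random.default_rng(0)
bone_length = rng.uniform(0.2, 0.5, size=(2, 3, 100, 4))

summary = aggregate(
    bone_length,
    aggregate_dims=["person", "camera", "frame"],
    aggr_functions=["mean", "std", "min", "max"],
)
print(summary["mean"].shape)  # one value per remaining label
```

Dimensions that are not listed in aggregate_dims remain in the result, so leaving out label yields one row per label.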
Running the Evaluation¶
Step 1: Input selection¶
Ensure your experiment folder contains detector outputs (.npz files) and select it in the evaluation config under [io]. See the evaluation config descriptions above.
Ground truth annotations (if required by selected metrics) need to be processed and stored inside the dataset folder. A tutorial on how to add custom datasets with annotations will be available soon. Please contact us for more information or future collaborations.
Step 2: Configure Metrics¶
Edit ./configs/evaluation_config.toml to select desired metrics:
[metrics.point_cloud_metrics]
metric_names = ["jpe"]
[metrics.keypoint_metrics]
metric_names = ["bone_length"]
Optionally, edit the summary reports list as well.
Step 3: Run Evaluation¶
cd /path/to/nicetoolbox/
envs\nicetoolbox\Scripts\activate # Windows
source ./envs/nicetoolbox/bin/activate # Linux
run_evaluation
Step 4: Monitor Progress and check the results¶
Check the log file at /path/to/output_folder/evaluation.log and verify the success of the pipeline by looking at the results.
Summary Generation¶
After evaluation completes, summaries are automatically generated in <output_folder>/csv_files/summaries/. For this to happen, set verbose = true in the global settings at the top of the evaluation config, and configure the summaries in part (4) of the evaluation config:
# === (4) Metric aggregation summaries ===
# Here you can define multiple summaries with different aggregation settings
# that are automatically computed after evaluation when `verbose` is set to true.
Based on the example dataset, there are already a few summaries provided that showcase the flexibility of these automatic reports.
Customizing Summaries¶
Edit summaries in the evaluation config. Here you can customize the following per summary:
The name of the summary report. It will be used for exporting the results.
# Name of the summary (Only change the part after "summaries.")
[summaries.bone_length_report]
The metrics to include in the summary.
metric_names = ["bone_length"] # List of metric names to include in the summary
The aggregation functions to apply (e.g., mean, std, min, max).
aggr_functions = ["mean", "std", "min", "max"] # List of aggregation functions to apply
Filters to apply before aggregation. Here you can flexibly filter along all your data dimensions. These include: metric_name, dataset, sequence, component, algorithm, metric_type, person, camera, label. Add these keys to the filter dictionary together with the values that you want to query in your data.
# With this filter, the selected aggregations are applied and exported only for the jump_detection metric, restricted to the keypoints of interest (the two wrist joints).
filter = {metric_name=["jump_detection"], label=["left_wrist", "right_wrist"]}
The dimensions to aggregate over. Please select a subset of sequence, person, camera, frame and label.
aggregate_dims = ["person", "camera", "frame", "label"] # Dimensions to aggregate over
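The filter semantics described above can be pictured as follows: a row is kept only if, for every key in the filter, its value equals the filter value or is contained in the filter list. This is a dependency-free sketch with made-up rows. `matches` is a hypothetical helper, and the toolbox may implement filtering differently.

```python
def matches(row: dict, filter_spec: dict) -> bool:
    """Return True if the row satisfies every key of the filter."""
    for key, allowed in filter_spec.items():
        # A single value such as dataset = "communication_multiview"
        # is treated like a one-element list.
        if not isinstance(allowed, (list, tuple, set)):
            allowed = [allowed]
        if row.get(key) not in allowed:
            return False
    return True


rows = [
    {"metric_name": "jump_detection", "label": "left_wrist", "value": 0.0},
    {"metric_name": "jump_detection", "label": "nose", "value": 1.0},
    {"metric_name": "bone_length", "label": "left_lower_leg", "value": 0.41},
    {"metric_name": "jump_detection", "label": "right_wrist", "value": 1.0},
]
filter_spec = {
    "metric_name": ["jump_detection"],
    "label": ["left_wrist", "right_wrist"],
}
selected = [row for row in rows if matches(row, filter_spec)]
print([row["label"] for row in selected])
```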
Regenerating Summaries¶
To regenerate summaries without re-running evaluation, set
skip_evaluation = true
in the first part of the evaluation_config and then rerun the nicetoolbox evaluation with:
run_evaluation
Understanding Results¶
Output Structure¶
<output_folder>/
├── evaluation.log
├── config_<time>.log
├── <dataset_name>__<sequence_ID>/
│   ├── <component_name>/
│   │   ├── <algorithm_name>__<metric_type>.npz
│   │   └── ... # One npz file for each metric_type
│   └── csv_files/
│       ├── summaries/
│       │   ├── <summary_name>.csv
│       │   └── ... # One csv file for each configured summary report
│       ├── <dataset_name>__<sequence_ID>_<component>_<algorithm>__<metric_type>_<metric_name>.csv
│       └── ... # One csv file for each metric
└── visualization/
    └── Coming Soon
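Given this layout, result files can be discovered by globbing for .npz files and splitting folder and file names on the double underscore. The sketch below builds a small throwaway tree so that it runs on its own. `list_results` is a hypothetical helper, and the folder and file names (for example the sequence ID `session_1`) are example values only.

```python
import tempfile
from pathlib import Path


def list_results(output_folder: Path):
    """Return (dataset, sequence, component, algorithm, metric_type) tuples."""
    results = []
    # Pattern: <dataset>__<sequence>/<component>/<algorithm>__<metric_type>.npz
    for npz_path in sorted(output_folder.glob("*/*/*.npz")):
        dataset, sequence = npz_path.parts[-3].split("__", 1)
        component = npz_path.parts[-2]
        algorithm, metric_type = npz_path.stem.split("__", 1)
        results.append((dataset, sequence, component, algorithm, metric_type))
    return results


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    component_dir = root / "communication_multiview__session_1" / "body_joints"
    component_dir.mkdir(parents=True)
    for name in ("vitpose__keypoint_metrics.npz", "vitpose__point_cloud_metrics.npz"):
        (component_dir / name).touch()  # empty stand-ins for real result files
    found = list_results(root)

print(found)
```

Splitting on the first double underscore assumes that dataset and algorithm names do not contain one themselves.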
What is inside a single .npz file?¶
Let's look at the body_joints (component) results for vitpose (algorithm):
├── vitpose__keypoint_metrics.npz
│ ├── bone_length.npy # numpy array with shape ( # persons, # cameras, # frames, # bone lengths (# bone names) )
│ ├── jump_detection.npy # numpy array with shape ( # persons, # cameras, # frames, # jump metric scores (# body joints) )
│ └── data_description.npy # dictionary of dicts with Dict["axis0"=List[persons], "axis1"=List[cameras], "axis2"=List[frames], "axis3"=List[labels]]
├── vitpose__point_cloud_metrics.npz
│ ├── jpe.npy # numpy array with shape ( # persons, # cameras, # frames, # jpe metric scores (# body joints) )
│ └── data_description.npy # dictionary of dicts with Dict["axis0"=List[persons], "axis1"=List[cameras], "axis2"=List[frames], "axis3"=List[labels]]
Result Files¶
Each .npz file contains:
Metric arrays: Multi-dimensional arrays indexed by (person, camera, frame, label)
Description dictionaries: Metadata called data_description for each metric array. The data_description is a dictionary that describes each dimension of the given output npy arrays.
For example:
data_description = {
    "bone_length": {
        "axis0": ["person_left", "person_right"],  # A list of persons in the video
        "axis1": ["camera_front", "camera_top"],  # A list of camera perspectives in multi-camera setups
        "axis2": [frames],  # The frames that were processed
        "axis3": ["left_lower_leg", "right_lower_leg", ..., "right_upper_arm"],  # The labels (bone names -> lengths)
    },
    "jump_detection": {
        # axis0 to axis2 are the same as for bone_length
        "axis3": ["nose", "left_wrist", "right_wrist", ..., "right_toe"],  # The labels (keypoints/joints -> jumps)
    },
}
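The sketch below shows how such a file could be read with NumPy and how data_description can be used to look up an index by name. It writes a small synthetic file first so that it runs on its own. Storing data_description as a pickled dictionary inside the archive (hence `allow_pickle=True` and `.item()`) is an assumption about the file format, and all names and values are example data.

```python
import tempfile
from pathlib import Path

import numpy as np

description = {
    "jump_detection": {
        "axis0": ["person_left", "person_right"],
        "axis1": ["camera_front", "camera_top"],
        "axis2": list(range(5)),
        "axis3": ["nose", "left_wrist", "right_wrist"],
    }
}
scores = np.zeros((2, 2, 5, 3))
scores[1, 0, 3, 1] = 1.0  # one jump: person_right, camera_front, frame 3, left_wrist

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vitpose__keypoint_metrics.npz"
    # Write a synthetic archive that mimics the assumed layout.
    np.savez(
        path,
        jump_detection=scores,
        data_description=np.array(description, dtype=object),
    )
    # Read it back; the dictionary is stored as a 0-d object array.
    with np.load(path, allow_pickle=True) as archive:
        jumps = archive["jump_detection"]
        axes = archive["data_description"].item()["jump_detection"]

# Translate names into indices instead of hard-coding positions.
person = axes["axis0"].index("person_right")
label = axes["axis3"].index("left_wrist")
jumps_per_camera = jumps[person, :, :, label].sum(axis=1)
print(dict(zip(axes["axis1"], jumps_per_camera.tolist())))
```

Only load files with `allow_pickle=True` if you trust their origin, since unpickling can execute arbitrary code.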
Evaluation Results Wrapper (python)¶
We have created a simple python API to allow for more complex analysis of the raw evaluation results with the following features:
Fine-grained querying/filtering of high-dimensional data
Flexible aggregations
Converting to pandas DataFrame
Exporting to .csv files
Please refer to the tutorial to get started with the pandas-based API.
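To illustrate what converting the results to a table means here, the sketch below flattens a four-dimensional metric array into one row per (person, camera, frame, label) combination and writes the rows as CSV. It deliberately uses only NumPy and the standard library rather than the toolbox's own API, whose functions are not shown in this overview. `to_rows` and the sample data are hypothetical.

```python
import csv
import io

import numpy as np

DIMS = ("person", "camera", "frame", "label")


def to_rows(values, axes):
    """Yield one dict per array element, with names taken from the axis lists."""
    names = [axes[f"axis{i}"] for i in range(len(DIMS))]
    for index in np.ndindex(values.shape):
        row = {dim: names[i][index[i]] for i, dim in enumerate(DIMS)}
        row["value"] = float(values[index])
        yield row


axes = {
    "axis0": ["person_left", "person_right"],
    "axis1": ["camera_front"],
    "axis2": [0, 1],
    "axis3": ["left_lower_leg", "right_lower_leg"],
}
values = np.arange(8, dtype=float).reshape(2, 1, 2, 2)
rows = list(to_rows(values, axes))

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[*DIMS, "value"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue().splitlines()[0])
```

A list of row dictionaries like this can also be passed to `pandas.DataFrame(rows)` for further analysis.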