nicetoolbox.detectors.method_detectors.whisperx.whisperx_detector.WhisperX

class nicetoolbox.detectors.method_detectors.whisperx.whisperx_detector.WhisperX(io: SequenceIO, data: SequenceData, sequence_context: SequenceRuntimeConfig, algorithm_instance: str)

Bases: BaseMethod

Initialize base method detector with references.

Methods

`compute_output_folders`	Compute extra output folders for all components.
`compute_result_folders`	Compute result folders for all components.
`compute_viz_folders`	Compute visualization folders for all components.
`post_inference`	Process the individual speaker-aligned transcription JSON outputs into our final JSON format.
`run`	Execute method detector: run subprocess inference + post_inference.
`visualization`	Generate visualizations that overlay SRT subtitles onto video files.

Attributes

`algorithm_type`
`components`
`inference_package_name`
`predictions_mapping`	Access predictions mapping from runtime config.
`runtime`
`os_type`
`conda_path`
`venv`
`env_name`
`script_path`
`visualize`
`requires_out_folder`
`out_folders`
`result_folders`
`viz_folders`
`config_paths`
`data`
`io`
`sequence_context`
`detector_config`
`algorithm_instance`
`inference_config`

compute_output_folders(requires_out_folder: bool) → Dict[str, str]

Compute extra output folders for all components.

compute_result_folders() → Dict[str, str]

Compute result folders for all components.

compute_viz_folders(visualize: bool) → Dict[str, str]

Compute visualization folders for all components.

post_inference() → None

Process the individual speaker-aligned transcription JSON outputs into our final JSON format.

Structure:

    {
        "track_name": {
            "total": {
                "text": "full concatenated transcription text for the track",
                "start": start_time_of_first_segment,
                "end": end_time_of_last_segment,
            },
            "segments": [
                {
                    "start": segment_start_time,
                    "end": segment_end_time,
                    "text": "segment_transcription_text",
                    "avg_logprob": segment_avg_log_probability,
                },
            ],
            "word_segments": [
                {
                    "word": word_text,
                    "start": word_start_time,
                    "end": word_end_time,
                    "score": word log probability score,
                    "speaker": speaker_label provided by pyannote
                },
            ],
            "language": detected_language
        }
    }
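To make the relationship between the fields concrete, here is a minimal sketch of how the "total" entry could be derived from a track's "segments", and how "word_segments" might be grouped by speaker. The helpers `build_total` and `words_by_speaker` are hypothetical names invented for this illustration, the sample values are made up, and none of this is taken from the toolbox's actual post_inference implementation.

```python
from typing import Any, Dict, List


def build_total(segments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: summarize a track's segments as a "total" entry."""
    if not segments:
        return {"text": "", "start": None, "end": None}
    # Order by start time so "first" and "last" segment are well defined.
    ordered = sorted(segments, key=lambda seg: seg["start"])
    return {
        "text": " ".join(seg["text"].strip() for seg in ordered),
        "start": ordered[0]["start"],
        "end": ordered[-1]["end"],
    }


def words_by_speaker(word_segments: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Hypothetical helper: collect word texts per speaker label."""
    grouped: Dict[str, List[str]] = {}
    for word in word_segments:
        # Assumption: words without a speaker label go under "unknown".
        grouped.setdefault(word.get("speaker", "unknown"), []).append(word["word"])
    return grouped


# Made-up sample data that follows the structure documented above.
track = {
    "segments": [
        {"start": 0.5, "end": 1.8, "text": " Hello there.", "avg_logprob": -0.21},
        {"start": 2.0, "end": 3.4, "text": " How are you?", "avg_logprob": -0.35},
    ],
    "word_segments": [
        {"word": "Hello", "start": 0.5, "end": 0.9, "score": 0.93, "speaker": "SPEAKER_00"},
        {"word": "there.", "start": 1.0, "end": 1.8, "score": 0.88, "speaker": "SPEAKER_00"},
        {"word": "How", "start": 2.0, "end": 2.3, "score": 0.91, "speaker": "SPEAKER_01"},
    ],
    "language": "en",
}

total = build_total(track["segments"])
grouped = words_by_speaker(track["word_segments"])
print(total)
print(grouped)
```

Sorting by start time before concatenating is a design choice of this sketch; it keeps "start" and "end" consistent with the first and last segment even if the input list is unordered.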

property predictions_mapping

Access predictions mapping from runtime config.

run() → None

Execute method detector: run subprocess inference + post_inference.

Returns None - visualization uses external data.

visualization(_) → None

Generate visualizations that overlay SRT subtitles onto video files.

Uses the SRT files generated as extra outputs of the audio transcription component. These SRT files are raw WhisperX outputs that contain the final speaker-aligned transcription segments with speaker labels.

We create a new video from scratch, using the video frames if they are available or a black background otherwise, and overlay the SRT subtitles onto it.
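The subtitle lookup behind such an overlay can be sketched as follows: parse the SRT cues, then pick the cue that is active at each frame's timestamp. This is an illustrative, standard-library-only sketch; `Cue`, `parse_srt`, `active_text`, the sample SRT content, and the frame rate of 25 are all assumptions made for the example and do not describe the toolbox's actual rendering code.

```python
import re
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues: List[Cue] = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = [line for line in block.splitlines() if line.strip()]
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue
        start_raw, end_raw = lines[timing].split("-->")
        text = "\n".join(lines[timing + 1:])
        cues.append(Cue(_to_seconds(start_raw), _to_seconds(end_raw), text))
    return cues


def active_text(cues: List[Cue], timestamp: float) -> Optional[str]:
    """Return the subtitle text shown at `timestamp`, or None if no cue is active."""
    for cue in cues:
        if cue.start <= timestamp < cue.end:
            return cue.text
    return None


# Made-up SRT content; the speaker-label prefix is illustrative only.
SAMPLE_SRT = """1
00:00:00,500 --> 00:00:01,800
[SPEAKER_00]: Hello there.

2
00:00:02,000 --> 00:00:03,400
[SPEAKER_01]: How are you?
"""

cues = parse_srt(SAMPLE_SRT)
fps = 25  # assumed frame rate for the example
for frame_index in (30, 48, 60):
    print(frame_index, active_text(cues, frame_index / fps))
```

A renderer would call a lookup like `active_text` once per frame and draw the returned text onto either the video frame or a black image of the same size.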