nicetoolbox.detectors.method_detectors.whisperx.whisperx_detector.WhisperX

class nicetoolbox.detectors.method_detectors.whisperx.whisperx_detector.WhisperX(io: SequenceIO, data: SequenceData, sequence_context: SequenceRuntimeConfig, algorithm_instance: str)

Bases: BaseMethod

Initialize base method detector with references.

Methods

compute_output_folders

Compute extra output folders for all components.

compute_result_folders

Compute result folders for all components.

compute_viz_folders

Compute visualization folders for all components.

post_inference

Process the individual speaker-aligned transcription JSON outputs into the final JSON format.

run

Execute method detector: run subprocess inference + post_inference.

visualization

Generate visualizations that overlay SRT subtitles onto video files.

Attributes

algorithm_type

components

inference_package_name

predictions_mapping

Access predictions mapping from runtime config.

runtime

os_type

conda_path

venv

env_name

script_path

visualize

requires_out_folder

out_folders

result_folders

viz_folders

config_paths

data

io

sequence_context

detector_config

algorithm_instance

inference_config

compute_output_folders(requires_out_folder: bool) → Dict[str, str]

Compute extra output folders for all components.

compute_result_folders() → Dict[str, str]

Compute result folders for all components.

compute_viz_folders(visualize: bool) → Dict[str, str]

Compute visualization folders for all components.

post_inference() → None

Process the individual speaker-aligned transcription JSON outputs into the final JSON format.

Structure:

{
    "track_name": {
        "total": {
            "text": "full concatenated transcription text for the track",
            "start": start_time_of_first_segment,
            "end": end_time_of_last_segment
        },
        "segments": [
            {
                "start": segment_start_time,
                "end": segment_end_time,
                "text": "segment_transcription_text",
                "avg_logprob": segment_avg_log_probability
            }
        ],
        "word_segments": [
            {
                "word": word_text,
                "start": word_start_time,
                "end": word_end_time,
                "score": word_log_probability_score,
                "speaker": speaker_label (provided by pyannote)
            }
        ],
        "language": detected_language
    }
}
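
As an illustration of the layout described above, the following minimal sketch assembles one per-track entry from a list of segments. It is not the toolbox's implementation: the helper name build_track_entry, the track name "track_0", and the sample values are hypothetical, and joining segment texts with a single space is an assumption about how the "total" text is concatenated.

```python
from typing import Any, Dict, List


def build_track_entry(
    segments: List[Dict[str, Any]],
    word_segments: List[Dict[str, Any]],
    language: str,
) -> Dict[str, Any]:
    """Assemble one per-track entry in the documented layout (hypothetical helper)."""
    # Sort by start time so "total" spans the first to the last segment.
    ordered = sorted(segments, key=lambda seg: seg["start"])
    total = {
        # Assumption: segment texts are joined with a single space.
        "text": " ".join(seg["text"].strip() for seg in ordered),
        "start": ordered[0]["start"] if ordered else None,
        "end": ordered[-1]["end"] if ordered else None,
    }
    return {
        "total": total,
        "segments": ordered,
        "word_segments": word_segments,
        "language": language,
    }


# Made-up sample data, only to show the shape of the result.
example_segments = [
    {"start": 0.5, "end": 2.0, "text": " Hello there.", "avg_logprob": -0.21},
    {"start": 2.4, "end": 4.1, "text": " How are you?", "avg_logprob": -0.30},
]
example_words = [
    {"word": "Hello", "start": 0.5, "end": 0.9, "score": 0.93, "speaker": "SPEAKER_00"},
]
result = {"track_0": build_track_entry(example_segments, example_words, "en")}
```

A consumer of the final JSON could then read result["track_0"]["total"]["text"] for the full transcript and iterate over "segments" or "word_segments" for timing information.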

property predictions_mapping

Access predictions mapping from runtime config.

run() → None

Execute method detector: run subprocess inference + post_inference.

Returns None - visualization uses external data.

visualization(_) → None

Generate visualizations that overlay SRT subtitles onto video files.

Uses the SRT files written to the extra outputs of the audio transcription component. These files are raw WhisperX outputs containing the final speaker-aligned transcription segments with speaker labels.

A new video is created from scratch, using the video frames if they are available and a black background otherwise, and the SRT subtitles are overlaid onto it.
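
The subtitle lookup behind such an overlay can be sketched with the standard library alone. The sketch below parses SRT text into timed cues and selects the cue that is active at a given frame time. It is an illustration under stated assumptions, not the method's actual code: the function names parse_srt and subtitle_at, the sample SRT content, and the frame rate of 25 are all hypothetical, and drawing the text onto frames is left out.

```python
import re
from typing import List, Optional, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT text into a list of cues (hypothetical helper)."""
    cues: List[Cue] = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->", 1)
                # Ignore anything after the end timestamp on the timing line.
                end_stamp = end.split()[0]
                body = "\n".join(lines[i + 1:]).strip()
                cues.append((_to_seconds(start), _to_seconds(end_stamp), body))
                break
    return cues


def subtitle_at(cues: List[Cue], time_s: float) -> Optional[str]:
    """Return the text of the cue active at time_s, or None if there is none."""
    for start, end, text in cues:
        if start <= time_s < end:
            return text
    return None


# Made-up SRT content; the speaker-label format is an assumption.
sample = """1
00:00:00,500 --> 00:00:02,000
[SPEAKER_00]: Hello there.

2
00:00:02,400 --> 00:00:04,100
[SPEAKER_01]: How are you?
"""
cues = parse_srt(sample)
frame_rate = 25.0  # assumed frames per second
active = subtitle_at(cues, 30 / frame_rate)  # subtitle for frame index 30
```

In a frame loop, the frame index divided by the frame rate gives the time to pass to subtitle_at, and a None result means the frame (or black background) is written without text.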