
Building a Long Audio Transcription Tool with OpenAI’s Whisper API

Build a scalable audio transcription tool that handles large files, ensures accurate timestamps, and tracks progress using OpenAI’s Whisper API.

·Matija Žiberna·
Coding

In this tutorial, we’ll build a robust audio transcription tool that can handle files of any length using OpenAI’s Whisper API. The tool automatically splits large files into chunks, tracks progress, and provides timestamped output.

Source code can be found at the bottom.

What We’ve Built

We’ve created a Python-based transcription tool that solves several common challenges:

  • Handling large audio files (>25MB OpenAI limit)
  • Maintaining correct timestamps across file chunks
  • Resuming interrupted transcriptions
  • Organizing transcribed text into time intervals

Key Features

  • Automatic file splitting
  • Progress tracking and resume capability
  • Timestamped word-level transcription
  • Time-interval grouping of transcriptions
  • Support for multiple audio formats

Step-by-Step Guide

1. Project Setup

This step creates a dedicated folder for your project, which keeps all related files (code, audio, and output) organized. Inside this folder, you'll typically initialize a Python virtual environment. A virtual environment isolates your project's dependencies, preventing conflicts with other Python projects on your system.

First, create a new project directory and set up the environment:

mkdir long-audio-transcriber
cd long-audio-transcriber
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  • mkdir long-audio-transcriber: Creates a directory named long-audio-transcriber.
  • cd long-audio-transcriber: Changes the current directory to the newly created one.
  • python -m venv venv: Creates a virtual environment named venv inside the project directory.
  • source venv/bin/activate: Activates the virtual environment. The command for Windows is slightly different (venv\Scripts\activate). Activating the environment ensures that any packages you install will be specific to this project.

2. Install Dependencies

This step involves installing the Python libraries needed for the project. These libraries provide pre-built functionalities, making development faster and easier.

pip install requests ffmpeg-python python-dotenv

This pip command installs the following libraries (the imports they provide are sketched after this list):

  • requests: Used for making HTTP requests to the OpenAI API.
  • ffmpeg-python: A Python wrapper for ffmpeg, used for audio file splitting. Remember, you need to have ffmpeg itself installed on your system.
  • python-dotenv: For loading environment variables from the .env file.
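For reference, the snippets later in this post rely on a handful of imports at the top of the script. A sketch (trim it to the pieces you actually use):

import os
import json

import requests
import ffmpeg
from dotenv import load_dotenv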

3. Environment Configuration

Create a .env file to store your OpenAI API key:

Environment variables securely store sensitive information, like API keys, outside your code. This prevents accidental exposure of your key.

echo "WHISPER_API_KEY=your-api-key-here" > .env

This command creates a .env file and adds your OpenAI API key to it. Replace "your-api-key-here" with your actual API key.

The echo line above is a single shell command that writes the text to a file named .env. The python-dotenv library will later load this key into your Python script, as shown in the sketch below.
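Here is a minimal sketch of how the key is loaded at the top of the script, assuming the variable name WHISPER_API_KEY from the .env file above:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
API_KEY = os.getenv("WHISPER_API_KEY")
if not API_KEY:
    raise RuntimeError("WHISPER_API_KEY is not set in .env")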

4. Core Components

1. Audio File Splitting Implementation

The OpenAI Whisper API has a file size limit (around 25MB). To handle larger files, the script splits them into smaller, manageable chunks. ffmpeg is chosen for its efficiency and precision in audio processing, minimizing quality loss.

We used ffmpeg to split large audio files into manageable chunks:

def split_audio_file(file_path):
    """Split audio file into chunks smaller than 25MB."""
    # Get audio duration using ffmpeg
    total_duration = get_audio_duration(file_path)
    file_size = os.path.getsize(file_path)

    # Calculate optimal number of chunks
    num_chunks = (file_size / (MAX_SIZE_MB * 1024 * 1024)) + 1
    chunk_duration = total_duration / num_chunks

    chunks = []
    for i in range(int(num_chunks)):
        start_time = i * chunk_duration
        chunk_path = f"temp_chunks/chunk_{i:03d}.wav"

        # Extract chunk using ffmpeg
        stream = ffmpeg.input(file_path, ss=start_time, t=chunk_duration)
        stream = ffmpeg.output(stream, chunk_path, acodec='pcm_s16le')
        ffmpeg.run(stream, overwrite_output=True, quiet=True)

        chunks.append(chunk_path)

    return chunks
Key points:

  • Used ffmpeg for precise audio splitting
  • Maintained PCM WAV format for best quality
  • Calculated chunk size based on file size and duration
  • Preserved timing information for later merging

This split_audio_file function does the following:

  • Gets File Information: It retrieves the audio file's total duration (via get_audio_duration, a helper you'd need to define separately, likely using ffmpeg.probe; a sketch follows this list) and its file size.
  • Calculates Chunks: It determines the number of chunks needed to keep each chunk below MAX_SIZE_MB (which you should define, e.g., MAX_SIZE_MB = 24). It then calculates the duration of each chunk.
  • Splits the Audio: It loops through the calculated number of chunks:
    • start_time: Calculates the starting time for the current chunk.
    • chunk_path: Creates a filename for the chunk (e.g., temp_chunks/chunk_001.wav). You'll need to create the temp_chunks directory beforehand.
    • ffmpeg.input(file_path, ss=start_time, t=chunk_duration): Uses ffmpeg to select a portion of the input audio, starting at start_time and lasting for chunk_duration. ss (seek start) is used for fast and accurate seeking. t specifies the duration.
    • ffmpeg.output(stream, chunk_path, acodec='pcm_s16le'): Specifies the output filename and sets the audio codec to 'pcm_s16le'. This ensures the output is a WAV file with 16-bit PCM encoding, which is lossless and compatible with Whisper.
    • ffmpeg.run(stream, overwrite_output=True, quiet=True): Executes the ffmpeg command. overwrite_output=True allows overwriting existing chunk files, and quiet=True suppresses ffmpeg's console output.
    • chunks.append(chunk_path): Adds the path of the created chunk to a list, which is returned at the end of the function.
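As a reference point, here is a minimal sketch of the get_audio_duration helper and the MAX_SIZE_MB constant mentioned above, using ffmpeg.probe. The constant value is an assumption that leaves headroom below the 25MB limit:

import ffmpeg

MAX_SIZE_MB = 24  # stay safely under OpenAI's 25MB upload limit

def get_audio_duration(file_path):
    """Return the audio duration in seconds using ffmpeg.probe."""
    probe = ffmpeg.probe(file_path)
    return float(probe["format"]["duration"])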

2. OpenAI API Setup

Get API Key:

  • Go to https://platform.openai.com/
  • Sign up/Login
  • Navigate to API Keys section
  • Create new secret key
  • Save key in .env file:
  • WHISPER_API_KEY="your-key-here"

API Configuration:

def transcribe_chunk(chunk_path, is_first_chunk=False):
    """Transcribe a single chunk with timestamps."""
    url = "https://api.openai.com/v1/audio/transcriptions"
    headers = {"Authorization": f"Bearer {API_KEY}"}

    with open(chunk_path, "rb") as audio_file:
        files = {"file": audio_file}
        data = {
            "model": "whisper-1",
            "language": "sl",
            "response_format": "verbose_json",
            "timestamp_granularities[]": "word"
        }

        response = requests.post(url, headers=headers, files=files, data=data)
        response.raise_for_status()

    return response.json()  # parsed verbose_json: text plus word-level timestamps

The transcribe_chunk function does:

  • API Endpoint and Headers: Sets the API URL and creates the authorization headers using your API_KEY (loaded from the environment).
  • Prepares Request Data:
    • with open(chunk_path, "rb") as audio_file:: Opens the audio chunk file in binary read mode ("rb").
    • files = {"file": audio_file}: Prepares the file for upload in the request.
    • data = { ... }: Creates a dictionary containing the request parameters:
    • "model": "whisper-1": Specifies the Whisper model to use.
    • "language": "sl": Sets the language to Slovenian ("sl"). Change this to the correct language code for your audio (e.g., "en" for English).
    • "response_format": "verbose_json": Requests the detailed JSON response format.
    • "timestamp_granularities[]": "word": Requests word-level timestamps.
  • Makes the API Request:
    • response = requests.post(url, headers=headers, files=files, data=data): Sends a POST request to the API with the headers, file, and data.
    • response.raise_for_status(): Checks for HTTP errors (e.g., 400, 500 errors). If an error occurred, this line will raise an exception, stopping the script. This is important for error handling.
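Assuming transcribe_chunk returns the parsed JSON as in the snippet above, a quick way to inspect a single chunk's result (the chunk path here is just an illustrative placeholder):

result = transcribe_chunk("temp_chunks/chunk_000.wav")
print(result["text"])        # full text of this chunk
print(result["words"][:3])   # e.g. [{"word": "example", "start": 0.0, "end": 0.5}, ...]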

3. Progress Tracking System

This system is crucial for handling long audio files and potential interruptions. It allows the script to resume processing from where it left off.

We implemented a robust progress tracking system:

Progress Saving:

def save_progress(chunk_path, transcription):
    """Save progress after each chunk is processed."""
    progress = load_progress()
    progress['processed_chunks'][chunk_path] = transcription
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(progress, f, indent=2)

The save_progress function loads the existing progress, stores the chunk's transcription under its filename, and writes the updated JSON back to disk. In the resulting progress file, the keys are the chunk filenames and the values are dictionaries containing the transcribed text and word-level timestamps.

Progress Loading:

def load_progress():
    """Load progress from file if it exists."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, 'r') as f:
            return json.load(f)
    return {'processed_chunks': {}, 'completed': False}

The load_progress function:

  • Checks for Existing File: if os.path.exists(PROGRESS_FILE): checks whether a progress file exists. Define PROGRESS_FILE = "transcription_progress.json" globally so every part of the tool agrees on the filename.
  • Loads Progress: If the file exists, it opens the file in read mode ('r'), loads the JSON data using json.load(f), and returns the loaded dictionary.
  • Initializes Progress: If the file doesn't exist (first run), it returns a new dictionary with an empty processed_chunks dictionary and completed set to False.

Completion Tracking:

def mark_completed():
    """Mark the transcription as completed."""
    progress = load_progress()
    progress['completed'] = True
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(progress, f, indent=2)

The mark_completed function:

  • Loads Progress: Loads the current progress using load_progress().
  • Marks as Complete: Sets the 'completed' key in the progress dictionary to True.
  • Saves Progress: Writes the updated progress to the progress file, using json.dump() with indent=2 for readability.

4. Timestamp Management

The crucial part was maintaining correct timestamps across chunks:

Since the audio is split into chunks, the timestamps returned by Whisper are relative to the beginning of each chunk. This section shows how to adjust these timestamps to be relative to the beginning of the original audio file.

def merge_transcriptions(transcriptions):
    """Merge multiple transcription chunks."""
    merged_text = ""
    all_words = []
    time_offset = 0

    for trans in transcriptions:
        # Add text
        merged_text += trans.get('text', '') + " "

        # Adjust timestamps for words
        chunk_words = trans.get('words', [])
        for word in chunk_words:
            word['start'] += time_offset
            word['end'] += time_offset
        all_words.extend(chunk_words)

        # Update time offset for next chunk
        if chunk_words:
            time_offset = chunk_words[-1]['end']

    # Return the combined result (assumed shape, matching how merged_result is used below)
    return {'text': merged_text.strip(), 'words': all_words}

The merge_transcriptions function:

  • Initialization: Initializes an empty string merged_text to store the combined text, an empty list all_words to store all the words with adjusted timestamps, and time_offset to 0. time_offset will keep track of the cumulative duration of the processed chunks.
  • Iterates Through Chunks: Loops through the transcription results (transcriptions, a list of dictionaries returned by transcribe_chunk) for each chunk.
  • Merges Text: Appends the transcribed text from the current chunk to merged_text.
  • Adjusts Timestamps:
    • chunk_words = trans.get('words', []): Gets the list of word-level timestamps from the current chunk.
    • for word in chunk_words:: Iterates through each word in the chunk.
    • word['start'] += time_offset: Adds the time_offset to the word's start time.
    • word['end'] += time_offset: Adds the time_offset to the word's end time.
    • all_words.extend(chunk_words): Adds the adjusted words to the all_words list.
  • Updates Time Offset: if chunk_words: time_offset = chunk_words[-1]['end']: If the chunk contained words, updates time_offset to the end time of the last word in the chunk. This ensures that the timestamps for the next chunk are correctly offset. The function then returns the merged text and the adjusted word list (see the worked example below).
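To see how the offset works, here is a tiny worked example with made-up numbers, assuming merge_transcriptions returns a dictionary as in the snippet above:

chunk_1 = {'text': "hello world", 'words': [
    {'word': 'hello', 'start': 0.0, 'end': 0.4},
    {'word': 'world', 'start': 0.5, 'end': 0.9},
]}
chunk_2 = {'text': "again", 'words': [
    {'word': 'again', 'start': 0.1, 'end': 0.6},  # relative to the start of chunk 2
]}

merged = merge_transcriptions([chunk_1, chunk_2])
# time_offset becomes 0.9 after chunk 1, so "again" now spans 1.0-1.5 seconds
# on the original file's timeline, and merged['text'] is "hello world again".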

5. Time Interval Processing

This code takes the merged, timestamp-adjusted words and groups them into user-defined time intervals (e.g., 1-minute intervals). This makes the transcript easier to navigate.

We added time-based grouping of transcriptions:

def parse_transcription(progress_file_path="transcription_progress.json", interval_minutes=1):
    # Load the progress data produced by the main script
    with open(progress_file_path, 'r') as f:
        progress_data = json.load(f)

    # Load all words from chunks
    all_words = []
    time_offset = 0

    # Sort chunks by their number to maintain order
    sorted_chunks = sorted(progress_data['processed_chunks'].items())

    for chunk_path, chunk_data in sorted_chunks:
        if 'words' in chunk_data:
            chunk_words = chunk_data['words']

            # Adjust timestamps for this chunk's words
            for word in chunk_words:
                word['start'] = float(word['start']) + time_offset
                word['end'] = float(word['end']) + time_offset
                all_words.append(word)

            # Update offset for next chunk
            if chunk_words:
                time_offset = all_words[-1]['end']

This function is a trimmed-down version of merge_transcriptions. It performs many of the same actions, except that it produces a list of words instead of the merged text.

The parse_transcription function likely includes additional logic:

  • Loads Progress Data: Loads the progress data from the specified progress_file_path (added at the top of the snippet above).
  • Initializes Variables: Similar to merge_transcriptions, it initializes variables to store words and the time offset.
  • Sorts Chunks: It sorts the chunks by number to maintain their order.
  • Adjusts Timestamps: It adjusts the timestamps of each chunk's words, just as merge_transcriptions does.
  • Groups into Intervals (Missing Logic): The grouping into intervals is the part not shown in the snippet (a sketch follows this list). It would likely involve:
    • Calculating interval boundaries (e.g., 0:00-1:00, 1:00-2:00, etc.).
    • Iterating through all_words and assigning each word to the appropriate interval based on its start time.
    • Building a dictionary or list where keys are the time ranges (e.g., "0:00-1:00") and values are the concatenated text of words within that interval.
    • Returning this dictionary from the function.
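Here is a minimal sketch of that grouping step, using a hypothetical helper. group_words_into_intervals is not part of the original code; it only assumes the word dictionaries produced above, with word, start, and end keys:

def group_words_into_intervals(all_words, interval_minutes=1):
    """Group timestamp-adjusted words into fixed-length time intervals."""
    interval_seconds = interval_minutes * 60
    intervals = {}
    for word in all_words:
        index = int(word['start'] // interval_seconds)
        start, end = index * interval_seconds, (index + 1) * interval_seconds
        key = f"{start // 60}:{start % 60:02d}-{end // 60}:{end % 60:02d}"
        intervals.setdefault(key, {'text': ''})
        intervals[key]['text'] += word['word'] + " "
    return intervals

The returned dictionary maps time ranges such as "0:00-1:00" to the concatenated text of the words that start in that range, which is the shape the output code in the next section expects.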

6. Output Generation

The tool generates three types of output: raw text, timestamped JSON, and time-interval text.

Raw Text:

with open(OUTPUT_PATH_TEXT, "w", encoding="utf-8") as f:
    f.write(merged_result['text'])

Timestamped JSON:

with open(OUTPUT_PATH_JSON, "w", encoding="utf-8") as f:
    json.dump(merged_result, f, ensure_ascii=False, indent=2)

Time-Interval Text:

1with open("transcription_by_intervals.txt", "w", encoding="utf-8") as f:
2 for time_range, data in sorted_intervals.items():
3 f.write(f"\n[{time_range}]\n")
4 f.write(data['text'].strip() + "\n")
5 f.write("-" * 80 + "\n")

7. Error Recovery System

We implemented several error recovery mechanisms:

Chunk Processing Recovery:

if chunk_path in progress['processed_chunks']:
    print(f"Loading previously processed chunk {i+1}/{len(chunk_paths)}")
    transcriptions.append(progress['processed_chunks'][chunk_path])
    continue

Temporary File Management:

def cleanup_temp_directory():
    """Remove temporary chunks only after successful completion."""
    progress = load_progress()
    if progress.get('completed', False):
        if os.path.exists(CHUNK_DIR):
            for file in os.listdir(CHUNK_DIR):
                os.remove(os.path.join(CHUNK_DIR, file))
            os.rmdir(CHUNK_DIR)

This technical breakdown shows how each component works together to create a reliable transcription system that can handle files of any size while maintaining accurate timestamps and providing recovery options.
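To make that flow concrete, here is a hedged sketch of how these pieces might be wired together in main.py. The function names come from the snippets above, while AUDIO_FILE and OUTPUT_PATH_TEXT are illustrative placeholders:

def main():
    chunk_paths = split_audio_file(AUDIO_FILE)             # split into <25MB chunks
    progress = load_progress()                              # resume if interrupted
    transcriptions = []
    for i, chunk_path in enumerate(chunk_paths):
        if chunk_path in progress['processed_chunks']:
            transcriptions.append(progress['processed_chunks'][chunk_path])
            continue
        result = transcribe_chunk(chunk_path, is_first_chunk=(i == 0))
        save_progress(chunk_path, result)
        transcriptions.append(result)
    merged_result = merge_transcriptions(transcriptions)    # stitch timestamps across chunks
    with open(OUTPUT_PATH_TEXT, "w", encoding="utf-8") as f:
        f.write(merged_result['text'])
    mark_completed()
    cleanup_temp_directory()                                # remove temp chunks once done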

5. Using the Tool

1. Prepare Your Audio File

  • Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
  • No size limitation (automatically splits files)

2. Run the Transcription

python main.py

This command starts the transcription process. Make sure you are in the project directory (long-audio-transcriber) and your virtual environment is activated (source venv/bin/activate or venv\Scripts\activate) before running this.

The script will:

  • Split large files if needed
  • Process each chunk
  • Save progress after each chunk
  • Merge results with correct timestamps

3. Process Time Intervals

python process_transcription.py

This step is optional but very useful. It runs a separate script (process_transcription.py, which you'd need to create) that implements the parse_transcription function and the interval grouping logic. This script takes the transcription_progress.json file as input and generates the transcription_by_intervals.txt file.

6. Output Files

The tool generates several output files:

  • transcription.txt: Raw transcription text
  • transcription_timestamps.json: JSON with word-level timestamps
  • transcription_by_intervals.txt: Text grouped by time intervals
  • transcription_progress.json: Progress tracking file

Advanced Features

Progress Tracking

The tool maintains a progress file that allows you to resume interrupted transcriptions:

{
  "processed_chunks": {
    "chunk_001.wav": {
      "text": "transcribed text...",
      "words": [{"word": "example", "start": 0.0, "end": 0.5}]
    }
  }
}

Time Interval Processing

Transcriptions are grouped into configurable time intervals:

intervals = parse_transcription(interval_minutes=5)  # 5-minute intervals

Error Handling

The tool includes robust error handling:

  • Saves progress after each chunk
  • Maintains temporary files for resume capability
  • Validates input files and API responses

Conclusion

This tool makes it practical to transcribe long audio files using OpenAI’s Whisper API. It handles the complexities of file splitting, progress tracking, and timestamp management, allowing you to focus on using the transcriptions rather than managing the technical details.

The complete code is available on GitHub: long-audio-transcriber
