How to Implement MediaRecorder API Audio Recording and Transcription with iPhone Safari Support

Cross-device audio recording with MediaRecorder, correct encoding, and Google Speech-to-Text

By Matija Žiberna

I was building a voice recording feature for a client project when I discovered something frustrating: audio transcription worked perfectly on desktop and Android devices, but consistently failed on iPhones. After diving deep into the MediaRecorder API and Google Speech-to-Text integration, I realized the issue wasn't just a simple bug—it was a fundamental difference in how iPhone Safari handles audio recording.

This guide walks you through building a complete audio recording and transcription system that works seamlessly across all devices, including the tricky iPhone Safari case. By the end, you'll have a robust implementation that properly handles different audio formats and integrates smoothly with Google's Speech-to-Text API.

Understanding the iPhone Safari Challenge

Before jumping into code, it's crucial to understand why iPhone Safari requires special handling. Most browsers support multiple audio formats for MediaRecorder, but iPhone Safari has specific preferences:

  • Desktop Chrome/Firefox: Often defaults to audio/webm or audio/wav
  • Android Chrome: Typically uses audio/webm
  • iPhone Safari: Produces audio/webm;codecs=opus specifically

The problem occurs when you hardcode audio format assumptions. If your system expects WAV files but receives WebM/Opus from iPhone Safari, transcription services like Google Speech-to-Text will reject the audio with encoding errors.
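To make the failure mode concrete, here is a minimal sketch (hypothetical code, not from a real project) of what hardcoding looks like. The bytes in the blob are whatever the device actually produced, but the label you attach drives every downstream decision, including the Speech-to-Text encoding:

```typescript
// Anti-pattern: hardcoding the blob type regardless of what the device recorded.
// Stand-in bytes (WebM magic number) playing the role of iPhone Safari's output.
const recordedChunks = [new Uint8Array([0x1a, 0x45, 0xdf, 0xa3])];

const mislabeled = new Blob(recordedChunks, { type: 'audio/wav' });            // wrong on iPhone Safari
const correctlyLabeled = new Blob(recordedChunks, { type: 'audio/webm;codecs=opus' });

console.log(mislabeled.type);       // 'audio/wav' -> server picks LINEAR16 -> "bad encoding" error
console.log(correctlyLabeled.type); // 'audio/webm;codecs=opus' -> server picks WEBM_OPUS
```

The MIME type is pure metadata: nothing validates it against the actual bytes, so a wrong label travels silently through upload and only surfaces as a transcription error.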

Step 1: Setting Up Smart MediaRecorder Format Detection

The foundation of cross-device compatibility is proper format detection. Instead of assuming a format, we need to detect what each device supports and choose appropriately.

// File: src/components/audio-recorder.tsx
const startRecording = useCallback(async () => {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    
    // Smart format detection with iPhone priority
    let selectedMimeType = 'audio/webm'; // fallback
    const supportedTypes = [
      'audio/webm;codecs=opus',  // iPhone Safari preference
      'audio/webm',
      'audio/mp4',
      'audio/wav'
    ];
    
    for (const type of supportedTypes) {
      if (MediaRecorder.isTypeSupported(type)) {
        selectedMimeType = type;
        break;
      }
    }
    
    console.log(`Selected audio format: ${selectedMimeType}`);
    
    const recorder = new MediaRecorder(stream, {
      mimeType: selectedMimeType
    });
    
    // Store chunks for later blob creation
    const chunks: BlobPart[] = [];
    
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0) {
        chunks.push(event.data);
      }
    };
    
    recorder.onstop = () => {
      // Critical: Use the actual detected MIME type, not a hardcoded one
      const audioBlob = new Blob(chunks, { type: selectedMimeType });
      onRecordingComplete(audioBlob);
    };
    
    recorder.start();
    setMediaRecorder(recorder); // state setter from the component (full version in Step 5)
    
  } catch (error) {
    console.error('Recording failed:', error);
  }
}, [onRecordingComplete]); // re-create if the completion callback changes

This approach prioritizes iPhone Safari's preferred format while maintaining compatibility with other browsers. The key insight is using MediaRecorder.isTypeSupported() to test formats in order of preference, ensuring we get the best format each device can produce.
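The detection loop can also be factored into a pure helper (my naming, not from the article) that takes the support check as a predicate. This makes the logic unit-testable in Node, where `MediaRecorder` doesn't exist:

```typescript
// Returns the first candidate the predicate accepts, or the fallback.
// In the browser, pass MediaRecorder.isTypeSupported as the predicate.
function pickSupportedMimeType(
  candidates: string[],
  isSupported: (type: string) => boolean,
  fallback: string = 'audio/webm'
): string {
  for (const type of candidates) {
    if (isSupported(type)) return type;
  }
  return fallback;
}

// Browser usage:
// const mimeType = pickSupportedMimeType(supportedTypes, MediaRecorder.isTypeSupported);

// In tests, simulate an iPhone-like browser that only accepts WebM/Opus:
const iphoneLike = (t: string) => t === 'audio/webm;codecs=opus';
console.log(pickSupportedMimeType(
  ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4', 'audio/wav'],
  iphoneLike
)); // 'audio/webm;codecs=opus'
```

Injecting the predicate keeps the ordering logic separate from the browser API, which pairs nicely with the device simulation approach in Step 6.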

Step 2: Implementing Proper File Upload with Format Awareness

Once you have an audio blob with the correct MIME type, the upload system needs to handle different formats appropriately. The critical piece is mapping MIME types to correct file extensions.

// File: src/lib/upload-handler.ts
const getCorrectExtension = (mimeType: string): string => {
  const mimeToExt: Record<string, string> = {
    'audio/webm': 'webm',
    'audio/webm;codecs=opus': 'webm',    // iPhone Safari
    'audio/mp4': 'm4a',
    'audio/wav': 'wav',
    'audio/mpeg': 'mp3',
    'audio/flac': 'flac',
    'audio/ogg': 'ogg'
  };
  
  return mimeToExt[mimeType] || 'webm';  // Default to webm for iPhone compatibility
};

export const uploadAudioFile = async (audioBlob: Blob): Promise<string> => {
  const detectedExtension = getCorrectExtension(audioBlob.type);
  const fileName = `${Date.now()}.${detectedExtension}`;
  
  console.log(`Uploading audio: ${audioBlob.type} -> ${fileName}`);
  
  const formData = new FormData();
  formData.append('audio', audioBlob, fileName);
  
  const response = await fetch('/api/upload/audio', {
    method: 'POST',
    body: formData,
  });
  
  if (!response.ok) {
    throw new Error('Upload failed');
  }
  
  const { fileUrl } = await response.json();
  return fileUrl;
};

The extension mapping is crucial because the transcription step (Step 3) infers the Speech-to-Text encoding from the file extension. When iPhone Safari produces WebM/Opus audio, it needs to be saved with a .webm extension, not .wav.
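A slightly more defensive variant (an alternative sketch, not the article's exact helper) strips codec parameters before the lookup, so any `<base-type>;codecs=...` variant maps correctly without listing each one explicitly:

```typescript
const mimeToExtension: Record<string, string> = {
  'audio/webm': 'webm',
  'audio/mp4': 'm4a',
  'audio/wav': 'wav',
  'audio/mpeg': 'mp3',
  'audio/flac': 'flac',
  'audio/ogg': 'ogg',
};

function extensionForMimeType(mimeType: string): string {
  // 'audio/webm;codecs=opus' -> 'audio/webm'
  const baseType = mimeType.split(';')[0].trim().toLowerCase();
  return mimeToExtension[baseType] ?? 'webm'; // webm default keeps iPhone recordings usable
}

console.log(extensionForMimeType('audio/webm;codecs=opus')); // 'webm'
console.log(extensionForMimeType('audio/ogg;codecs=opus'));  // 'ogg'
```

Normalizing the base type first means the table only needs one entry per container, and future codec variations don't require new entries.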

Step 3: Configuring Google Speech-to-Text API with Dynamic Encoding

The most critical part of the implementation is configuring the Google Speech API with the correct encoding based on your uploaded audio format. Missing or incorrect encoding parameters cause the "bad encoding" errors.

// File: src/lib/google-speech.ts
import { SpeechClient } from '@google-cloud/speech';

const detectEncodingFromFile = (fileName: string): string => {
  const extension = fileName.split('.').pop()?.toLowerCase();
  
  switch (extension) {
    case 'webm': return 'WEBM_OPUS';    // iPhone Safari files
    case 'wav': return 'LINEAR16';
    case 'mp3': return 'MP3';
    case 'm4a':
    case 'mp4': return 'MP3';  // Note: Speech-to-Text has no AAC encoding; MP4/AAC audio may need transcoding first
    default: return 'WEBM_OPUS';        // Safe default for iPhone
  }
};

const getModelForLanguage = (languageCode: string): string | undefined => {
  // The latest_long model is only available for certain languages
  const enhancedModelLanguages = ['en-US', 'en-GB'];
  return enhancedModelLanguages.includes(languageCode) ? 'latest_long' : undefined;
};

export const transcribeAudio = async (
  gcsUri: string, 
  languageCode: string = 'en-US'
): Promise<string> => {
  const client = new SpeechClient();
  
  // Extract filename from GCS URI to detect encoding
  const fileName = gcsUri.split('/').pop() || '';
  const detectedEncoding = detectEncodingFromFile(fileName);
  const model = getModelForLanguage(languageCode);
  
  console.log(`Transcribing audio: ${fileName}`);
  console.log(`Detected encoding: ${detectedEncoding}`);
  console.log(`Language: ${languageCode}, Model: ${model || 'default'}`);
  
  const request = {
    config: {
      languageCode,
      encoding: detectedEncoding, // This is critical for iPhone compatibility
      enableAutomaticPunctuation: true,
      ...(model && { model }) // Only include model if supported
    },
    audio: { uri: gcsUri }
  };
  
  try {
    // Note: recognize() handles audio up to about one minute;
    // use client.longRunningRecognize() for longer recordings
    const [response] = await client.recognize(request);
    const transcription = response.results
      ?.map(result => result.alternatives?.[0]?.transcript)
      .filter(Boolean)
      .join(' ') || '';
      
    return transcription;
  } catch (error) {
    console.error('Transcription failed:', error);
    throw new Error('Failed to transcribe audio');
  }
};

The encoding detection is the heart of iPhone compatibility. When the API receives a .webm file, it knows to expect WEBM_OPUS encoding rather than trying to process it as LINEAR16 (which would cause encoding errors).

Step 4: Adding Robust Audio Validation

Both client and server-side validation need to handle the variety of MIME types that different devices produce, including codec specifications.

// File: src/lib/audio-validation.ts
const ALLOWED_AUDIO_TYPES = [
  'audio/wav',
  'audio/mpeg', 
  'audio/mp4',
  'audio/webm',
  'application/octet-stream' // Fallback for some uploads
];

export const validateAudioFile = (file: Blob | File): boolean => {
  if (!file.type) {
    console.warn('File has no MIME type, allowing as fallback');
    return true; // Allow files without MIME type
  }
  
  // Use prefix matching to handle codec specifications
  // This accepts "audio/webm;codecs=opus" when "audio/webm" is allowed
  const isAllowed = ALLOWED_AUDIO_TYPES.some(allowedType => 
    file.type.startsWith(allowedType)
  );
  
  if (!isAllowed) {
    console.error(`Unsupported audio type: ${file.type}`);
  }
  
  return isAllowed;
};

// File: src/app/api/upload/audio/route.ts
import { NextResponse } from 'next/server';
import { validateAudioFile } from '@/lib/audio-validation';
// uploadToStorage is your storage helper (Google Cloud Storage, S3, etc.)

export async function POST(request: Request) {
  const formData = await request.formData();
  const audioFile = formData.get('audio') as File;
  
  if (!audioFile) {
    return NextResponse.json({ error: 'No audio file provided' }, { status: 400 });
  }
  
  // Server-side validation with same logic
  if (!validateAudioFile(audioFile)) {
    return NextResponse.json({ error: 'Invalid audio format' }, { status: 400 });
  }
  
  // Upload to your storage service (Google Cloud Storage, S3, etc.)
  const fileUrl = await uploadToStorage(audioFile);
  
  return NextResponse.json({ fileUrl });
}

The prefix matching approach is essential because iPhone Safari sends audio/webm;codecs=opus, but your allowed types list contains audio/webm. Exact string matching would reject this perfectly valid format.
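A quick self-contained sketch of the difference:

```typescript
// Exact matching rejects iPhone Safari's type; prefix matching accepts it.
const iphoneType = 'audio/webm;codecs=opus';
const allowed = ['audio/wav', 'audio/mpeg', 'audio/mp4', 'audio/webm'];

const exactMatch = allowed.includes(iphoneType);                 // false
const prefixMatch = allowed.some(t => iphoneType.startsWith(t)); // true

console.log({ exactMatch, prefixMatch }); // { exactMatch: false, prefixMatch: true }
```

One caveat worth knowing: prefix matching would also accept a hypothetical `audio/webm-other` type. A stricter variant compares the parsed base type (`iphoneType.split(';')[0]`) for equality, which keeps codec tolerance without the loose prefix.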

Step 5: Building a Complete Recording Component

Here's how all the pieces fit together in a complete React component:

// File: src/components/voice-recorder.tsx
import { useState, useRef, useCallback } from 'react';
import { uploadAudioFile } from '@/lib/upload-handler';
import { transcribeAudio } from '@/lib/google-speech';
import { validateAudioFile } from '@/lib/audio-validation';

export const VoiceRecorder = () => {
  const [isRecording, setIsRecording] = useState(false);
  const [isProcessing, setIsProcessing] = useState(false);
  const [transcription, setTranscription] = useState('');
  const mediaRecorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<BlobPart[]>([]);

  const startRecording = useCallback(async () => {
    try {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      
      // Format detection logic from Step 1
      let selectedMimeType = 'audio/webm';
      const supportedTypes = [
        'audio/webm;codecs=opus',
        'audio/webm',
        'audio/mp4',
        'audio/wav'
      ];
      
      for (const type of supportedTypes) {
        if (MediaRecorder.isTypeSupported(type)) {
          selectedMimeType = type;
          break;
        }
      }
      
      const recorder = new MediaRecorder(stream, { mimeType: selectedMimeType });
      chunksRef.current = [];
      
      recorder.ondataavailable = (event) => {
        if (event.data.size > 0) {
          chunksRef.current.push(event.data);
        }
      };
      
      recorder.onstop = async () => {
        const audioBlob = new Blob(chunksRef.current, { type: selectedMimeType });
        await processRecording(audioBlob);
      };
      
      recorder.start();
      mediaRecorderRef.current = recorder;
      setIsRecording(true);
      
    } catch (error) {
      console.error('Failed to start recording:', error);
    }
  }, []);

  const stopRecording = useCallback(() => {
    if (mediaRecorderRef.current && isRecording) {
      mediaRecorderRef.current.stop();
      mediaRecorderRef.current.stream.getTracks().forEach(track => track.stop());
      setIsRecording(false);
    }
  }, [isRecording]);

  const processRecording = async (audioBlob: Blob) => {
    setIsProcessing(true);
    
    try {
      // Validate the audio file
      if (!validateAudioFile(audioBlob)) {
        throw new Error('Invalid audio format');
      }
      
      // Upload the audio file
      const fileUrl = await uploadAudioFile(audioBlob);

      // Transcribe using Google Speech API
      // (route this through an API endpoint in production; SpeechClient
      // needs server-side credentials and cannot run in the browser)
      const result = await transcribeAudio(fileUrl);
      setTranscription(result);
      
    } catch (error) {
      console.error('Processing failed:', error);
    } finally {
      setIsProcessing(false);
    }
  };

  return (
    <div className="p-4">
      <div className="flex gap-4 mb-4">
        <button
          onClick={isRecording ? stopRecording : startRecording}
          disabled={isProcessing}
          className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
        >
          {isRecording ? 'Stop Recording' : 'Start Recording'}
        </button>
      </div>
      
      {isProcessing && (
        <div className="text-gray-600">Processing audio...</div>
      )}
      
      {transcription && (
        <div className="mt-4 p-3 bg-gray-100 rounded">
          <h3 className="font-semibold mb-2">Transcription:</h3>
          <p>{transcription}</p>
        </div>
      )}
    </div>
  );
};

Step 6: Testing Across Devices

Since testing on actual iPhones during development isn't always practical, you can implement device simulation for testing different format scenarios:

// File: src/components/device-simulator.tsx (development only)
type DevicePreset = {
  name: string;
  description: string;
  forceFormat: string;
};

const DEVICE_PRESETS: Record<string, DevicePreset> = {
  'iphone-safari': {
    name: 'iPhone Safari',
    description: 'WebM/Opus',
    forceFormat: 'audio/webm;codecs=opus'
  },
  'android-chrome': {
    name: 'Android Chrome',
    description: 'WebM',
    forceFormat: 'audio/webm'
  },
  'desktop-chrome': {
    name: 'Desktop Chrome',
    description: 'WAV',
    forceFormat: 'audio/wav'
  }
};

// In your recording component, add development-only simulation
// (selectedSimulation is component state, e.g. driven by a <select> input)
const startRecording = useCallback(async () => {
  // ... existing code ...
  
  // Development simulation (only show on localhost)
  if (process.env.NODE_ENV === 'development' && selectedSimulation) {
    const preset = DEVICE_PRESETS[selectedSimulation];
    if (preset && MediaRecorder.isTypeSupported(preset.forceFormat)) {
      selectedMimeType = preset.forceFormat;
      console.log(`🧪 SIMULATING ${preset.name}: ${selectedMimeType}`);
    }
  }
  
  // ... rest of recording code ...
}, [selectedSimulation]);

This simulation approach lets you test iPhone Safari behavior on your development machine, ensuring your format detection and encoding logic work correctly before deploying.

Monitoring and Debugging

Add comprehensive logging to troubleshoot issues across different devices:

// Enhanced logging throughout your implementation
console.log(`[MediaRecorder] Detected format: ${selectedMimeType}`);
console.log(`[Upload] File: ${fileName}, Size: ${audioBlob.size} bytes`);
console.log(`[Speech API] Encoding: ${detectedEncoding}, Language: ${languageCode}`);

This logging helps identify exactly where format mismatches occur and ensures each step of the pipeline handles device-specific formats correctly.

Conclusion

Building audio recording that works seamlessly across all devices, especially iPhone Safari, requires understanding the nuances of how different browsers handle MediaRecorder formats. The key is building a system that detects and adapts to each device's preferred format rather than making assumptions.

The implementation covers the complete pipeline: smart format detection in MediaRecorder, proper file extensions during upload, correct encoding configuration for Google Speech API, and robust validation that handles codec specifications. With device simulation for development testing, you can ensure your implementation works across all target devices.

This approach gives you a production-ready audio recording and transcription system that handles the tricky iPhone Safari case while maintaining compatibility with all other browsers. Let me know in the comments if you have questions, and subscribe for more practical development guides.

Thanks, Matija
