Extracting Audio From a Video with the Web Audio API

March 9, 2018

For the past few days, I have been toying with the idea of building a site like GoSubtitle, where you could upload video files and automatically generate subtitles in many languages. There would be no point in doing so unless the tool had a better UI, worked faster, and gave more precise results than GoSubtitle.

When I reviewed Google's Cloud Speech API, I found, as I had expected, that an audio file alone is sufficient to produce the required text. Extracting the audio from the user's video file on the client side and uploading only the extracted audio data to our backend would be far more efficient. But is there an API for doing this on the client side, and is it supported by modern browsers?

After a little research, I came across the Web Audio API: a JavaScript API, specified by the W3C Audio Working Group, that enables client-side processing of audio sources. It is mainly used for processing and synthesizing sounds.

My first experiments with the commonly used methods failed, because those methods process sound that is currently playing, which was not what I was trying to do.

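// Wait for window.onload to fire. See crbug.com/112368
window.addEventListener('load', function (e) {
    var audioContext = new (window.AudioContext || window.webkitAudioContext)();
    // Our <video> element will be the audio soundSource.
    var videoElement = document.getElementById('myVideo');
    var soundSource = audioContext.createMediaElementSource(videoElement);
    var analyserNode = audioContext.createAnalyser();
    // Route the sound through the analyser and on to the speakers.
    soundSource.connect(analyserNode);
    analyserNode.connect(audioContext.destination);

    var totalSample = new Float32Array(0);
    window.setInterval(function () {
        var sample = new Float32Array(analyserNode.frequencyBinCount);
        analyserNode.getFloatFrequencyData(sample);  // getByteFrequencyData for more performance, less precision.
        totalSample = Float32Concat(totalSample, sample);  // Float32Concat: helper that joins two Float32Arrays.
    }, 1000);
}, false);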
Gathering audio data from currently playing video element
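
The snippet above calls Float32Concat without showing it. A minimal sketch of what such a helper might look like follows; the implementation is an assumption, since the original helper is not shown, and it simply copies both inputs into a new Float32Array.

```javascript
// Hypothetical implementation of the Float32Concat helper used above.
// It returns a new Float32Array holding `first` followed by `second`.
function Float32Concat(first, second) {
  var result = new Float32Array(first.length + second.length);
  result.set(first, 0);             // copy the first array to the start
  result.set(second, first.length); // copy the second array right after it
  return result;
}

var joined = Float32Concat(new Float32Array([1, 2]), new Float32Array([3]));
console.log(joined.length); // 3
```

Because every call allocates a new array and copies everything collected so far, this approach gets slower as the recording grows, which is one more reason the real-time route is a poor fit here.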

I was sure this wasn't what I was looking for: playing the video file uploaded by the user and collecting audio data while it plays would be even more inefficient than uploading the entire video to the backend.

After that, I was able to decode the audio data by reading the video file into an ArrayBuffer and then processing the buffer with the decodeAudioData method of the AudioContext class.

var reader = new FileReader();
var audioContext = new (window.AudioContext || window.webkitAudioContext)();
reader.onload = function () {
  var videoFileAsBuffer = reader.result; // ArrayBuffer
  audioContext.decodeAudioData(videoFileAsBuffer).then(function (decodedAudioData) {
    console.log(decodedAudioData); // AudioBuffer
  });
};
reader.readAsArrayBuffer(blob); // video file as blob
Decoding the audio data of a video file

However, the decoded raw audio data was even larger than the original video file. In that case, it would make no sense to extract the audio data from the video file on the client side. Fortunately, not all of the sound processing methods in the Web Audio API are meant to run in real time.
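
To see why decoded audio can be so large, a rough back-of-the-envelope estimate helps. The sketch below assumes uncompressed PCM, where size is frames times channels times bytes per sample; the function name estimatePcmBytes and the example figures (a 60-second clip, 44.1 kHz stereo, 4 bytes per sample for the 32-bit floats an AudioBuffer exposes) are illustrative assumptions, not measurements from my tests.

```javascript
// Hypothetical helper: estimates the size in bytes of uncompressed PCM audio.
function estimatePcmBytes(durationSeconds, sampleRate, numberOfChannels, bytesPerSample) {
  var frames = Math.ceil(durationSeconds * sampleRate); // one frame = one sample per channel
  return frames * numberOfChannels * bytesPerSample;
}

// One minute decoded at 44.1 kHz, stereo, 32-bit float samples.
var decodedBytes = estimatePcmBytes(60, 44100, 2, 4);
// The same minute at 16 kHz, mono, 16-bit integer samples.
var reducedBytes = estimatePcmBytes(60, 16000, 1, 2);

console.log(decodedBytes); // 21168000
console.log(reducedBytes); // 1920000
```

Under these assumptions, lowering the sample rate, channel count, and bit depth shrinks the same minute of audio by roughly a factor of eleven.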

With the help of the OfflineAudioContext class, it is possible to process and synthesize audio as fast as the machine allows rather than in real time. For instance, if you want to process audio data on the client side without playing it, and then offer the result through a download button, this can all be done with OfflineAudioContext.

I used the class to re-render the decoded sound data with different values (for example, fixing the sample rate to 16 kHz). Thanks to this, I was able to keep all the sound data at the same quality standard before sending it to the backend.

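var reader = new FileReader();
var audioContext = new (window.AudioContext || window.webkitAudioContext)();
var targetSampleRate = 16000; // 16 kHz
reader.onload = function () {
  var videoFileAsBuffer = reader.result; // ArrayBuffer
  audioContext.decodeAudioData(videoFileAsBuffer).then(function (decodedAudioData) {
    // The duration is only known after decoding, so create the offline context here.
    var numberOfChannels = decodedAudioData.numberOfChannels;
    var length = Math.ceil(decodedAudioData.duration * targetSampleRate); // in sample frames
    var offlineAudioContext = new OfflineAudioContext(numberOfChannels, length, targetSampleRate);
    var soundSource = offlineAudioContext.createBufferSource();
    soundSource.buffer = decodedAudioData; // AudioBuffer
    soundSource.connect(offlineAudioContext.destination);
    soundSource.start();
    return offlineAudioContext.startRendering();
  }).then(function (renderedBuffer) {
    console.log(renderedBuffer); // Outputs a new AudioBuffer by re-rendering.
  }).catch(function (err) {
    console.log('Decoding or rendering failed: ' + err);
  });
};
reader.readAsArrayBuffer(blob); // video file as blob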
Re-rendering the decoded sound data with different values
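
Before the re-rendered data is sent to the backend, the float samples could be packed into 16-bit integers to halve their size. The following is only a sketch of one common way to do that; the function name floatTo16BitPCM is an assumption for illustration, and nothing here is specific to what the Cloud Speech API requires.

```javascript
// Hypothetical helper: converts float samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(samples) {
  var output = new Int16Array(samples.length);
  for (var i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range values do not wrap around.
    var s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    output[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return output;
}

var pcm = floatTo16BitPCM(new Float32Array([-1, 0, 1, 2]));
console.log(pcm[0], pcm[1], pcm[2], pcm[3]); // -32768 0 32767 32767
```

In the browser, the input would come from something like renderedBuffer.getChannelData(0), where renderedBuffer is the AudioBuffer produced by startRendering above.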