Node.js - Google Speech-to-Text

Speech recognition is the process of transcribing an audio source into text. Sometimes you need an automated way to convert an audio file into text. Several services provide speech-to-text recognition; one of them is offered by Google as part of its cloud platform. While in another tutorial I wrote about using Google Text-to-Speech in Node.js, this tutorial is the opposite: I'm going to show you how to use the Google Speech-to-Text API to transcribe an audio file into text, also in Node.js.

Preparation

1. Create or select a Google Cloud project

A Google Cloud project is required to use this service. Open the Google Cloud console, then create a new project or select an existing one.

2. Enable billing for the project

Like other cloud platforms, Google requires you to enable billing for your project. If you haven't set up billing yet, open the billing page.

3. Enable Google Speech API

To use an API, you must enable it first. Open this page to enable the Speech API.

4. Set up service account for authentication

For authentication, you need a service account. Create a new one on the service account management page and download its credentials, or reuse a service account you have already created.

In your .env file, add a new variable pointing at the downloaded credentials file:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/the/credentials

The .env file must be loaded at runtime, of course, so you need a module that reads .env files, such as dotenv.

Dependencies

This tutorial uses @google-cloud/speech. @google-cloud/storage is also required for uploading large audio files, and dotenv for loading the credentials path. Add the following dependencies to your package.json and run npm install:

  "@google-cloud/speech": "~2.0.0"
  "@google-cloud/storage": "~1.7.0"
  "dotenv": "~4.0.0"
  "lodash": "~4.17.10"

Supported Audio Encodings

Not all audio encodings are supported by Google Speech. Below is the list of supported audio encodings:

  • LINEAR16
  • FLAC
  • MULAW
  • AMR
  • AMR_WB
  • OGG_OPUS
  • SPEEX_WITH_HEADER_BYTE

For best results, the audio source should use a lossless encoding (FLAC or LINEAR16). If the audio source uses a lossy codec (including those on the list above other than the two recommended formats), recognition accuracy may be reduced.
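Before sending a request, it can be useful to sanity-check the encoding you plan to use against the list above. The helper below is a minimal sketch of my own (the constant and function names are not part of the client library):

```javascript
// Encodings supported by Google Speech, as listed above.
const SUPPORTED_ENCODINGS = new Set([
  'LINEAR16', 'FLAC', 'MULAW', 'AMR', 'AMR_WB',
  'OGG_OPUS', 'SPEEX_WITH_HEADER_BYTE',
]);

// The two lossless encodings recommended for best accuracy.
const LOSSLESS_ENCODINGS = new Set(['LINEAR16', 'FLAC']);

// Classifies an encoding name as 'lossless', 'lossy', or 'unsupported'.
function classifyEncoding(encoding) {
  if (LOSSLESS_ENCODINGS.has(encoding)) return 'lossless';
  if (SUPPORTED_ENCODINGS.has(encoding)) return 'lossy';
  return 'unsupported';
}

console.log(classifyEncoding('FLAC'));     // → lossless
console.log(classifyEncoding('OGG_OPUS')); // → lossy
console.log(classifyEncoding('MP3'));      // → unsupported
```

Failing fast on an unsupported encoding (such as MP3) saves a round trip that would only come back as an API error.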

1. Sync Recognize

If the audio file you want to transcribe is shorter than about 1 minute, you can use synchronous recognition. You'll get the result directly in the response.

  require('dotenv').config();
  
  const _ = require('lodash');
  const speech = require('@google-cloud/speech');
  const fs = require('fs');
  
  // Creates a client
  const speechClient = new speech.SpeechClient();
  
  // The path to the audio file to transcribe
  const filePath = 'input.wav';
  
  // Reads a local audio file and converts it to base64
  const file = fs.readFileSync(filePath);
  const audioBytes = file.toString('base64');
  const audio = {
    content: audioBytes,
  };
  
  // The audio file's encoding, sample rate in hertz, and BCP-47 language code
  const config = {
    encoding: 'LINEAR16',
    sampleRateHertz: 24000,
    languageCode: 'en-US',
  };
  
  const request = {
    audio,
    config,
  };

  // Detects speech in the audio file
  speechClient
    .recognize(request)
    .then((data) => {
      const results = _.get(data[0], 'results', []);
      const transcription = results
        .map(result => result.alternatives[0].transcript)
        .join('\n');
      console.log(`Transcription: ${transcription}`);
    })
    .catch(err => {
      console.error('ERROR:', err);
    });
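The response-parsing step above is plain data manipulation, so it can be pulled into a small pure function and checked against a mock response. This is a sketch of my own (the helper name is not part of the library, but the `results`/`alternatives` shape matches the response used above):

```javascript
// Joins the top alternative of each result into a single transcript,
// one result per line — the same logic as in the .then() handler above.
function extractTranscription(response) {
  const results = (response && response.results) || [];
  return results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
}

// A mock response shaped like the API's first response element:
const mock = {
  results: [
    { alternatives: [{ transcript: 'hello world' }] },
    { alternatives: [{ transcript: 'how are you' }] },
  ],
};

console.log(extractTranscription(mock)); // prints the two lines joined by '\n'
console.log(extractTranscription({}));   // → '' (no results)
```

Keeping this logic separate from the API call makes it easy to unit-test without credentials or network access.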
  

2. Long Running Recognize (Async Recognize)

If the duration of the audio file is longer than 1 minute, you have to use asynchronous recognition, which has a limit of about 180 minutes. The file must be uploaded to Google Cloud Storage first. If you haven't used Google Cloud Storage before, you can read this tutorial first. Then refer to the uploaded file with a special URI of the form gs://{bucket-name}/{file-name}. You'll get a Promise representing the final result of the job.

  require('dotenv').config();
  
  const _ = require('lodash');
  const speech = require('@google-cloud/speech');
  const cloudStorage = require('@google-cloud/storage');
  const path = require('path');
  
  const speechClient = new speech.SpeechClient();
  
  // The path to the audio file to transcribe
  const filePath = 'input.wav';
  
  // Google Cloud storage
  const bucketName = 'gcs-demo-bucket'; // Must exist in your Cloud Storage
  
  const uploadToGcs = async () => {
    const storage = cloudStorage({
      projectId: process.env.GOOGLE_CLOUD_PROJECT_ID,
    });
  
    const bucket = storage.bucket(bucketName);
    const fileName = path.basename(filePath);
  
    await bucket.upload(filePath);

    return `gs://${bucketName}/${fileName}`;
  };
  
  // Upload to Cloud Storage first, then detects speech in the audio file
  uploadToGcs()
    .then(async (gcsUri) => {
      const audio = {
        uri: gcsUri,
      };
  
      const config = {
        encoding: 'LINEAR16',
        sampleRateHertz: 24000,
        languageCode: 'en-US',
      };
  
      const request = {
        audio,
        config,
      };
  
      // Return the chain so errors propagate to the outer .catch()
      return speechClient.longRunningRecognize(request)
        .then((data) => {
          const operation = data[0];
  
          // The following Promise represents the final result of the job
          return operation.promise();
        })
        .then((data) => {
          const results = _.get(data[0], 'results', []);
          const transcription = results
            .map(result => result.alternatives[0].transcript)
            .join('\n');
          console.log(`Transcription: ${transcription}`);
        })
    })
    .catch(err => {
      console.error('ERROR:', err);
    });

That's all about how to transcribe an audio source using the Google Speech API in Node.js.