
In this article, we will discuss how to use Python with the basic tiers of Google Cloud's Speech-to-Text API and Cloud Translation API to achieve real-time speech translation.

This demo captures audio directly from the computer's microphone, uses the Speech-to-Text API to transcribe the audio into text in real time, and then translates that text into a target language with the Cloud Translation API. The target language is configurable; in this demo we translate from English to Chinese. The sections below walk through the essential functionality of the Speech-to-Text and Translation APIs.

𝟭 Speech-to-Text API:

The basic version of this API is built around three main structures: RecognizeRequest, RecognizeResponse, and the streaming pair StreamingRecognizeRequest/StreamingRecognizeResponse. A RecognizeRequest sends settings and audio data to the API for processing. The settings can be customized for different sources, such as transcribing speech, video soundtracks (e.g., MP4), or audio files in formats like MP3 and WAV, so optimal results can be achieved by configuring the request for your use case. After a successful API call, a response is returned, which includes the transcript and a confidence score.
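
For illustration, here is a minimal sketch of such a one-shot round trip with the Python client. The file name and audio settings are placeholder assumptions, not values from the demo:

```python
from google.cloud import speech

client = speech.SpeechClient()

# read a local WAV file; LINEAR16 audio at 16 kHz is assumed here
with open("audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# send the RecognizeRequest and read transcript + confidence from the response
response = client.recognize(config=config, audio=audio)
for result in response.results:
    best = result.alternatives[0]
    print(best.transcript, best.confidence)
```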

In this article, we will mainly focus on the streaming structures, StreamingRecognizeRequest and StreamingRecognizeResponse, as they are specifically designed for real-time, live-streaming applications.

Main Configuration Parameters

There are numerous parameters outlined in the official documentation, but this section will highlight five parameters that I consider crucial:

  • encoding
  • sample_rate_hertz
  • language_code
  • model
  • profanity_filter

encoding specifies the audio encoding of the data we upload to the API. sample_rate_hertz sets the sampling rate of the audio captured by our microphone. language_code selects the spoken language, such as Chinese, English, or Spanish. model selects the recognition model; different models suit different inputs and may produce different results, for example longer conversations, short voice commands, or speech extracted from phone calls. profanity_filter is an interesting parameter that filters out sensitive terms, such as profanity and discriminatory language.
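
As a sketch, here is how these five parameters might look together in a RecognitionConfig. The values are illustrative choices, not recommendations:

```python
# illustrative values only; pick the model that matches your audio source
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # encoding
    sample_rate_hertz=16000,         # microphone sampling rate
    language_code="en-US",           # spoken language
    model="phone_call",              # or "command_and_search", "video", ...
    profanity_filter=True,           # mask sensitive terms in the transcript
)
```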

Core Settings of Request:

Single Utterance: This setting detects voice input and automatically ends the request once no further input is detected.

It is typically used for voice commands and is not suitable for live streaming, where the speaker talks continuously. We therefore set it to False so the request does not end whenever the speaker pauses.
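
In the streaming configuration this is a single flag. A minimal sketch, assuming config is the RecognitionConfig from above:

```python
# keep the stream open through pauses: single_utterance=False
streaming_config = speech.StreamingRecognitionConfig(
    config=config,
    single_utterance=False,
)
```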

Interim Results: While the user is speaking, the API returns temporary transcripts that are refined into a complete sentence as more audio is processed. To distinguish the two, each result carries an "is_final" flag.

When "is_final" is False, the result is a temporary hypothesis; when "is_final" is True, it is the final result for that utterance.
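
A minimal sketch of checking the flag while consuming streaming responses (here responses is assumed to come from client.streaming_recognize):

```python
for response in responses:
    for result in response.results:
        transcript = result.alternatives[0].transcript
        if result.is_final:
            print("final:", transcript)    # finished sentence
        else:
            print("interim:", transcript)  # temporary hypothesis
```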

𝟮 Cloud Translation API

The basic version of the Cloud Translation API uses Neural Machine Translation (NMT) models for translation. The advanced version is suited to larger content such as entire documents, offering batch translation, model customization, and custom glossaries for proprietary terms. There is also the Media Translation API, which translates live speech directly; however, it is still in beta and has not been officially launched, and the user experience is not yet optimal: it currently only supports English-to-Chinese translation and waits a long time after speech ends before sending the final flag.
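
As a quick sketch of the basic edition, the Python client can be used like this. The translator name matches the one used in the demo later on:

```python
from google.cloud import translate_v2 as translate

# basic edition client backed by the NMT model
translator = translate.Client()
result = translator.translate("Hello, world!", target_language="zh-TW", model="nmt")
print(result["translatedText"])
```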

In the following paragraphs, I will share the difficulties I encountered while learning these APIs, which others may run into as well. First, two factors affect the latency and therefore our translation speed:

🦥 Latency Issue:

For the sampling rate, Google recommends 16000 Hz for the best results, or matching the native sample rate of the audio source. The stream divides the speech into frames and sends them with each request. The frame size affects latency: the larger the frames, the greater the latency. Google recommends a frame size of 100 milliseconds, as demonstrated in the demo below.
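
In code, the 100-millisecond recommendation simply ties the chunk size to the sampling rate. A sketch with assumed constant names:

```python
SAMPLE_RATE = 16000                 # 16 kHz, as recommended
CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100 ms of audio = 1600 samples per frame
```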

Slow Refresh Rate Issue:
Since interim results are only temporary, the final result is delivered once the sentence is complete. When the speaker stops talking, the "is_final" flag is automatically set to true, and the maximum wait for the final result does not exceed one second.

Limitations of Streaming Requests:
For content sent inline, a single request is limited to a maximum of 10 MB, which covers audio files, live streams, videos, and more. Files stored in Cloud Storage, however, are not subject to this restriction. Streaming requests are additionally limited to a duration of five minutes. To work around this, we can open a new request just before the five-minute mark, as shown below.
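
In the demo this workaround is driven by a single constant; the value below matches Google's official infinite-streaming sample, which restarts after four minutes to leave a safety margin:

```python
STREAMING_LIMIT = 240000  # restart after 4 minutes (ms), before the 5-minute cap
```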

DEMO

In this demo, we create a recognition configuration with three fundamental parameters: encoding, sample rate in hertz, and language. Then we wrap it in the streaming configuration, as shown below.

```python
# configure the RecognitionConfig
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=SAMPLE_RATE,
    language_code="en-US",
)

# pass the RecognitionConfig into a StreamingRecognitionConfig
streaming_config = speech.StreamingRecognitionConfig(
    config=config, interim_results=True
)
```

Next, we generate an audio stream with the mic manager, which produces audio frames. Once we have audio segments, we package the speech content into requests and send them, together with the configuration, to the API. The API then returns a stream of responses.

```python
with mic_manager as stream:

    while not stream.closed:
        sys.stdout.write(YELLOW)
        sys.stdout.write(
            "\n" + str(STREAMING_LIMIT * stream.restart_counter) + ": NEW REQUEST\n"
        )

        stream.audio_input = []
        audio_generator = stream.generator()  # receive audio content

        # package the audio contents into requests
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        # call the API and receive streaming responses
        responses = client.streaming_recognize(streaming_config, requests)

        # process the received responses for output
        listen_print_loop(responses, stream)
```

The purpose of listen_print_loop is to determine the format of the output. In this loop, we translate the text in the response into Chinese.
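
As a simplified skeleton (not the full function), listen_print_loop might look like this, with the translation step slotting in where the final result is handled:

```python
def listen_print_loop(responses, stream):
    # simplified skeleton; the real function also tracks result timestamps
    for response in responses:
        if not response.results:
            continue
        result = response.results[0]
        transcript = result.alternatives[0].transcript
        if result.is_final:
            # the translation step shown below runs here
            sys.stdout.write(transcript + "\n")
            stream.last_transcript_was_final = True
        else:
            sys.stdout.write(transcript + "\r")
            stream.last_transcript_was_final = False
```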

Finally, we add the translation step and set the target language to zh-TW, using the Neural Machine Translation (NMT) model, as shown below.

```python
# translate the transcript into Traditional Chinese using Cloud Translation
# Basic edition
if isinstance(transcript, bytes):
    transcript = transcript.decode("utf-8")

translation = translator.translate(transcript, target_language='zh-TW', model="nmt")
transcript = translation['translatedText']
```

To overcome the five-minute streaming limit, we subtract the stream's start time from the current time; if the difference exceeds the streaming limit, the stream is cut off, as shown below.

```python
if get_current_time() - stream.start_time > STREAMING_LIMIT:
    stream.start_time = get_current_time()
    break
```

Then we reset the streaming time and begin a new request, as shown below.

```python
if stream.result_end_time > 0:
    stream.final_request_end_time = stream.is_final_end_time
stream.result_end_time = 0
stream.last_audio_input = []
stream.last_audio_input = stream.audio_input
stream.audio_input = []
stream.restart_counter = stream.restart_counter + 1

if not stream.last_transcript_was_final:
    sys.stdout.write("\n")
# start a new stream
stream.new_stream = True
```

Related Article

Transform Your Audio Files with Ease: Effortlessly Transcribe and Translate at Scale with Google Cloud

Code Source on GitHub

GitHub

👉 Check out my blog