Sebastien Rousseau

Audio Analyser: Azure Speech, NLP, and Translation Pipeline

Architecture and pipeline of an Azure-backed speech analytics tool

7 min read

Executive Summary / Key Takeaways

  • Azure Batch Transcription API accepts audio files up to 2.5 hours (WAV/MP3/OGG/FLAC), processes them asynchronously, and returns a recognizedPhrases JSON array with per-phrase nBest candidates, confidence scores, inverse-text-normalised (ITN) output, and optional speaker diarisation — no streaming connection required (Microsoft Azure, 2024).
  • Microsoft's neural acoustic models reduced word error rate by approximately 50% relative to earlier hidden Markov model (HMM) baselines on the Switchboard conversational benchmark. The 2016 system reached parity with professional human transcribers on that dataset at about 5.8 to 5.9% WER, and a follow-up system in 2017 reported 5.1% (Xiong et al., Microsoft Research, 2016/2021 update).
  • Azure Text Analytics (now part of Azure AI Language) processes transcript text through key phrase extraction, named entity recognition (NER), sentiment analysis with opinion mining, and language detection. With the Python SDK, each analysis has its own method (for example analyze_sentiment or extract_key_phrases), and several can be batched into a single begin_analyze_actions call.
  • CherryPy provides the web layer: URL routing, multipart upload handling, session management, and Jinja2 template rendering in a minimal Python process that can run on a single low-cost VM without orchestration overhead.
  • Azure Translator NMT auto-detects the source language and translates transcripts into any of 135 target languages, enabling downstream NLP analysis on both original and translated text within the same pipeline run.

Audio Analyser is an open-source Python application that connects three Azure Cognitive Services into a single workflow: Batch Transcription for speech-to-text, Azure AI Language (Text Analytics) for NLP, and Azure Translator for multilingual output. The web interface is served by CherryPy, and results can be persisted to JSON, plain text, or a local SQLite database.

This article describes the technical architecture of each pipeline stage, the Azure API contracts, and the design choices made in the CherryPy layer.

How Audio Analyser Works: Architecture Overview

The pipeline has five discrete stages:

  1. Upload — the user submits an audio file through the CherryPy web interface. CherryPy stores the file in a temporary directory and returns a job ID.
  2. Transcription — Audio Analyser submits the file to the Azure Batch Transcription REST API. Because batch transcription is asynchronous, the application polls the job status endpoint at intervals and waits for the Succeeded state before proceeding.
  3. NLP — the raw transcript text is passed to Azure AI Language for key phrase extraction, NER, sentiment analysis, and language detection.
  4. Translation (optional) — if a target language is specified, the transcript is sent to Azure Translator, and NLP analysis is re-run on the translated text.
  5. Output — results are written to the selected output format (JSON, TXT, or SQLite) and rendered in the CherryPy web UI.
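The polling behaviour in stage 2 can be sketched as a small helper. This is a minimal illustration, not the project's actual code: wait_for_transcription, get_status, and the interval and timeout defaults are hypothetical names and values, and get_status stands for whatever function fetches the job's status string from the Speech service.

```python
import time
from typing import Callable

# Terminal job states reported by the batch transcription service.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_transcription(
    get_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"transcription still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing sleep as a parameter keeps the loop easy to test without real delays. The caller decides what to do when the returned state is Failed rather than Succeeded.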

The runtime dependencies outside the Python standard library are azure-cognitiveservices-speech, azure-ai-textanalytics, azure-ai-translation-text, and cherrypy; the Translator example later in this article additionally imports requests. All Azure credentials are read from environment variables.
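As a sketch of how the environment-variable approach might be validated at startup, the helper below checks that the variables used in this article's snippets are present. The function name load_settings is hypothetical, and the list covers only the Language and Translator variables that appear later in the article; the Speech and Blob Storage variable names are not shown in the source, so they are left out.

```python
import os
from typing import Mapping

# Variable names taken from the snippets in this article.
REQUIRED_VARS = (
    "AZURE_LANGUAGE_ENDPOINT",
    "AZURE_LANGUAGE_KEY",
    "AZURE_TRANSLATOR_KEY",
    "AZURE_TRANSLATOR_REGION",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or fail early listing what is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with the full list of missing names tends to be easier to diagnose than a KeyError raised midway through a pipeline run.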

Azure Cognitive Services: The Batch Transcription Engine

The Azure Speech service batch transcription API (/speechtotext/v3.0/transcriptions) accepts a reference to an audio file in Azure Blob Storage and a configuration JSON body. Audio Analyser uploads the local file to Blob Storage using a pre-signed SAS URL, then submits the transcription job.

A minimal job submission payload:

{
  "contentUrls": ["https://<account>.blob.core.windows.net/<container>/<file>.wav?<sas>"],
  "locale": "en-US",
  "displayName": "audio-analyser-job-001",
  "properties": {
    "diarizationEnabled": true,
    "wordLevelTimestampsEnabled": true,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  }
}
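A payload like the one above could be built and submitted along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, the regional host pattern and the self field in the response follow the v3.0 REST API as commonly documented, and the standard library's urllib is used here only to keep the example free of extra dependencies.

```python
import json
import urllib.request


def build_transcription_payload(content_url, locale="en-US",
                                display_name="audio-analyser-job"):
    """Build the JSON body for a batch transcription job."""
    return {
        "contentUrls": [content_url],
        "locale": locale,
        "displayName": display_name,
        "properties": {
            "diarizationEnabled": True,
            "wordLevelTimestampsEnabled": True,
            "punctuationMode": "DictatedAndAutomatic",
            "profanityFilterMode": "Masked",
        },
    }


def submit_transcription(region, speech_key, payload):
    """POST the payload and return the job URL from the response's 'self' field."""
    # Assumed host pattern for a regional Speech resource.
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": speech_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["self"]
```

The returned URL is the one a status poller would query until the job reports Succeeded or Failed.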

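The recognizedPhrases array in the response contains one object per recognised utterance. Each entry includes:

  • recognitionStatus: whether recognition succeeded for the phrase.
  • channel and, when diarisation is enabled, speaker: the audio channel and an integer speaker label.
  • offset and duration as ISO 8601 durations, plus offsetInTicks and durationInTicks in 100-nanosecond units.
  • nBest: the candidate transcriptions, each with a confidence score and the lexical, itn, maskedITN, and display forms of the text; word-level timings appear under words when wordLevelTimestampsEnabled is true.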

Custom Speech fine-tuning is available for domain-specific vocabulary. Uploading a pronunciation lexicon or adaptation corpus (a set of text sentences representative of the domain) adjusts the language model and can substantially reduce WER on specialised content such as financial terms or medical jargon.
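To make the recognizedPhrases structure concrete, the sketch below turns a transcription result into speaker-labelled display lines. The function name phrases_to_lines is hypothetical, and the field names (offsetInTicks, speaker, nBest, confidence, display) are assumed to match the v3.0 result format; the highest-confidence candidate is picked explicitly rather than relying on the order of nBest.

```python
def phrases_to_lines(result: dict) -> list[str]:
    """Return one 'Speaker N: text' line per recognised phrase, in time order."""
    phrases = sorted(
        result.get("recognizedPhrases", []),
        key=lambda phrase: phrase.get("offsetInTicks", 0),
    )
    lines = []
    for phrase in phrases:
        candidates = phrase.get("nBest") or []
        if not candidates:
            continue  # nothing recognised for this phrase
        best = max(candidates, key=lambda c: c.get("confidence", 0.0))
        speaker = phrase.get("speaker")
        prefix = f"Speaker {speaker}: " if speaker is not None else ""
        lines.append(prefix + best["display"])
    return lines
```

Joining the returned lines with newlines gives a display-form transcript suitable for the NLP stage.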

Natural Language Processing with Azure AI Language

After transcription, Audio Analyser sends the display-form transcript to Azure AI Language via the azure-ai-textanalytics Python SDK:

import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are read from environment variables.
client = TextAnalyticsClient(
    endpoint=os.environ["AZURE_LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_LANGUAGE_KEY"])
)

# transcript and detected_lang are produced by the earlier pipeline stages.
documents = [{"id": "1", "language": detected_lang, "text": transcript}]

# Document-level sentiment plus target/assessment pairs (opinion mining).
sentiment_result = client.analyze_sentiment(documents, show_opinion_mining=True)
for doc in sentiment_result:
    if doc.is_error:  # skip documents the service could not process
        continue
    print(f"Sentiment: {doc.sentiment}")
    print(f"Scores: pos={doc.confidence_scores.positive:.2f} "
          f"neg={doc.confidence_scores.negative:.2f} "
          f"neu={doc.confidence_scores.neutral:.2f}")
    for sentence in doc.sentences:
        for opinion in sentence.mined_opinions:
            print(f"  Target: {opinion.target.text}, "
                  f"Assessment: {[a.text for a in opinion.assessments]}")

# Key phrases and named entities use separate calls on the same documents.
keyphrases_result = client.extract_key_phrases(documents)
entities_result = client.recognize_entities(documents)

show_opinion_mining=True enables aspect-level sentiment: the API returns not just document-level polarity but specific target–assessment pairs (e.g., target="audio quality", assessment="poor"). This makes the output useful for identifying concrete issues in customer service call analysis.
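For call analysis it can help to flatten those target and assessment pairs into plain tuples. The helper below is an illustrative sketch, not part of the SDK: collect_opinions is a hypothetical name, and it relies only on the attributes used in the snippet above (is_error, sentences, mined_opinions, target, assessments), so it also works with simple stand-in objects.

```python
def collect_opinions(sentiment_docs):
    """Return (target text, target sentiment, [assessment texts]) tuples."""
    pairs = []
    for doc in sentiment_docs:
        if getattr(doc, "is_error", False):
            continue  # error entries carry no sentences
        for sentence in doc.sentences:
            for opinion in sentence.mined_opinions or []:
                pairs.append((
                    opinion.target.text,
                    opinion.target.sentiment,
                    [assessment.text for assessment in opinion.assessments],
                ))
    return pairs
```

Called as collect_opinions(sentiment_result), this would yield entries such as ("audio quality", "negative", ["poor"]) for the example given above.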

Named entity recognition classifies spans as one of: Person, Organization, Location, Event, Product, DateTime, Quantity, IP, URL, Email, PersonType, Skill, Address, PhoneNumber.

Multilingual Support via Azure Translator

Azure Translator is invoked after language detection when the user requests a target language. The service supports 135 languages and dialects with neural machine translation (NMT). Audio Analyser calls the /translate REST endpoint and omits the from parameter, which makes the service auto-detect the source language, so no source-language specification is required:

import os
import uuid

import requests

url = "https://api.cognitive.microsofttranslator.com/translate"
# No "from" parameter: the service auto-detects the source language.
params = {"api-version": "3.0", "to": target_lang}
headers = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_TRANSLATOR_KEY"],
    "Ocp-Apim-Subscription-Region": os.environ["AZURE_TRANSLATOR_REGION"],
    "Content-type": "application/json",
    "X-ClientTraceId": str(uuid.uuid4())
}
body = [{"text": transcript}]
response = requests.post(url, params=params, headers=headers, json=body, timeout=30)
response.raise_for_status()  # surface HTTP errors before reading the body
result = response.json()[0]  # one result per input text
translated_text = result["translations"][0]["text"]
detected_language = result["detectedLanguage"]["language"]

After translation, Audio Analyser optionally re-runs the Text Analytics NLP pass on the translated text so that key phrase and sentiment outputs are available in both the source and target languages.
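The shape of the Translator response can be handled a little more defensively than direct indexing. The sketch below assumes the v3.0 response layout (a list with one item per input text, each holding translations and, when auto-detection was used, detectedLanguage); parse_translation is a hypothetical helper name.

```python
def parse_translation(payload: list) -> dict:
    """Extract the first translation and any detected source language."""
    first = payload[0]
    translation = first["translations"][0]
    # detectedLanguage is only present when the source language was auto-detected.
    detected = first.get("detectedLanguage") or {}
    return {
        "text": translation["text"],
        "to": translation.get("to"),
        "detected_language": detected.get("language"),
        "detected_score": detected.get("score"),
    }
```

Keeping the detected language alongside the translated text makes it straightforward to pass the right language hint to the second NLP pass.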

Output format selection (JSON, TXT, SQLite) is set at startup. The SQLite output stores each analysis session as a row with columns for job ID, timestamp, source language, transcript, translated transcript, sentiment scores, and key phrases as a JSON blob — enabling SQL queries across sessions.
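A table matching that description could look like the following sketch. The table name, column names, and the save_session helper are assumptions made for illustration; the project's actual schema may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per analysis session.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    job_id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    source_language TEXT,
    transcript TEXT,
    translated_transcript TEXT,
    sentiment_positive REAL,
    sentiment_neutral REAL,
    sentiment_negative REAL,
    key_phrases TEXT  -- JSON-encoded list of strings
)
"""


def save_session(conn, job_id, source_language, transcript, translated,
                 scores, key_phrases):
    """Insert one analysis session, creating the table on first use."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO sessions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (job_id, datetime.now(timezone.utc).isoformat(), source_language,
         transcript, translated, scores["positive"], scores["neutral"],
         scores["negative"], json.dumps(key_phrases)),
    )
    conn.commit()
```

With rows stored this way, a cross-session query such as SELECT job_id FROM sessions WHERE sentiment_negative > 0.5 would list the calls that skewed negative.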

CherryPy as the Web Layer

CherryPy maps URL routes to Python methods using class-based controllers. Audio Analyser uses three routes:

Route                    Method       Description
GET /                    index()      Renders the upload form
POST /analyse            analyse()    Accepts multipart upload, triggers pipeline, returns job ID
GET /results/<job_id>    results()    Polls job status; renders result page when complete

The minimal configuration keeps the server footprint small:

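import cherrypy

# Global server and session settings (the session timeout is in minutes).
cherrypy.config.update({
    "server.socket_host": "0.0.0.0",
    "server.socket_port": 8080,
    "tools.sessions.on": True,
    "tools.sessions.timeout": 60
})

# Per-application config; path-specific settings would go under "/".
conf = {"/": {}}

# AudioAnalyserApp is the controller class exposing index(), analyse(), and results().
cherrypy.quickstart(AudioAnalyserApp(), "/", conf)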

Session state holds the current job ID, selected output format, and target translation language. CherryPy's built-in session storage is held in memory by default, with file-backed storage available as a configuration option, so no external cache layer is required.

Frequently Asked Questions

What audio formats and file sizes does Audio Analyser accept? The Azure Batch Transcription API supports WAV, MP3, OGG, and FLAC files up to 2.5 hours in length. Longer recordings should be split before upload. Stereo files are accepted; mono conversion is not required.
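A pre-upload duration check for WAV input might look like the sketch below. It uses the standard library's wave module, which reads uncompressed PCM WAV only, so MP3, OGG, and FLAC would need a different tool. The function names are hypothetical and the limit is the 2.5-hour figure quoted above.

```python
import wave

MAX_DURATION_S = 2.5 * 60 * 60  # 2.5-hour limit quoted in this article


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_duration_limit(path: str, limit_s: float = MAX_DURATION_S) -> bool:
    """True if the file is short enough to submit without splitting."""
    return wav_duration_seconds(path) <= limit_s
```

Running this check before the Blob Storage upload avoids spending bandwidth on a file that would need to be split anyway.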

How does speaker diarisation work? Setting diarizationEnabled: true in the batch transcription request activates Azure's speaker separation model. Each recognizedPhrase in the response includes a speaker integer field. The model identifies speakers by acoustic characteristics and assigns consistent IDs within a session, but does not identify who speakers are without a separate voice profile enrolment step.
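As an illustration of what the speaker field enables, the sketch below totals talk time per speaker label. The function name is hypothetical, and it assumes each phrase carries a durationInTicks value in 100-nanosecond units, as in the v3.0 result format.

```python
from collections import defaultdict

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def talk_time_by_speaker(result: dict) -> dict:
    """Return total speaking time in seconds keyed by speaker label."""
    totals = defaultdict(float)
    for phrase in result.get("recognizedPhrases", []):
        speaker = phrase.get("speaker")
        if speaker is None:
            continue  # diarisation was off or no label was assigned
        totals[speaker] += phrase.get("durationInTicks", 0) / TICKS_PER_SECOND
    return dict(totals)
```

The labels are only consistent within one transcription, so totals from different recordings should not be merged by speaker number.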

Are audio files retained after transcription? Audio files are uploaded to Azure Blob Storage with a short-lived SAS URL and deleted from the temporary local directory after the upload completes. Retention of blobs in Azure Blob Storage depends on the container's lifecycle policy; by default, Audio Analyser does not set an explicit deletion policy, so configuring a short TTL rule (e.g., delete blobs older than 1 day) in the Azure portal is recommended for production deployments.

Can the NLP analysis be run without translation? Yes. Translation is an optional pipeline stage controlled by the --target-lang CLI flag or the target language dropdown in the web UI. When no target language is selected, the pipeline runs speech-to-text and Text Analytics only.

References

  1. Microsoft. Batch transcription overview — Azure AI services. Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription
  2. Xiong, W. et al. "Achieving Human Parity in Conversational Speech Recognition." Microsoft Research Technical Report, 2016; updated 2021. https://arxiv.org/abs/1610.05256
  3. Microsoft. What is Azure AI Language? Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/language-service/overview
  4. Microsoft. Azure AI Translator — Supported languages. Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/translator/language-support
