Audio Analyser: Azure Speech, NLP, and Translation Pipeline

TL;DR. Audio Analyser uses Azure Cognitive Services speech-to-text neural models, Text Analytics NLP, and CherryPy to convert audio recordings into searchable transcripts with sentiment scores, keyword extraction, and multilingual translations.

Executive Summary / Key Takeaways

Azure Batch Transcription API accepts audio files up to 2.5 hours (WAV/MP3/OGG/FLAC), processes them asynchronously, and returns a recognizedPhrases JSON array with per-phrase nBest candidates, confidence scores, inverse-text-normalised (ITN) output, and optional speaker diarisation — no streaming connection required (Microsoft Azure, 2024).

Microsoft's neural acoustic models reduced word error rate by approximately 50% relative to earlier hidden Markov model (HMM) baselines on the Switchboard conversational benchmark, reaching parity with professional human transcribers on that dataset at ~5.1% WER (Xiong et al., Microsoft Research, 2016/2021 update).

Azure Text Analytics (now part of Azure AI Language) processes transcript text through key phrase extraction, named entity recognition (NER), sentiment analysis with opinion mining, and language detection. With the Python SDK, each feature has its own client method (for example analyze_sentiment), and several of them can be batched into a single begin_analyze_actions call.

CherryPy provides the web layer: URL routing, multipart upload handling, session management, and Jinja2 template rendering in a minimal Python process that can run on a single low-cost VM without orchestration overhead.

Azure Translator NMT auto-detects the source language and translates transcripts into any of 135 target languages, enabling downstream NLP analysis on both original and translated text within the same pipeline run.

Audio Analyser is an open-source Python application that connects three Azure Cognitive Services into a single workflow: Batch Transcription for speech-to-text, Azure AI Language (Text Analytics) for NLP, and Azure Translator for multilingual output. The web interface is served by CherryPy, and results can be persisted to JSON, plain text, or a local SQLite database.

This article describes the technical architecture of each pipeline stage, the Azure API contracts, and the design choices made in the CherryPy layer.

How Audio Analyser Works: Architecture Overview

The pipeline has five discrete stages:

Upload — the user submits an audio file through the CherryPy web interface. CherryPy stores the file in a temporary directory and returns a job ID.
Transcription — Audio Analyser submits the file to the Azure Batch Transcription REST API. Because batch transcription is asynchronous, the application polls the job status endpoint at intervals and waits for the Succeeded state before proceeding.
NLP — the raw transcript text is passed to Azure AI Language for key phrase extraction, NER, sentiment analysis, and language detection.
Translation (optional) — if a target language is specified, the transcript is sent to Azure Translator, and NLP analysis is re-run on the translated text.
Output — results are written to the selected output format (JSON, TXT, or SQLite) and rendered in the CherryPy web UI.
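
The polling step in the transcription stage can be sketched as follows. This is a minimal illustration rather than the project's actual code: wait_for_transcription and its get_status callable are hypothetical names, and the backoff factor and timeout are assumed values. The status strings are the ones the batch transcription service reports (NotStarted, Running, Succeeded, Failed).

```python
import time
from typing import Callable

# Terminal states reported by the batch transcription status endpoint.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_transcription(
    get_status: Callable[[], str],
    interval_s: float = 10.0,
    max_interval_s: float = 60.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the job reaches a terminal state.

    get_status is a hypothetical callable that returns the job's current
    status string, for example by issuing a GET against the job URL.
    """
    waited = 0.0
    delay = interval_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Back off gradually so long-running jobs do not hammer the API.
        delay = min(delay * 1.5, max_interval_s)
```

Injecting the sleep function keeps the loop easy to test without real delays; the caller decides what to do when the returned status is Failed.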

Apart from the Python standard library, the listed runtime dependencies are azure-cognitiveservices-speech, azure-ai-textanalytics, azure-ai-translation-text, and cherrypy; the Translator snippet later in this article also uses requests for its REST call. All Azure credentials are read from environment variables.

Azure Cognitive Services: The Batch Transcription Engine

The Azure Speech service batch transcription API (/speechtotext/v3.0/transcriptions) accepts a reference to an audio file in Azure Blob Storage and a configuration JSON body. Audio Analyser uploads the local file to Blob Storage, generates a shared access signature (SAS) URL for the blob, and then submits the transcription job with that URL.

A minimal job submission payload:

{
  "contentUrls": ["https://<account>.blob.core.windows.net/<container>/<file>.wav?<sas>"],
  "locale": "en-US",
  "displayName": "audio-analyser-job-001",
  "properties": {
    "diarizationEnabled": true,
    "wordLevelTimestampsEnabled": true,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  }
}
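
As a rough sketch of how that payload might be assembled in Python, the helper below builds the URL, headers, and JSON body for the submission request. The function name build_transcription_request and its parameters are illustrative assumptions, not part of Audio Analyser; the endpoint path and the property names are the ones shown above.

```python
from typing import Any, Dict, Tuple


def build_transcription_request(
    content_url: str,
    region: str,
    subscription_key: str,
    locale: str = "en-US",
    display_name: str = "audio-analyser-job",
    diarization: bool = True,
) -> Tuple[str, Dict[str, str], Dict[str, Any]]:
    """Return (url, headers, payload) for a batch transcription job.

    content_url is the SAS URL of the uploaded audio blob.
    """
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.0/transcriptions"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
    }
    payload = {
        "contentUrls": [content_url],
        "locale": locale,
        "displayName": display_name,
        "properties": {
            "diarizationEnabled": diarization,
            "wordLevelTimestampsEnabled": True,
            "punctuationMode": "DictatedAndAutomatic",
            "profanityFilterMode": "Masked",
        },
    }
    return url, headers, payload
```

The three values can be passed straight to an HTTP client, for example requests.post(url, headers=headers, json=payload). Keeping the builder free of network calls makes it easy to unit-test the payload separately from the submission.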

The response recognizedPhrases array contains one object per recognised utterance. Each entry includes:

nBest[0].confidence: float between 0 and 1
nBest[0].lexical: raw words as spoken
nBest[0].itn: inverse-text-normalised form, in which spoken numbers, dates, and currencies are converted to their written form (for example, "two hundred dollars" becomes "$200")
nBest[0].display: formatted for reading, with capitalisation and punctuation
speaker: integer speaker ID when diarisation is enabled
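
The fields above are enough to build a speaker-labelled transcript. The sketch below is one possible way to do it: speaker_turns is a hypothetical helper, and the min_confidence threshold is an assumed knob rather than something the service or Audio Analyser prescribes.

```python
from typing import Any, Dict, List, Optional, Tuple


def speaker_turns(
    result: Dict[str, Any], min_confidence: float = 0.0
) -> List[Tuple[Optional[int], str]]:
    """Collapse recognizedPhrases into (speaker, text) turns.

    Uses the top nBest candidate of each phrase and merges consecutive
    phrases from the same speaker into one turn.
    """
    turns: List[Tuple[Optional[int], str]] = []
    for phrase in result.get("recognizedPhrases", []):
        candidates = phrase.get("nBest") or []
        if not candidates:
            continue
        best = candidates[0]
        # Skip phrases the recogniser was not confident about.
        if best.get("confidence", 0.0) < min_confidence:
            continue
        speaker = phrase.get("speaker")  # absent when diarisation is off
        text = best["display"]
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns
```

Joining the text of all turns gives the display-form transcript used by the later NLP stage, while the turn list itself is convenient for rendering a dialogue view.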

Custom Speech fine-tuning is available for domain-specific vocabulary. Uploading a pronunciation lexicon or adaptation corpus (a set of text sentences representative of the domain) adjusts the language model and can substantially reduce WER on specialised content such as financial terms or medical jargon.

Natural Language Processing with Azure AI Language

After transcription, Audio Analyser sends the display-form transcript to Azure AI Language via the azure-ai-textanalytics Python SDK:

import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint=os.environ["AZURE_LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_LANGUAGE_KEY"])
)

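# transcript is the display-form text from the transcription stage;
# detected_lang is the language code produced by language detection.
documents = [{"id": "1", "language": detected_lang, "text": transcript}]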

sentiment_result = client.analyze_sentiment(documents, show_opinion_mining=True)
for doc in sentiment_result:
    print(f"Sentiment: {doc.sentiment}")
    print(f"Scores: pos={doc.confidence_scores.positive:.2f} "
          f"neg={doc.confidence_scores.negative:.2f} "
          f"neu={doc.confidence_scores.neutral:.2f}")
    for sentence in doc.sentences:
        for opinion in sentence.mined_opinions:
            print(f"  Target: {opinion.target.text}, "
                  f"Assessment: {[a.text for a in opinion.assessments]}")

keyphrases_result = client.extract_key_phrases(documents)
entities_result  = client.recognize_entities(documents)

show_opinion_mining=True enables aspect-level sentiment: the API returns not just document-level polarity but specific target–assessment pairs (e.g., target="audio quality", assessment="poor"). This makes the output useful for identifying concrete issues in customer service call analysis.
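
Long recordings can produce transcripts that exceed the per-document size limit of the Language service, so the text may need to be split before it is sent. The sketch below shows one simple way to do that at sentence boundaries. The helper name chunk_transcript and the default max_chars of 5000 are assumptions; the current limit should be checked against the service documentation.

```python
import re
from typing import List


def chunk_transcript(text: str, max_chars: int = 5000) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks break after sentence-ending punctuation where possible; a single
    sentence longer than max_chars is cut at the character limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split any sentence that cannot fit in one chunk.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then become its own entry in the documents list, for example {"id": str(i), "language": detected_lang, "text": chunk}, with document-level results aggregated afterwards.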

Named entity recognition classifies spans as one of: Person, Organization, Location, Event, Product, DateTime, Quantity, IP, URL, Email, PersonType, Skill, Address, PhoneNumber.

Multilingual Support via Azure Translator

Azure Translator is invoked after language detection when the user requests a target language. The service supports 135 languages and dialects with neural machine translation (NMT). Audio Analyser calls the /translate REST endpoint without the from parameter; omitting it makes the service detect the source language automatically, so no source-language specification is required:

import os
import uuid

import requests

url = "https://api.cognitive.microsofttranslator.com/translate"
params = {"api-version": "3.0", "to": target_lang}
headers = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_TRANSLATOR_KEY"],
    "Ocp-Apim-Subscription-Region": os.environ["AZURE_TRANSLATOR_REGION"],
    "Content-type": "application/json",
    "X-ClientTraceId": str(uuid.uuid4())
}
body = [{"text": transcript}]
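response = requests.post(url, params=params, headers=headers, json=body, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of failing on the JSON lookup

result = response.json()[0]  # one result per input text; only one text was sent
translated_text = result["translations"][0]["text"]
detected_language = result["detectedLanguage"]["language"]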

After translation, Audio Analyser optionally re-runs the Text Analytics NLP pass on the translated text so that key phrase and sentiment outputs are available in both the source and target languages.
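
The response handling in the Translator snippet can be made a little more defensive. The sketch below uses a hypothetical parse_translation helper to pull out the fields the pipeline needs and to tolerate a missing detectedLanguage block, which the service only returns when the source language was auto-detected.

```python
from typing import Any, Dict, List


def parse_translation(payload: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Extract translated text and detected language from a /translate response.

    payload is the decoded JSON body: a list with one item per input text.
    """
    if not payload or not payload[0].get("translations"):
        raise ValueError("Translator response contains no translations")
    item = payload[0]
    first = item["translations"][0]
    detected = item.get("detectedLanguage") or {}
    return {
        "text": first["text"],
        "target_language": first.get("to"),
        "source_language": detected.get("language"),
        "detection_score": detected.get("score"),
    }
```

Returning a plain dictionary keeps the result easy to store alongside the NLP output for both the source and the translated text.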

Output format selection (JSON, TXT, SQLite) is set at startup. The SQLite output stores each analysis session as a row with columns for job ID, timestamp, source language, transcript, translated transcript, sentiment scores, and key phrases as a JSON blob — enabling SQL queries across sessions.
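
A minimal sketch of the SQLite output described above, using only the standard library. The table name, column names, and the choice to store the three sentiment scores as separate numeric columns are assumptions made for illustration; the actual schema in Audio Analyser may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Dict, List, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    job_id                TEXT PRIMARY KEY,
    created_at            TEXT NOT NULL,
    source_language       TEXT,
    transcript            TEXT NOT NULL,
    translated_transcript TEXT,
    sentiment_positive    REAL,
    sentiment_neutral     REAL,
    sentiment_negative    REAL,
    key_phrases           TEXT  -- JSON-encoded list of strings
)
"""


def save_session(
    conn: sqlite3.Connection,
    job_id: str,
    source_language: str,
    transcript: str,
    translated: Optional[str],
    scores: Dict[str, float],
    key_phrases: List[str],
) -> None:
    """Insert (or overwrite) one analysis session as a single row."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            job_id,
            datetime.now(timezone.utc).isoformat(),
            source_language,
            transcript,
            translated,
            scores.get("positive"),
            scores.get("neutral"),
            scores.get("negative"),
            json.dumps(key_phrases),
        ),
    )
    conn.commit()
```

With numeric score columns, cross-session questions become plain SQL, for example SELECT job_id FROM sessions WHERE sentiment_negative > 0.5.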

CherryPy as the Web Layer

CherryPy maps URL routes to Python methods using class-based controllers. Audio Analyser uses three routes:

Route	Method	Description
`GET /`	`index()`	Renders the upload form
`POST /analyse`	`analyse()`	Accepts multipart upload, triggers pipeline, returns job ID
`GET /results/<job_id>`	`results()`	Polls job status; renders result page when complete
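
The article does not say how job state is tracked between the analyse() and results() handlers, so the following is only one plausible sketch. JobStore is a hypothetical in-memory registry; a lock is used because CherryPy serves requests from a pool of worker threads.

```python
import threading
import uuid
from typing import Any, Dict, Optional


class JobStore:
    """Thread-safe in-memory registry of pipeline jobs (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, Dict[str, Any]] = {}

    def create(self) -> str:
        """Register a new job and return its ID."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "Running", "result": None, "error": None}
        return job_id

    def complete(self, job_id: str, result: Any) -> None:
        """Mark a job as finished and attach its result."""
        with self._lock:
            self._jobs[job_id].update(status="Succeeded", result=result)

    def fail(self, job_id: str, error: str) -> None:
        """Mark a job as failed and record the error message."""
        with self._lock:
            self._jobs[job_id].update(status="Failed", error=error)

    def get(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Return a copy of the job record, or None if the ID is unknown."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None
```

In this arrangement analyse() would call create() and start the pipeline in a background thread, while results() would call get() and render either a "still processing" page or the finished result. State held this way is lost on restart, which may or may not be acceptable for a given deployment.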

The minimal configuration keeps the server footprint small:

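import cherrypy

# Global server settings; sessions are enabled for every route.
cherrypy.config.update({
    "server.socket_host": "0.0.0.0",   # listen on all interfaces
    "server.socket_port": 8080,
    "tools.sessions.on": True,
    "tools.sessions.timeout": 60,      # minutes
})

# Per-path application config; empty here because everything is set globally.
conf = {"/": {}}

# AudioAnalyserApp is the controller class that exposes index(), analyse(), and results().
cherrypy.quickstart(AudioAnalyserApp(), "/", conf)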

Session state holds the current job ID, selected output format, and target translation language. CherryPy's built-in session storage is in-memory by default and can be switched to file-backed storage through configuration, so no external cache layer is required.

Frequently Asked Questions

What audio formats and file sizes does Audio Analyser accept? The Azure Batch Transcription API supports WAV, MP3, OGG, and FLAC files up to 2.5 hours in length. Longer recordings should be split before upload. Stereo files are accepted; mono conversion is not required.

How does speaker diarisation work? Setting diarizationEnabled: true in the batch transcription request activates Azure's speaker separation model. Each recognizedPhrase in the response includes a speaker integer field. The model identifies speakers by acoustic characteristics and assigns consistent IDs within a session, but does not identify who speakers are without a separate voice profile enrolment step.

Are audio files retained after transcription? Audio files are uploaded to Azure Blob Storage with a short-lived SAS URL and deleted from the temporary local directory after the upload completes. Retention of blobs in Azure Blob Storage depends on the container's lifecycle policy; by default, Audio Analyser does not set an explicit deletion policy, so configuring a short TTL rule (e.g., delete blobs older than 1 day) in the Azure portal is recommended for production deployments.
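
The short TTL rule mentioned above can also be expressed as a storage-account lifecycle management policy instead of being set by hand in the portal. The sketch below generates such a policy document; the helper name, the rule name, and the audio-uploads/ container prefix are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict


def blob_ttl_policy(
    container_prefix: str,
    days: int = 1,
    rule_name: str = "delete-transcribed-audio",
) -> Dict[str, Any]:
    """Build a lifecycle policy that deletes block blobs after `days` days.

    container_prefix starts with the container name, e.g. "audio-uploads/".
    """
    return {
        "rules": [
            {
                "enabled": True,
                "name": rule_name,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [container_prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "delete": {"daysAfterModificationGreaterThan": days}
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Print the policy so it can be saved to a file and applied to the account.
    print(json.dumps(blob_ttl_policy("audio-uploads/"), indent=2))
```

Lifecycle rules are evaluated periodically by the storage service, so deletion is not immediate once the threshold passes.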

Can the NLP analysis be run without translation? Yes. Translation is an optional pipeline stage controlled by the --target-lang CLI flag or the target language dropdown in the web UI. When no target language is selected, the pipeline runs speech-to-text and Text Analytics only.

References

Microsoft. Batch transcription overview — Azure AI services. Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription
Xiong, W. et al. "Achieving Human Parity in Conversational Speech Recognition." Microsoft Research Technical Report, 2016; updated 2021. https://arxiv.org/abs/1610.05256
Microsoft. What is Azure AI Language? Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/language-service/overview
Microsoft. Azure AI Translator — Supported languages. Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/ai-services/translator/language-support

Last reviewed 2026-07-28.

This article is licensed under Creative Commons Attribution 4.0 International. Republication requires attribution to the canonical URL.

Originally published at https://sebastienrousseau.com/2024-01-29-ai-powered-audio-insights-analysis-translations/ by Sebastien Rousseau.