
Using Azure Speech Service with PowerShell

Ned Bellavance
5 min read


The 100th episode of Buffer Overflow - a weekly tech news podcast I host - is steadily approaching. As I write this, we are getting ready to record episode 98. In preparation for the 100th episode, I thought it might be nice to look over past episodes and find common themes, running gags, and anything else that caught my eye. At an average of 35 minutes per episode, that's roughly 57 hours of combined audio. There's no way I could listen to all of the episodes again, so I started thinking. What if I could transcribe the audio to text, and then search through the text to find all the times we talked about Derrick and Miranda, how we're all doomed, or smiling poop? The Azure Speech to Text API can transcribe the speech in audio files. Why not start there?

When I decided to try the Azure Speech Service, I initially assumed that I would be able to upload the files to Azure Storage and point the service at all my MP3s. It would then dutifully transcribe all the files and dump out the transcriptions in some file repository. That is not exactly what the Speech Service does. The service itself is used in concert with the Speech SDK to transcribe snippets of about 15 seconds, the point being to integrate speech to text into your applications. That's not what I was looking to do. There is also a REST API that supports batch processing, allowing you to send a request to the transcription endpoint with a file stored in Azure Storage. The batch process will transcribe the file, and then you can retrieve the results with another call to the API. That is more like it! Let's do that.

Except, the REST API is just an API, not a GUI or a menu. And there's no PowerShell module for it. If you want to use the Speech Service SDK, there are examples on GitHub, but the batch service examples in particular are only available in C#. I haven't used C# in about six years. But I use PowerShell a lot! And PowerShell can talk to a REST API using Invoke-WebRequest or Invoke-RestMethod. Why not just write the whole thing with PowerShell functions? And that is exactly what I did.
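To make that concrete, here's a minimal sketch of submitting one batch request. I'm assuming the v2.0 batch transcription endpoint here; the region, key, and blob URL are all placeholders you'd fill in:

```powershell
# Minimal sketch of submitting one batch transcription request.
# Assumes the v2.0 batch transcription endpoint; region, key, and the
# blob URL are placeholders.
$region = 'eastus'
$key    = '<speech-service-key>'
$uri    = "https://$($region).cris.ai/api/speechtotext/v2.0/transcriptions"

$body = @{
    recordingsUrl = '<blob-url-including-sas-token>'  # audio blob, readable via SAS
    locale        = 'en-US'
    name          = 'BufferOverflow-Ep98'
} | ConvertTo-Json

$response = Invoke-WebRequest -Method Post -Uri $uri -Body $body `
    -ContentType 'application/json' `
    -Headers @{ 'Ocp-Apim-Subscription-Key' = $key }

# The Location header holds the URI to poll for status and results
$transcriptionUri = $response.Headers['Location']

$status = Invoke-RestMethod -Method Get -Uri $transcriptionUri `
    -Headers @{ 'Ocp-Apim-Subscription-Key' = $key }
$status.status    # NotStarted, Running, Succeeded, or Failed
```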

There are six functions that make up what I needed:

  • Get-AzSSBatchStatus: retrieves the current status of a particular transcript request ID
  • Get-AzSSBatchResults: retrieves the JSON result files from a successful transcription
  • New-AzSSBatchRequest: creates a new transcription request, and returns the URI of the request
  • Remove-AzSSBatchRequest: removes a completed transcription request, whether it was successful or not
  • New-AzStorageSASTokenAllBlobs: creates a SAS token for every blob in a container on an Azure Storage Account, required for the Speech Service to access the blobs (sketched just after this list)
  • New-AzSSMultiBatchRequest: creates a transcription request for every blob in a container, waits for each to complete, and saves result files to a directory
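The SAS token piece deserves a quick illustration, because the Speech Service can only read a blob if the URL you hand it carries a token. Here's roughly what New-AzStorageSASTokenAllBlobs does under the hood, using the Az.Storage module; the account, key, container name, and expiry window are all placeholders:

```powershell
# Rough sketch of generating a read-only SAS URI for every blob in a
# container with the Az.Storage module. Names and expiry are placeholders.
Import-Module Az.Storage

$context = New-AzStorageContext -StorageAccountName '<account>' `
    -StorageAccountKey '<key>'

$sasUris = Get-AzStorageBlob -Container 'podcast-audio' -Context $context |
    ForEach-Object {
        # -FullUri returns the blob URI with the SAS token appended
        New-AzStorageBlobSASToken -Container 'podcast-audio' -Blob $_.Name `
            -Permission r -ExpiryTime (Get-Date).AddHours(4) `
            -Context $context -FullUri
    }
```

The -FullUri switch saves you from gluing the token onto the blob URI yourself.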

New-AzSSMultiBatchRequest is the function that leverages most of the others to get its work done. Essentially, you upload all your audio files to a blob storage container, then run New-AzSSMultiBatchRequest and pass it the storage account information, the speech service information, and the destination directory for the results. Each blob is submitted to the Speech Service and then tracked to see whether it succeeds or fails. The successful items have their JSON results files written out to the destination directory. The failed items are reported in the final counts.
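An end-to-end run might look something like the following. Fair warning: the parameter names on New-AzSSMultiBatchRequest below are illustrative stand-ins based on the description above, not a documented contract.

```powershell
# Hypothetical end-to-end run; the New-AzSSMultiBatchRequest parameter
# names are stand-ins, not the exact ones in the scripts.
$context = New-AzStorageContext -StorageAccountName '<account>' `
    -StorageAccountKey '<key>'

# Upload the episode audio to a blob container
Get-ChildItem -Path .\episodes -Filter *.mp3 | ForEach-Object {
    Set-AzStorageBlobContent -File $_.FullName -Container 'podcast-audio' `
        -Blob $_.Name -Context $context
}

# Submit every blob for transcription and collect the JSON results
New-AzSSMultiBatchRequest -StorageAccountName '<account>' `
    -StorageAccountKey '<key>' -ContainerName 'podcast-audio' `
    -SpeechServiceKey '<speech-service-key>' -SpeechServiceRegion 'eastus' `
    -DestinationPath .\transcripts
```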

I chose the abbreviation AzSS for the Azure Speech Service. I thought maybe I would turn this into a full-blown PowerShell module, with support for each function in the REST API as defined in the Swagger document. Then I thought to myself, if there's a Swagger document, wouldn't it be possible to automatically generate a PowerShell module from the spec? I checked, and discovered that people have already written a project that does exactly that. At that point I realized that I probably could have saved a lot of time and used it to create four of the six functions in my PowerShell project.

But you know what? I had fun writing the functions, as weird as that might sound. As someone who spends entirely too much time in PowerPoint instead of PowerShell and Microsoft Word instead of VS Code, it was really nice to just get into the zone and write some half-decent PowerShell scripts. And I learned some more about reading Swagger docs and using the Invoke-WebRequest and Invoke-RestMethod cmdlets. Ultimately, it was a worthwhile exercise. And if you have a need similar to mine, you are welcome to take my scripts and use them to your heart's content.

Of course, now that I've run some transcription jobs, I've quickly realized how bad the default language model is. That's especially true for a tech podcast full of jargon, acronyms, and strange proper nouns. Now I'm thinking I need to train a model with some pre-transcribed audio files. I'll report back later on how that effort goes.