...

Audio Spectrogram Transformers Beyond the Lab


Want to know what draws me to soundscape analysis?

It’s a field that combines science, creativity, and exploration in a way few others do. First of all, your laboratory is wherever your feet take you — a forest trail, a city park, or a remote mountain path can all become spaces for scientific discovery and acoustic investigation. Secondly, monitoring a chosen geographic area is all about creativity. Innovation is at the heart of environmental audio research, whether it’s rigging up a custom device, hiding sensors in tree canopies, or using solar power for off-grid setups. Finally, the sheer volume of data is truly incredible, and as we know, in spatial analysis, all methods are fair game. From hours of animal calls to the subtle hum of urban machinery, the acoustic data collected can be vast and complex, and that opens the door to using everything from deep learning to geographical information systems (GIS) in making sense of it all.

After my earlier adventures with soundscape analysis of one of Poland’s rivers, I decided to raise the bar by designing and implementing a solution capable of analysing soundscapes in real time. In this blog post, you’ll find a description of the proposed method, along with the code that powers the entire process, built mainly around an Audio Spectrogram Transformer (AST) for sound classification.

Device prototype
Outdoor/Urban version of the sensor prototype (image by author)

Methods

Setup

There are many reasons why, in this particular case, I chose to use a combination of Raspberry Pi 4 and AudioMoth. Believe me, I tested a wide range of devices — from less power-hungry models of the Raspberry Pi family, through various Arduino versions, including the Portenta, all the way to the Jetson Nano. And that was just the beginning. Choosing the right microphone turned out to be even more complicated.

Ultimately, I went with the Pi 4 B (4GB RAM) because of its solid performance and relatively low power consumption (~700 mA when running my code). Additionally, pairing it with the AudioMoth in USB microphone mode gave me a lot of flexibility during prototyping. AudioMoth is a powerful device with a wealth of configuration options, e.g. sampling rates from 8 kHz up to a stunning 384 kHz. I have a strong feeling that, in the long run, it will prove to be a perfect choice for my soundscape studies.

AudioMoth USB Microphone configuration app. Remember to flash the device with the proper firmware before configuring it.

Capturing sound

Capturing audio from a USB microphone using Python turned out to be surprisingly troublesome. After struggling with various libraries for a while, I decided to fall back on the good old Linux arecord. The whole sound capture mechanism is encapsulated in the following command:

arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav

I’m deliberately using ALSA’s plughw (plug) device to enable automatic format and rate conversion in case I later change the USB microphone configuration. The AST runs on 16 kHz samples, so both the recording and the AudioMoth sampling rate are set to this value.

Pay attention to the generator in the code below. It’s important that the device continuously captures audio at the intervals I specify. I aimed to keep only the most recent audio sample on the device and to discard it after classification. This approach will be especially useful later, during larger-scale studies in urban areas, as it helps protect people’s privacy and supports GDPR compliance.

import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Resolve a human-readable device name to an ALSA `plughw:card,device`
        # identifier by parsing the output of `arecord -l`.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]

        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        with TemporaryDirectory() as temp_dir:
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )

            while True:
                # Record one sample to the temporary file, load it, and yield it;
                # only the most recent recording ever exists on disk.
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(
                    temp_file,
                    sr=self.sampling_rate,
                )
                await asyncio.sleep(capture_delay)
                yield data
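
Below is a minimal sketch of how this generator might be consumed. The device name (AudioMoth) and the one-second sample duration are illustrative assumptions rather than fixed parts of the setup.

import asyncio


async def demo_capture() -> None:
    device = AudioDevice(
        name='AudioMoth',  # assumed device name as listed by `arecord -l`
        channels=1,
        sampling_rate=16000,
        format='S16_LE',
    )
    async for sample in device.continuous_capture(sample_duration=1):
        # Each iteration yields the most recent one-second sample as a NumPy array.
        print(f'Captured {sample.shape[0]} samples at {device.sampling_rate} Hz')


asyncio.run(demo_capture())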

Classification

Now for the most exciting part.

Using the Audio Spectrogram Transformer (AST) and the excellent HuggingFace ecosystem, we can efficiently analyse audio and classify detected segments into over 500 categories.
Note that I’ve prepared the system to support various pre-trained models. By default, I use MIT/ast-finetuned-audioset-10-10-0.4593, as it delivers the best results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX is also worth exploring, especially its quantised version, which requires less memory and returns inference results more quickly.

You may notice that I’m not limiting the model to a single classification label, and that’s intentional. Instead of assuming that only one sound source is present at any given time, I apply a sigmoid function to the model’s logits to obtain independent probabilities for each class. This allows the model to express confidence in multiple labels simultaneously, which is crucial for real-world soundscapes where overlapping sources — like birds, wind, and distant traffic — often occur together. Taking the top five results ensures that the system captures the most likely sound events in the sample without forcing a winner-takes-all decision.

from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)

        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.ndarray,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()

            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )

To run the ONNX version of the model, you need to add Optimum to your dependencies.
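
For reference, the classifier might be instantiated and exercised roughly as follows. The quantised ONNX file name below is only an example; check the onnx subfolder of the model repository for the exact file names.

import asyncio

import numpy as np


async def demo_classification() -> None:
    # Default PyTorch checkpoint.
    classifier = AudioClassifier(
        pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593',
    )
    # ONNX alternative (file name is an assumption):
    # classifier = AudioClassifier(
    #     pretrained_ast='onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX',
    #     pretrained_ast_file_name='model_quantized.onnx',
    # )

    # One second of silence at 16 kHz, just to exercise the pipeline.
    audio = np.zeros(16000, dtype=np.float32)
    print(await classifier.predict(audio, top_k=5))


asyncio.run(demo_classification())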

Sound pressure level

Along with the audio classification, I capture information on the sound pressure level. This way, the system not only identifies what made a sound but also gains insight into how strongly each sound was present. As a result, it captures a richer, more realistic representation of the acoustic scene and can eventually be used to derive finer-grained information about noise pollution.

import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(audio: np.ndarray, gain=10 + 15, sensitivity=-18) -> np.ndarray:
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)

The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) values are set specifically for the AudioMoth and were confirmed experimentally. If you are using a different device, you will need to look these values up in its technical specification.
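
To tie the pieces together, here is a minimal sketch of a main loop that combines capture, classification, and SPL measurement. The device name and model choice are the same illustrative assumptions as above, and error handling and storage are left out.

import asyncio


async def monitor() -> None:
    device = AudioDevice(
        name='AudioMoth', channels=1, sampling_rate=16000, format='S16_LE',
    )
    classifier = AudioClassifier(
        pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593',
    )

    async for sample in device.continuous_capture(sample_duration=1):
        predictions = await classifier.predict(sample, top_k=5)
        spl = await calculate_sound_pressure_level(sample)
        print(f'{float(spl):.1f} dB SPL')
        print(predictions)


asyncio.run(monitor())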

Storage

Data from each sensor is synchronised with a PostgreSQL database every 30 seconds. The current urban soundscape monitor prototype uses an Ethernet connection; therefore, I am not restricted in terms of network load. The device for more remote areas will synchronise the data each hour using a GSM connection.

label           score        device   sync_id                                sync_time
Hum             0.43894055   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Mains hum       0.3894045    yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Static          0.06389702   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Buzz            0.047603738  yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
White noise     0.03204195   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Bee, wasp, etc. 0.40881288   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Fly, housefly   0.38868183   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Insect          0.35616025   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Speech          0.23579548   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Buzz            0.105577625  yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
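
The synchronisation layer itself is not the focus of this post, but a rough sketch could look like the following. The table name, column names, and connection handling are assumptions based on the sample rows above, not the exact production code.

import uuid
from datetime import datetime, timezone

import pandas as pd
import psycopg2


def sync_predictions(predictions: pd.DataFrame, device: str, dsn: str) -> None:
    # Hypothetical table and column names, mirroring the sample rows above.
    sync_id = str(uuid.uuid4())
    sync_time = datetime.now(timezone.utc)
    rows = [
        (row.label, float(row.score), device, sync_id, sync_time)
        for row in predictions.itertuples(index=False)
    ]
    with psycopg2.connect(dsn) as connection, connection.cursor() as cursor:
        cursor.executemany(
            'INSERT INTO observations (label, score, device, sync_id, sync_time) '
            'VALUES (%s, %s, %s, %s, %s)',
            rows,
        )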

Results

A separate application, built using Streamlit and Plotly, accesses this data. Currently, it displays information about the device’s location, temporal SPL (sound pressure level), identified sound classes, and a range of acoustic indices.

Dashboard
Streamlit analytical dashboard (image by author)
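
The dashboard code is beyond the scope of this post, but a stripped-down Streamlit page along these lines gives the general idea; the connection string and table name are placeholders rather than the actual implementation.

import pandas as pd
import plotly.express as px
import streamlit as st
from sqlalchemy import create_engine

# Placeholder connection string and table name.
engine = create_engine('postgresql+psycopg2://user:password@host:5432/soundscapes')
df = pd.read_sql('SELECT * FROM observations', engine)

st.title('Urban soundscape monitor')
st.dataframe(df.sort_values('sync_time', ascending=False).head(50))

label_counts = df.groupby('label', as_index=False).size()
st.plotly_chart(px.bar(label_counts, x='label', y='size', title='Detected sound classes'))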

And now we are good to go. The plan is to extend the sensor network to around 20 devices scattered across multiple locations in my city. More information about the larger-scale sensor deployment will follow soon.

Moreover, I’m collecting data from a deployed sensor and plan to share the data package, dashboard, and analysis in an upcoming blog post. I’ll use an interesting approach that warrants a deeper dive into audio classification. The main idea is to match different sound pressure levels to the detected audio classes. I hope to find a better way of describing noise pollution. So stay tuned for a more detailed breakdown soon.

In the meantime, you can read the preliminary paper on my soundscape studies (headphones are obligatory).


This post was proofread and edited using Grammarly to improve grammar and clarity.

