Want to know what draws me to soundscape analysis?
It’s a field that combines science, creativity, and exploration in a way few others do. First of all, your laboratory is wherever your feet take you — a forest trail, a city park, or a remote mountain path can all become spaces for scientific discovery and acoustic investigation. Secondly, monitoring a chosen geographic area is all about creativity. Innovation is at the heart of environmental audio research, whether it’s rigging up a custom device, hiding sensors in tree canopies, or using solar power for off-grid setups. Finally, the sheer volume of data is truly incredible, and as we know, in spatial analysis, all methods are fair game. From hours of animal calls to the subtle hum of urban machinery, the acoustic data collected can be vast and complex, and that opens the door to using everything from deep learning to geographical information systems (GIS) in making sense of it all.
After my earlier adventures with soundscape analysis of one of Poland’s rivers, I decided to raise the bar and design and implement a solution capable of analysing soundscapes in real time. In this blog post, you’ll find a description of the proposed method, along with some code that powers the entire process, mainly using an Audio Spectrogram Transformer (AST) for sound classification.
Methods
Setup
There are many reasons why, in this particular case, I chose to use a combination of Raspberry Pi 4 and AudioMoth. Believe me, I tested a wide range of devices — from less power-hungry models of the Raspberry Pi family, through various Arduino versions, including the Portenta, all the way to the Jetson Nano. And that was just the beginning. Choosing the right microphone turned out to be even more complicated.
Ultimately, I went with the Pi 4 B (4 GB RAM) because of its solid performance and relatively low power draw (roughly 700 mA while running my code). Additionally, pairing it with the AudioMoth in USB microphone mode gave me a lot of flexibility during prototyping. AudioMoth is a powerful device with a wealth of configuration options, e.g. sampling rates from 8 kHz up to a stunning 384 kHz. I have a strong feeling that, in the long run, this will prove to be a perfect choice for my soundscape studies.
Capturing sound
Capturing audio from a USB microphone using Python turned out to be surprisingly troublesome. After struggling with various libraries for a while, I decided to fall back on the good old Linux arecord. The whole sound capture mechanism is encapsulated in the following command:
arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav
I’m deliberately using a plughw (plug-in) device so that format conversion happens automatically if I later change the USB microphone configuration. AST runs on 16 kHz samples, so both the recording and the AudioMoth sampling rate are set to this value.
Pay attention to the generator in the code below. It’s important that the device continuously captures audio at the intervals I specify. I aimed to store only the most recent audio sample on the device and discard it after classification. This approach will be especially useful later during larger-scale studies in urban areas, as it helps protect people’s privacy and supports GDPR compliance.
import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Find the ALSA card/device pair whose `arecord -l` entry matches the given name.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]
        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        # Record short clips in a loop; only the most recent clip exists on disk at any time.
        with TemporaryDirectory() as temp_dir:
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )
            while True:
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(
                    temp_file,
                    sr=self.sampling_rate,
                )
                await asyncio.sleep(capture_delay)
                yield data
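For context, here is a minimal sketch of how the generator can be consumed. The device name is an assumption on my part; it should be whatever substring appears in your own arecord -l output for the AudioMoth in USB microphone mode.

import asyncio


async def main():
    device = AudioDevice(
        name='AudioMoth',  # assumed substring of the `arecord -l` entry
        channels=1,
        sampling_rate=16000,
        format='S16_LE',
    )
    async for sample in device.continuous_capture(sample_duration=1):
        print(sample.shape)  # one second of mono audio at 16 kHz


asyncio.run(main())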
Classification
Now for the most exciting part.
Using the Audio Spectrogram Transformer (AST) and the excellent HuggingFace ecosystem, we can efficiently analyse audio and classify detected segments into over 500 categories.
Note that I’ve prepared the system to support various pre-trained models. By default, I use MIT/ast-finetuned-audioset-10-10-0.4593, as it delivers the best results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX is also worth exploring, especially its quantised version, which requires less memory and returns inference results faster.
You may notice that I’m not limiting the model to a single classification label, and that’s intentional. Instead of assuming that only one sound source is present at any given time, I apply a sigmoid function to the model’s logits to obtain independent probabilities for each class. This allows the model to express confidence in multiple labels simultaneously, which is crucial for real-world soundscapes where overlapping sources — like birds, wind, and distant traffic — often occur together. Taking the top five results ensures that the system captures the most likely sound events in the sample without forcing a winner-takes-all decision.
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        # Load either the ONNX export (via Optimum) or the regular PyTorch checkpoint.
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)

        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.ndarray,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            # Sigmoid instead of softmax: each class gets an independent probability.
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()
            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )
To run the ONNX version of the model, you need to add Optimum to your dependencies.
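Putting the two pieces together, the main loop looks roughly like the sketch below. The model id is the default one mentioned above; the device name is, again, an assumption that depends on your arecord -l output.

import asyncio


async def main():
    device = AudioDevice(name='AudioMoth', channels=1, sampling_rate=16000, format='S16_LE')
    classifier = AudioClassifier(pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593')

    # Runs indefinitely: each one-second sample is classified and then discarded.
    async for sample in device.continuous_capture(sample_duration=1):
        predictions = await classifier.predict(sample, top_k=5)
        print(predictions)


asyncio.run(main())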
Sound pressure level
Along with the audio classification, I capture information on the sound pressure level. This way, the system not only identifies what made a sound but also gains insight into how strongly each sound was present. The result is a richer, more realistic representation of the acoustic scene, which can eventually be used to derive finer-grained information about noise pollution.
import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(
    audio: np.ndarray,
    gain=10 + 15,
    sensitivity=-18,
) -> np.ndarray:
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)
The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) values are set for the AudioMoth and were confirmed experimentally. If you are using a different device, you should determine these values from its technical specification.
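As a quick sanity check, the function can be run on any 16 kHz clip, for example the file written by the arecord command shown earlier:

import asyncio

import librosa

# /tmp/audio.wav matches the arecord example above; any mono 16 kHz recording will do.
audio, _ = librosa.load('/tmp/audio.wav', sr=16000)
spl = asyncio.run(calculate_sound_pressure_level(audio))
print(f'{spl:.1f} dB SPL')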
Storage
Data from each sensor is synchronised with a PostgreSQL database every 30 seconds. The current urban soundscape monitor prototype uses an Ethernet connection; therefore, I am not restricted in terms of network load. The device for more remote areas will synchronise the data each hour using a GSM connection.
label           | score       | device | sync_id                              | sync_time
Hum             | 0.43894055  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Mains hum       | 0.3894045   | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Static          | 0.06389702  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Buzz            | 0.047603738 | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
White noise     | 0.03204195  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Bee, wasp, etc. | 0.40881288  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Fly, housefly   | 0.38868183  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Insect          | 0.35616025  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Speech          | 0.23579548  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Buzz            | 0.105577625 | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
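For completeness, here is a minimal sketch of how such a sync could look, assuming a table and column layout matching the sample rows above. The connection string, table name, and the choice of psycopg2 as the driver are my assumptions here, not a description of the actual implementation; in practice the rows are buffered locally and flushed in 30-second batches (or hourly over GSM).

import uuid
from datetime import datetime, timezone

import psycopg2

# Hypothetical connection string and table name.
conn = psycopg2.connect('postgresql://user:password@db-host:5432/soundscapes')


def sync_predictions(predictions, device_name: str):
    # One sync_id groups the top-k labels of a single classified sample.
    sync_id = str(uuid.uuid4())
    sync_time = datetime.now(timezone.utc)
    rows = [
        (row.label, float(row.score), device_name, sync_id, sync_time)
        for row in predictions.itertuples()
    ]
    with conn.cursor() as cur:
        cur.executemany(
            'INSERT INTO sound_events (label, score, device, sync_id, sync_time) '
            'VALUES (%s, %s, %s, %s, %s)',
            rows,
        )
    conn.commit()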
Results
A separate application, built using Streamlit and Plotly, accesses this data. Currently, it displays information about the device’s location, temporal SPL (sound pressure level), identified sound classes, and a range of acoustic indices.
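The dashboard itself is beyond the scope of this post, but as a rough illustration, a Streamlit page reading the hypothetical table sketched above and plotting the classification scores over time could start out like this (the query, connection string, and column names are assumptions):

import pandas as pd
import plotly.express as px
import streamlit as st
from sqlalchemy import create_engine

# Hypothetical connection string; reuses the table layout sketched earlier.
engine = create_engine('postgresql://user:password@db-host:5432/soundscapes')

st.title('Urban soundscape monitor')

# Plot per-label scores over time, one row of panels per device.
df = pd.read_sql('SELECT label, score, device, sync_time FROM sound_events', engine)
fig = px.scatter(df, x='sync_time', y='score', color='label', facet_row='device')
st.plotly_chart(fig, use_container_width=True)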
And now we are good to go. The plan is to extend the sensor network to around 20 devices scattered across multiple locations in my city. More information about the larger-area sensor deployment will be available soon.
Moreover, I’m collecting data from a deployed sensor and plan to share the data package, dashboard, and analysis in an upcoming blog post. I’ll use an interesting approach that warrants a deeper dive into audio classification. The main idea is to match different sound pressure levels to the detected audio classes. I hope to find a better way of describing noise pollution. So stay tuned for a more detailed breakdown soon.
In the meantime, you can read the preliminary paper on my soundscape studies (headphones are obligatory).
This post was proofread and edited using Grammarly to improve grammar and clarity.