Crowdsourcing a Subjective Audio Dataset

Ever-increasing end-user bandwidths have helped establish video as the dominant form of traffic on the Internet and have made applications in the live streaming and video conferencing space feasible. Accelerated by the COVID-19 pandemic, live video streaming for both broadcast and interactive use cases is one of the fastest-growing segments within video, as evidenced by the widespread adoption of applications such as Zoom, AWS Chime, Twitch, and Facebook Live. To measure return on investment and to make sound investment decisions, it is paramount that we are able to measure the media quality these systems deliver.

Audio quality degradations occur frequently in live streaming and video conferencing applications, and for many use cases they are the dominant influence on the viewer’s Quality of Experience (QoE). At LiveSensus, our goal is to accurately measure live audio quality. This includes capturing and labeling audio quality degradations and other audio features, as well as providing a meaningful quality score that reflects how people actually perceive quality. To that end, we created a novel audio dataset featuring conversational audio with the kinds of quality issues that are common in live streaming.

The main focus of the dataset is to capture perception-based quality scores for some of the most common and annoying audio issues in a live Voice-over-IP (VoIP) setting. To do this, we created an online survey (livesensus.com/Survey) that recorded users’ reactions to various audio clips with quality issues. This had the added benefit of letting users listen on the same equipment they would normally use when viewing or participating in live streams. It also forced us to focus on the most annoying and distracting audio issues, since these would be detectable across many different types of speakers and headphones. To gauge each listener’s perception of quality, we relied on the Mean Opinion Score (MOS).
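
For reference, the sketch below shows one common way to turn the raw 1-to-5 ratings collected for a clip into a MOS with a rough confidence interval. The function name and the interval calculation are illustrative, not part of our survey pipeline.

```python
import statistics

def mean_opinion_score(ratings):
    """Average raw 1-5 opinion scores for one clip into a MOS,
    with a rough 95% confidence interval."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    # Normal-approximation interval; reasonable once a clip has ~20 ratings.
    stderr = statistics.stdev(ratings) / n ** 0.5 if n > 1 else 0.0
    return mos, (mos - 1.96 * stderr, mos + 1.96 * stderr)

# Example: hypothetical ratings collected for a single degraded clip.
print(mean_opinion_score([4, 3, 4, 5, 3, 4, 2, 4]))
```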

The quality issues we focused on include echo, network interruptions such as packet loss, vocal interruptions due to latency, background noise, and dynamic range compression. For each of these categories, we took source clips from (link VCC) and applied the quality issue at several degrees of intensity, calibrating the intensity levels by manually listening to a sample of the processed audio files.
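
As an illustration of one degradation type, the sketch below mixes a noise recording into clean speech at a chosen signal-to-noise ratio, so lower SNR values correspond to a more intense background-noise issue. It is a generic NumPy example that assumes equal-length mono signals, not the exact processing chain we used.

```python
import numpy as np

def mix_noise_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR in dB.
    Lower SNR means a more intense background-noise degradation."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power == 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: three intensity levels applied to stand-in signals.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for 1 s of speech at 16 kHz
noise = rng.standard_normal(16000)   # placeholder for an ambient-noise recording
degraded_versions = {snr: mix_noise_at_snr(speech, noise, snr) for snr in (20, 10, 0)}
```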

The survey worked as follows. Users entered some non-identifying information about their audio setup and were given instructions on how to complete the survey. As they progressed, they were provided with practice audio samples to familiarize them with the various quality issues being presented. Embedded throughout the survey were hidden assessments that tested users’ understanding of the English language, their listening setup, and their level of attention.
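
As a rough sketch of how hidden assessments like these can gate which responses enter a dataset, the example below screens responses using hypothetical field names that do not reflect the survey’s actual schema.

```python
def keep_response(response):
    """Return True if a survey response passes all of the hidden checks.
    The field names here are hypothetical, not the survey's real schema."""
    return (
        response["language_check_passed"]
        and response["listening_setup_check_passed"]
        and response["attention_checks_passed"] == response["attention_checks_total"]
    )

# Hypothetical responses: the second fails the listening-setup check.
responses = [
    {"language_check_passed": True, "listening_setup_check_passed": True,
     "attention_checks_passed": 3, "attention_checks_total": 3},
    {"language_check_passed": True, "listening_setup_check_passed": False,
     "attention_checks_passed": 2, "attention_checks_total": 3},
]
valid_responses = [r for r in responses if keep_response(r)]
```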

To obtain the number of ratings we needed, our team tried two approaches. First, we promoted the survey organically in the hope of reaching a diverse audience across various online communities, posting it on Reddit, LinkedIn, and other social platforms. We obtained over 500 unique responses this way, and this trial also served to validate the survey itself, as we incorporated feedback into its design and technical functionality. While organic crowdsourcing had many benefits from an iterative design standpoint, we knew it wouldn’t get us the entire dataset we needed. For that, we published the survey on Amazon Mechanical Turk. We ran two main batches: one that required workers to be Masters from predominantly English-speaking countries, and one that was more general in scope. With these batches of labeled data, we were able to cross-validate and confirm that the results coming from our general batch were indeed of high quality and accuracy.
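
One simple way to cross-validate two batches like these is to compute a per-clip MOS within each batch and check how well the two sets of scores agree, for example via their Pearson correlation. The sketch below uses made-up clip names and ratings purely to illustrate the idea.

```python
from statistics import mean

def per_clip_mos(ratings_by_clip):
    """Collapse each clip's individual ratings into a MOS."""
    return {clip: mean(scores) for clip, scores in ratings_by_clip.items()}

def pearson(xs, ys):
    """Pearson correlation between two equal-length score sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-clip ratings from the two Mechanical Turk batches.
masters_batch = {"clip_a": [4, 4, 5], "clip_b": [2, 3, 2], "clip_c": [1, 2, 1]}
general_batch = {"clip_a": [4, 5, 4], "clip_b": [3, 3, 2], "clip_c": [2, 1, 1]}

mos_masters = per_clip_mos(masters_batch)
mos_general = per_clip_mos(general_batch)
clips = sorted(mos_masters)
print(pearson([mos_masters[c] for c in clips], [mos_general[c] for c in clips]))
```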

Results

After obtaining over 6,400 unique responses and 166,000 individual audio ratings, we identified several trends in the data. Many of these were expected: the mean opinion score decreases as quality issue intensity increases, and dynamic range compression scores higher on average than network interruptions. What is more interesting is users’ intolerance of several issue types. Echo appears to be rated as unfavorably as background noise, and network interruptions appear to be as annoying as vocal interruptions, perhaps because both can render a conversation incomprehensible. Within background noise, non-stationary noises such as a dog barking or a siren scored noticeably lower than ambient noise. Overall, any of these issues, beyond certain thresholds, can make users extremely annoyed with an audio service or experience. We think our dataset provides a good range above, below, and at such thresholds.
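
As a rough illustration of the aggregation behind trends like these, the sketch below groups individual ratings by issue type and intensity level and averages each group into a MOS; the issue names, levels, and scores are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating records: (issue_type, intensity_level, opinion_score).
ratings = [
    ("background_noise", 1, 4), ("background_noise", 1, 5),
    ("background_noise", 2, 3), ("background_noise", 2, 3),
    ("packet_loss", 1, 4), ("packet_loss", 1, 3),
    ("packet_loss", 2, 2), ("packet_loss", 2, 1),
]

# Group individual scores by (issue type, intensity), then average into a MOS.
scores_by_condition = defaultdict(list)
for issue, intensity, score in ratings:
    scores_by_condition[(issue, intensity)].append(score)

for (issue, intensity), scores in sorted(scores_by_condition.items()):
    print(f"{issue} @ intensity {intensity}: MOS = {mean(scores):.2f}")
```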

More analytics on our dataset can be found here.

Data Analytics

Download Dataset

This dataset contains audio issues that can make a voice call or live stream incomprehensible. These issues can be caused by the network, the codec, the recording environment, and software concealment techniques. The dataset contains speech recordings with a range of degradations applied at various intensities, each with an associated set of roughly 20 subjective opinion scores.

Download .zip