IIT Real-Time Communication, WebRTC, Mobility, VoIP, NG911 Conference & Expo


Real Time Communications Conference & Expo at Illinois Tech

IEEE International Conference


Presentation

Track: VoiceTech
Evaluating Speech Separation Through Pre-trained Deep Neural Network Models
This presentation focuses on speaker separation, which aims to separate individual speakers from a mixture of voices or background noise, commonly known as the "cocktail party problem." The objective is to recover the two original audio signals from their mixture and to analyze which features of the source audio contribute to successful separation. The analysis extracts features from the original data and evaluates their impact on the model's ability to separate mixed audio streams.
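A two-speaker mixture of the kind described above can be sketched as a simple additive combination of two mono waveforms. This is a minimal illustration with toy sample values, not the talk's pipeline, which mixes real VoxCeleb utterances:

```python
# Sketch: forming a two-speaker mixture from two mono waveforms.
# Toy sample values; the study mixes real VoxCeleb utterances.

def mix_signals(a, b, weight=0.5):
    """Additively mix two equal-length lists of samples."""
    assert len(a) == len(b), "pad or trim to equal length first"
    return [weight * x + (1 - weight) * y for x, y in zip(a, b)]

speaker_a = [0.0, 0.5, -0.5, 0.25]   # hypothetical waveform samples
speaker_b = [0.1, -0.1, 0.3, -0.3]
mixture = mix_signals(speaker_a, speaker_b)
```

The separation task is the inverse problem: given only `mixture`, estimate `speaker_a` and `speaker_b`.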

The dataset is prepared so that these feature values serve as predictor variables for several models, including Logistic Regression, Decision Trees, SVM, XGBoost, and AdaBoost. The results are then analyzed to determine which features have the most significant effect on separating the audio streams.
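As a concrete illustration of what per-utterance predictor variables might look like, the sketch below computes two simple audio features in plain Python. These particular features (zero-crossing rate and RMS energy) are illustrative assumptions, not the study's actual feature list:

```python
import math

# Sketch: two simple per-utterance features that could serve as predictor
# variables. Illustrative choices only; the study's feature list is its own.

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for x, y in zip(samples, samples[1:]) if (x < 0) != (y < 0)
    )
    return crossings / max(len(samples) - 1, 1)

def rms_energy(samples):
    """Root-mean-square amplitude of the waveform."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

utterance = [0.2, -0.2, 0.4, -0.4, 0.1]   # toy waveform
features = {
    "zcr": zero_crossing_rate(utterance),
    "rms": rms_energy(utterance),
}
```

Each utterance contributes one such feature vector; the classifiers listed above are then fit on these vectors to predict separation quality.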

The study begins by selecting 400 audio streams from the VoxCeleb dataset and combining them pairwise to form 200 mixed utterances. The pre-trained SpeechBrain model, sepformer-whamr, is used to separate each mixture into two outputs that closely resemble the original sources. A feature list is generated from the 400 chosen audios, and the impact of individual features on the model's ability to distinguish between multiple audio sources in a mixed recording is assessed. Permutation feature importance and SHAP values are used as analysis methods to determine which features have the greatest effect on separation.
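Permutation feature importance, one of the two analysis methods named above, can be sketched in a few lines: shuffle one feature column and measure how much the model's score drops. The "model" here is a trivial threshold rule on a toy dataset, a stand-in for the study's trained classifiers:

```python
import random

# Sketch of permutation feature importance: shuffle one feature column
# and measure the resulting drop in score. The threshold model and toy
# data are illustrative stand-ins, not the talk's classifiers.

def score(model, X, y):
    """Accuracy of `model` (a predict function) on rows X with labels y."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, n_repeats=10, seed=0):
    """Mean score drop when column `feature_idx` is shuffled."""
    rng = random.Random(seed)
    base = score(model, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature_idx] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, column)]
        drops.append(base - score(model, X_perm, y))
    return sum(drops) / n_repeats

# Toy data: the label depends only on feature 0, so shuffling feature 0
# should hurt the score while shuffling the constant feature 1 should not.
X = [[0.1, 5.0], [0.9, 5.0], [0.2, 5.0], [0.8, 5.0]]
y = [0, 1, 0, 1]
model = lambda row: int(row[0] > 0.5)

important = permutation_importance(model, X, y, feature_idx=0)
irrelevant = permutation_importance(model, X, y, feature_idx=1)
```

A larger mean drop marks a feature the model relies on; SHAP values complement this with per-sample attributions rather than a single global score.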

The hypothesis of the study is that the features contributing the most to effective separation are consistent across datasets. To test this hypothesis, 1,000 audio streams are obtained from the Mozilla Common Voice Dataset, and the same experimental methodology is applied. The results demonstrate that the features extracted from the VoxCeleb dataset are indeed invariant and aid in separating the audio streams of the Mozilla Common Voice dataset.
  • Deeksha Prabhakar - Speaker
Presentation Notes
Prabhakar-Evaluating-Speech-Separation-Through-Pre-trained-Deep-Neural-Network-Models2.pptx



© 2012-2013 Illinois Institute of Technology School of Applied Technology 201 East Loop Road, Wheaton IL 60189 630.682.6000
3424 South State, Chicago IL 60616 312.567.5280 Emergency Information

© Copyright 2023 RTC-Conference · All Rights Reserved
