Tutorial Descriptions


The tutorials listed below require online pre-registration. Please visit the registration page for more information on how to register for a tutorial session.


THURSDAY, 8 SEPTEMBER - MORNING TUTORIALS


ATTENDING TO SPEECH AND AUDIO

Thursday, 8 September | 08:30 – 12:00 | Bayview A


Name and affiliation of organizer:

  • Malcolm Slaney, Google Machine Hearing Research, USA


Abstract:

This tutorial brings together a number of topics related to attention and listening effort that are important for speech practitioners and researchers interested in the next generation of speech applications. It will cover applications, auditory saliency models, top-down vs. bottom-up attention, an attentional model of auditory scene analysis, listening effort (which is limited by attention) as a measure of speech quality, the use of attention in ASR, and finally the decoding of attention from EEG, MEG, and ECoG signals.


MACHINE LEARNING FOR SPEAKER RECOGNITION

Thursday, 8 September | 08:30 – 12:00 | Bayview B


Names and affiliations of organizers:

  • Man-Wai Mak, The Hong Kong Polytechnic University, Hong Kong
  • Jen-Tzung Chien, National Chiao Tung University, Taiwan


Abstract:

In this tutorial, we will present state-of-the-art techniques for speaker recognition and related tasks such as speaker diarization. The tutorial will cover the different components of speaker recognition, including front-end feature extraction and back-end modeling and scoring. A range of learning models will be detailed, from GMMs, SVMs, and PLDA to deep neural networks, along with learning algorithms ranging from Bayesian, unsupervised, discriminative, and transfer learning to deep learning. A series of case studies and modern models based on PLDA and DNNs will be addressed. In particular, different variants of deep models and their solutions to different problems in speaker recognition will be presented. In addition, we will point out some new trends in speaker recognition, including model regularization and deep belief networks.
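To make the front-end/back-end split concrete, here is a minimal sketch of the classical GMM-UBM approach (one of the models listed above), written in Python with scikit-learn. All data shapes and names are illustrative, and re-fitting from the UBM means is a crude stand-in for full MAP adaptation:

    # Minimal GMM-UBM speaker verification sketch (illustrative only).
    # Features are assumed to be MFCC-like arrays of shape (n_frames, n_features).
    from sklearn.mixture import GaussianMixture

    def train_ubm(background_features, n_components=64):
        """Fit a universal background model (UBM) on pooled background speech."""
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
        ubm.fit(background_features)
        return ubm

    def enroll(ubm, speaker_features):
        """Derive a speaker model, initialized from the UBM component means
        (a simple stand-in for MAP adaptation)."""
        model = GaussianMixture(n_components=ubm.n_components,
                                covariance_type="diag",
                                means_init=ubm.means_)
        model.fit(speaker_features)
        return model

    def llr_score(ubm, speaker_model, test_features):
        """Average per-frame log-likelihood ratio; higher suggests same speaker."""
        return speaker_model.score(test_features) - ubm.score(test_features)

The PLDA and DNN back-ends covered in the tutorial replace this simple likelihood-ratio scoring, but the enroll-then-score structure stays the same.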


SPOKEN CONTENT RETRIEVAL – BEYOND CASCADING SPEECH RECOGNITION WITH TEXT RETRIEVAL

Thursday, 8 September | 08:30 – 12:00 | Seacliff A


Names and affiliations of organizers:

  • Lin-shan Lee, National Taiwan University, Taiwan
  • Hung-yi Lee, National Taiwan University, Taiwan


Abstract: 

Spoken content retrieval refers to retrieving spoken content directly based on the audio, without relying on text descriptions offered by the content provider. It has been very successful with the basic approach of cascading automatic speech recognition (ASR) with text information retrieval: after the spoken content is transcribed into text or lattice format, a text retrieval engine searches over the ASR output to find the desired information. This framework works well when ASR accuracy is relatively high, but becomes less adequate in more challenging real-world scenarios, since retrieval performance depends heavily on ASR accuracy. This has led to the emergence of another approach to spoken content retrieval: going beyond the basic framework of cascading ASR with text retrieval, so that retrieval performance is less dependent on ASR accuracy. This tutorial is intended to provide an overview of the major technical contributions along this second line of investigation.
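As a concrete illustration of the cascading framework described above, the following minimal Python sketch feeds ASR output into an inverted-index retrieval step. The canned transcripts stand in for a real recognizer's 1-best output; all names and data are hypothetical:

    # Cascade sketch: ASR 1-best transcripts -> inverted index -> keyword search.
    from collections import defaultdict

    def build_index(transcripts):
        """transcripts: dict mapping document id -> ASR 1-best word string."""
        index = defaultdict(set)  # word -> ids of documents containing it
        for doc_id, text in transcripts.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    def search(index, query):
        """Return the documents containing every query word (AND semantics)."""
        hits = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    # Hypothetical ASR output for two spoken documents:
    transcripts = {"talk1.wav": "speech recognition tutorial overview",
                   "talk2.wav": "spoken content retrieval beyond recognition"}
    print(search(build_index(transcripts), "recognition tutorial"))  # {'talk1.wav'}

Because this index only sees the 1-best transcript, every recognition error is unrecoverable at retrieval time; this is precisely the dependence on ASR accuracy that the second line of investigation tries to relax (e.g. by indexing lattices rather than 1-best output).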


RECENT ADVANCES IN DISTANT SPEECH RECOGNITION

Thursday, 8 September | 08:30 – 12:00 | Seacliff BCD


Names and affiliations of organizers:

  • Marc Delcroix, NTT Communication Science Laboratories, Japan
  • Shinji Watanabe, Mitsubishi Electric Research Laboratories, USA


Abstract: 

Automatic speech recognition (ASR) is being deployed more and more successfully in products such as voice search applications for mobile devices. However, recognition remains challenging when the speaker is distant from the microphone, because of noise, attenuation, and reverberation. Research on distant ASR has received increased attention and has progressed rapidly due to 1) the emergence of deep neural network (DNN) based ASR systems, 2) the launch of recent challenges such as the CHiME series, REVERB, ASpIRE, and DIRHA, and 3) the development of new products such as the Microsoft Kinect and the Amazon Echo. This tutorial will review recent progress in distant speech recognition in the DNN era, including single- and multi-channel speech enhancement front-ends and acoustic modeling techniques for robust back-ends. The tutorial will also introduce practical schemes for building distant ASR systems based on the expertise acquired from past challenges.
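As an example of the simplest multi-channel enhancement front-end in this family, the sketch below implements frequency-domain delay-and-sum beamforming in Python/NumPy. The steering delays are assumed to be known (e.g. from a time-difference-of-arrival estimate); this is an illustration, not a production front-end:

    # Delay-and-sum beamforming: align each microphone channel by its steering
    # delay in the frequency domain, then average the aligned channels.
    import numpy as np

    def delay_and_sum(channels, delays, fs):
        """channels: (n_mics, n_samples) array; delays: per-mic delays in seconds."""
        n_mics, n_samples = channels.shape
        freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
        out = np.zeros(len(freqs), dtype=complex)
        for m in range(n_mics):
            spectrum = np.fft.rfft(channels[m])
            # exp(+2j*pi*f*delay) advances the channel to compensate its delay.
            out += spectrum * np.exp(2j * np.pi * freqs * delays[m])
        return np.fft.irfft(out / n_mics, n=n_samples)

More elaborate front-ends of the kind reviewed in the tutorial (e.g. MVDR beamformers driven by DNN-based mask estimation) keep this align-and-combine structure but choose the per-frequency channel weights adaptively.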


THURSDAY, 8 SEPTEMBER - AFTERNOON TUTORIALS


SINGING SYNTHESIS

Thursday, 8 September | 13:00 – 16:30 | Bayview A


Name and affiliation of organizer:

  • Christophe d’Alessandro, LIMSI-CNRS, France


Abstract:

Singing synthesis touches on several areas of speech communication, such as voice quality, vocal emotion and expression, text-to-speech synthesis, voice personality, speech signal modeling and synthesis, voice transformation, and of course music. It is as old as speech synthesis itself, with a number of challenging research questions and several successful applications in the music industry, including movie soundtracks, avant-garde contemporary music, and artificial characters in pop music. This tutorial aims to present: 1) a review of the scientific bases, history, and open questions in singing synthesis research; 2) the state of the art in the design, applications, and evaluation of text-to-chant systems, singing instruments, and singing processing systems.

PUSHING THE FRONTIERS OF SPEECH PROCESSING – WHAT DOES IT TAKE TO TACKLE NEW LANGUAGES AND DOMAINS?

Thursday, 8 September | 13:00 – 16:30 | Bayview B


Names and affiliations of organizers:

  • Samuel Thomas, IBM T.J. Watson Research Center, USA
  • Florian Metze, Carnegie Mellon University, USA
  • Brian Kingsbury, IBM T.J. Watson Research Center, USA
  • Bhuvana Ramabhadran, IBM T.J. Watson Research Center, USA


Abstract:

Although the real-world impact of speech technology has grown significantly in the past few years, especially in mobile applications, this growth has largely been limited to well-studied languages and domains. Speech technologies must become truly universal by being available in the numerous languages and dialects spoken by people across the globe, even under low-resource conditions. The performance of these technologies can also disappoint when they are deployed in domains other than those seen during training, even in well-researched languages. This tutorial reviews technological breakthroughs in building speech processing systems for new languages and domains. It draws on techniques and results from several evaluation campaigns, including MediaEval's QUESST (Spoken Web Search), the IARPA Babel and NIST OpenKWS evaluations, and the IARPA ASpIRE challenge.


HEARING ASSISTIVE TECHNOLOGIES: CHALLENGES AND OPPORTUNITIES

Thursday, 8 September | 13:00 – 16:30 | Seacliff A


Names and affiliations of organizers:

  • Oldooz Hazrati, University of Texas at Dallas, USA
  • Hussnain Ali, University of Texas at Dallas, USA
  • John H.L. Hansen, University of Texas at Dallas, USA
  • James M. Kates, University of Colorado Boulder, USA


Abstract:

This tutorial will provide an overview of hearing assistive devices (e.g. hearing aids and cochlear implants), challenging listening environments (e.g. noise, reverberation, whisper/vocal effort, babble noise), current advancements and technologies, as well as future directions (e.g. naturalistic evaluations, next-generation spaces).


DATA-DRIVEN APPROACHES TO SPEECH ENHANCEMENT AND SEPARATION

Thursday, 8 September | 13:00 – 16:30 | Seacliff BCD


Names and affiliations of organizers:

  • Jonathan Le Roux, Mitsubishi Electric Research Labs (MERL), USA
  • Emmanuel Vincent, Inria, France
  • Hakan Erdogan, Sabanci University, Turkey


Abstract:

Being able to isolate a target speech signal from background signals is of direct importance for telephony, hands-free communication, and audio surveillance, and it is also critical as a pre-processing step in applications such as voice activity detection, automatic speaker identification, and most importantly automatic speech recognition (ASR) in challenging environments. While speech enhancement and separation methods originally did not rely on training, there has recently been an explosion in the use of machine-learning-based methods that exploit large amounts of training data. This tutorial will present a broad overview of these methods, analyzing the insights that can be gained from the pre-deep-learning era of graphical modeling and NMF approaches, then diving into an in-depth presentation of recent deep learning approaches, encompassing single-channel methods, multi-channel methods, and new directions.
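To make the masking idea behind many of these data-driven methods concrete, here is a minimal single-channel sketch in Python with SciPy. It computes the ideal ratio mask (IRM), a common training target for mask-estimation networks, from parallel clean and noise signals, and applies a mask to the noisy mixture; in a real system, a trained network would predict the mask from the noisy input alone. All parameters are illustrative:

    # Time-frequency masking sketch: IRM training target + mask application.
    import numpy as np
    from scipy.signal import stft, istft

    def ideal_ratio_mask(clean, noise, fs=16000, nperseg=512):
        """IRM = |S| / (|S| + |N|) per time-frequency bin (a training target)."""
        _, _, S = stft(clean, fs=fs, nperseg=nperseg)
        _, _, N = stft(noise, fs=fs, nperseg=nperseg)
        return np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

    def enhance(noisy, mask, fs=16000, nperseg=512):
        """Apply a (predicted or oracle) mask to the noisy STFT and resynthesize."""
        _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
        _, enhanced = istft(mask * Y, fs=fs, nperseg=nperseg)
        return enhanced

Training a network to map noisy spectral features to such a mask, with a mean-squared-error or mask-approximation loss, is one of the standard single-channel recipes in this line of work.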