Title: State-of-the-Art of End-to-End Speech Recognition
Over the last several years, there has been growing interest in developing end-to-end (E2E) automatic speech recognition (ASR) systems. E2E ASR is characterized by eliminating the construction of GMM-HMMs and phonetic decision trees, training deep neural networks directly, and, going further, removing the need for pronunciation lexicons and training acoustic and language models jointly rather than separately. Examples of such models include Connectionist Temporal Classification (CTC), attention-based encoder-decoder (AED) models, and the RNN transducer (RNNT).
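To make the first of these concrete, the following is a minimal sketch of the CTC forward (alpha) recursion, which sums the probabilities of all frame-level alignments that collapse to a given label sequence. The per-frame posteriors here are toy numbers standing in for a neural network's outputs, and the function name `ctc_forward` is illustrative rather than taken from any toolkit.

```python
def ctc_forward(probs, labels, blank=0):
    """Return P(labels | probs), summed over all CTC alignments.

    probs: list of per-frame probability distributions over the vocabulary
    labels: target label sequence (without blanks)
    """
    # Interleave blanks: y -> (blank, y1, blank, y2, ..., blank)
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of paths ending in extended state s
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                      # stay in the same state
            if s > 0:
                a += alpha[s - 1]             # advance by one state
            # Skip a blank only between two distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or the final blank
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)

# Toy example: 3 frames, vocabulary {blank, 'a', 'b'}, target "ab"
probs = [
    [0.1, 0.8, 0.1],  # frame 1 favors 'a'
    [0.6, 0.2, 0.2],  # frame 2 favors blank
    [0.1, 0.1, 0.8],  # frame 3 favors 'b'
]
print(f"P(ab | x) = {ctc_forward(probs, [1, 2]):.4f}")  # -> 0.6720
```

A real implementation would work in the log domain for numerical stability; the structure of the recursion is otherwise the same as above.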
The purpose of this tutorial is to present a systematic introduction to the state of the art in E2E ASR. First, we will introduce basic E2E ASR methods within the probabilistic graphical modeling (PGM) framework, separating neural network architectures from probabilistic model definitions, with an emphasis on comparing and connecting the different E2E ASR models that have been considered in the literature. Then, we will present a number of advanced topics for improving E2E ASR, including data-efficient training, low-latency streaming recognition, neural architecture search, multilingual and crosslingual ASR, and contextual biasing. Finally, the tutorial will point out some open questions about existing E2E ASR methods and discuss future directions to address these challenges. In addition, we will introduce open-source toolkits such as CAT (https://github.com/thu-spmi/CAT) to help the audience become familiar with state-of-the-art E2E ASR techniques.
● Basics for end-to-end speech recognition
○ Probabilistic graphical modeling (PGM) framework
○ Classic hybrid DNN-HMM models
○ Connectionist Temporal Classification (CTC)
○ Attention-based encoder-decoder (AED)
○ RNN transducer (RNNT)
○ Conditional random fields and sequence discriminative training
● Improving end-to-end speech recognition
○ Low-latency streaming recognition
○ Neural architecture search
○ Multilingual and crosslingual ASR
○ Contextual biasing
● Open questions and future directions
Associate Professor, Tsinghua University, China
Zhijian Ou received his Ph.D. from Tsinghua University in 2003. Since then, he has been with the Department of Electronic Engineering at Tsinghua University, where he is currently an associate professor. From August 2014 to July 2015, he was a visiting scholar at the Beckman Institute, University of Illinois at Urbana-Champaign, USA. He has led national research projects and received research funding from Intel, Panasonic, IBM, Toshiba, and Apple. He currently serves as an associate editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing and as a member of the IEEE Speech and Language Processing Technical Committee, and was General Chair of SLT 2021 and Tutorial Chair of INTERSPEECH 2020. His research interests are speech and language processing (particularly speech recognition and dialog systems) and machine intelligence (particularly probabilistic graphical models and deep learning).