Speech Recognition Using Deep Neural Networks: A Systematic Review

Speech Recognition: Recently Published Documents


Improving Deep Learning based Automatic Speech Recognition for Gujarati

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that combines Convolutional Neural Network layers, Bidirectional Long Short-Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve performance given the limited size of the dataset, we present a prefix decoding technique based on a combined language model (a word-level and a character-level language model) and a post-processing technique based on Bidirectional Encoder Representations from Transformers. To gain key insights into our Automatic Speech Recognition (ASR) system, we analyze its inferences with several proposed analysis methods. These insights help us understand and improve the ASR system and also provide intuition about the language being recognized. We trained the model on the Microsoft Speech Corpus and observe a 5.87% reduction in Word Error Rate (WER) relative to the base model.
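
To make the architecture in this abstract concrete, here is a minimal PyTorch sketch of a CNN + bidirectional-LSTM + dense stack trained with CTC loss. The layer sizes, the 60-character inventory, and the dummy batch are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    def __init__(self, n_mels=80, n_chars=60, hidden=256):
        super().__init__()
        # 2-D convolution over (time, frequency) of the input spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # bidirectional LSTM layers over the downsampled time axis
        self.rnn = nn.LSTM(32 * (n_mels // 2), hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # dense layer mapping to the characters plus the CTC blank symbol
        self.fc = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, spec):                   # spec: (batch, time, n_mels)
        x = self.conv(spec.unsqueeze(1))       # (batch, 32, time/2, n_mels/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)  # per-frame log-probabilities

model = CTCAcousticModel()
ctc_loss = nn.CTCLoss(blank=60, zero_infinity=True)  # blank = last class index

spec = torch.randn(4, 200, 80)                # dummy batch of 4 utterances
log_probs = model(spec)                       # (batch, frames, n_chars + 1)
targets = torch.randint(0, 60, (4, 30))       # dummy character-index targets
loss = ctc_loss(log_probs.transpose(0, 1),    # CTC expects (frames, batch, classes)
                targets,
                torch.full((4,), log_probs.size(1), dtype=torch.long),
                torch.full((4,), 30, dtype=torch.long))
loss.backward()
```

In a full system of the kind the abstract describes, the per-frame log-probabilities would then be passed to a prefix decoder that consults the combined word- and character-level language models.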

Dereverberation of autoregressive envelopes for far-field speech recognition

A computational look at oral history archives

Computational technologies have revolutionized the archival sciences field, prompting new approaches to process the extensive data in these collections. Automatic speech recognition and natural language processing create unique possibilities for the analysis of oral history (OH) interviews, where otherwise transcribing and analyzing the full recordings would be too time-consuming. However, many oral historians note the loss of aural information when converting speech into text, pointing out the relevance of subjective cues for a full understanding of the interviewee's narrative. In this article, we explore various computational technologies for social signal processing and their potential application space in OH archives, as well as neighboring domains where qualitative studies are a frequently used method. We also highlight the latest developments in key technologies for multimedia archiving practices such as natural language processing and automatic speech recognition. We discuss the analysis of both visual cues (body language and facial expressions) and non-visual cues (paralinguistics, breathing, and heart rate), noting the specific challenges introduced by the characteristics of OH collections. We argue that applying social signal processing to OH archives will have a wider influence than solely on OH practices, bringing benefits to various fields from the humanities to computer science, as well as to the archival sciences. Looking at human emotions and somatic reactions in extensive interview collections would give scholars from multiple fields the opportunity to study feelings, mood, culture, and subjective experiences expressed in these interviews on a larger scale.

Contribution of frequency compressed temporal fine structure cues to the speech recognition in noise: An implication in cochlear implant signal processing

Supplemental Material for Song Properties and Familiarity Affect Speech Recognition in Musical Noise

Noise-Robust Speech Recognition in Mobile Network Based on Convolution Neural Networks

Optical Laser Microphone for Human-Robot Interaction: Speech Recognition in Extremely Noisy Service Environments

Improving Low-Resource Tibetan End-to-End ASR by Multilingual and Multilevel Unit Modeling

Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition have achieved promising results when provided with sufficient resources. For low-resource languages, however, ASR remains challenging. The Lhasa dialect is the most widespread Tibetan dialect and has a wealth of speakers and transcriptions, so applying ASR to it is meaningful for historical heritage protection and cultural exchange. Previous work on Tibetan speech recognition focused on selecting phone-level acoustic modeling units and incorporating tonal information but underestimated the influence of limited data. The purpose of this paper is to improve speech recognition performance for the low-resource Lhasa dialect by adopting multilingual speech recognition technology on an E2E structure within a transfer learning framework. Using transfer learning, we first establish monolingual E2E ASR systems for the Lhasa dialect initialized from different source languages to compare the positive effect of each source language on the Tibetan ASR model. We further propose a multilingual E2E ASR system that combines initialization strategies with different source languages and multilevel units, which is proposed for the first time. Our experiments show that the proposed ASR systems outperform the E2E baseline system. The proposed method effectively models the low-resource Lhasa dialect and achieves a 14.2% relative improvement in character error rate (CER) over DNN-HMM systems. Moreover, from the best monolingual E2E model to the best multilingual E2E model of the Lhasa dialect, the system's performance improved by a further 8.4% in CER.
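
The transfer-learning initialization described above can be sketched roughly as follows: copy parameters from a source-language E2E checkpoint wherever names and shapes match, and re-initialize the output layer, which depends on the target unit inventory. The `output_layer` prefix, the checkpoint and builder names, and the use of PyTorch are assumptions for illustration, not the authors' recipe.

```python
import torch
import torch.nn as nn

def init_from_source(target_model: nn.Module, source_ckpt_path: str,
                     skip_prefixes=("output_layer",)):
    """Copy every source parameter whose name and shape match the target model,
    skipping layers (here the output projection) tied to the source unit set."""
    source_state = torch.load(source_ckpt_path, map_location="cpu")
    target_state = target_model.state_dict()
    copied = 0
    for name, tensor in source_state.items():
        if name.startswith(skip_prefixes):
            continue  # output layer depends on the target unit inventory
        if name in target_state and target_state[name].shape == tensor.shape:
            target_state[name] = tensor
            copied += 1
    target_model.load_state_dict(target_state)
    return copied

# usage sketch (hypothetical names):
# lhasa_model = build_e2e_model(n_units=n_lhasa_units)
# init_from_source(lhasa_model, "source_language_e2e.pt")
```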

Background Speech Synchronous Recognition Method of E-commerce Platform Based on Hidden Markov Model

To improve background speech recognition on e-commerce platforms and address the vulnerability of traditional methods to sudden noise, which degrades recognition, this paper proposes a background speech synchronous recognition method based on a Hidden Markov model. Speech features are collected following standard speech recognition principles, and the Hidden Markov model is combined with a high-fidelity speech filter to keep the signal processing results reliable. The platform's background speech is denoised, the speech signal is cached and buffered, and the recognized speech sequence is transferred over an Ethernet interface, thereby achieving background speech synchronization and improving recognition accuracy. Experimental results show that the proposed Hidden Markov model-based method outperforms traditional methods.
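
For context, the core decoding computation in a Hidden-Markov-model recognizer of this kind is Viterbi search over hidden states. The toy NumPy sketch below uses made-up transition and emission probabilities and is purely illustrative; it is not code from the paper.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,);
    A: transition probs (S, S); B: emission probs (S, V).
    Returns the most likely hidden state path."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob ending in state s at time t
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (prev state, next state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # trace backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# toy 2-state, 3-symbol model with invented probabilities
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))
```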

Massive Speech Recognition Resource Scheduling System based on Grid Computing

Nowadays there are a great many large-scale speech recognition resources, which makes it difficult to guarantee scheduling speed and accuracy. To improve large-scale speech recognition resource scheduling, this paper designs a scheduling system based on grid computing. The hardware part comprises a microprocessor, an Ethernet control chip, a controller, and an acquisition card. The software part mainly performs retrieval and exchange of information resources, thereby realizing information scheduling for large-scale speech recognition resources of the same type. Experimental results show that the designed system schedules information quickly, within at most 2.4 min, and accurately, with up to 90% scheduling accuracy, which helps to effectively improve the speed and accuracy of information scheduling.

Initial decoding with minimally augmented language model for improved lattice rescoring in low resource ASR

  • Published: 21 May 2024
  • Volume 49, article number 183 (2024)

  • Savitha Murthy, ORCID: orcid.org/0000-0001-7560-0420
  • Dinkar Sitaram

Automatic speech recognition systems for low-resource languages typically have smaller corpora on which the language model is trained. Decoding with such a language model leads to a high word error rate because of the large number of out-of-vocabulary words in the test data. Larger language models can be used to rescore the lattices generated during initial decoding, but this approach gives only a marginal improvement. Decoding directly with a larger augmented language model, though helpful, is memory intensive and not feasible in a low-resource system setup.

The objective of our research is to perform initial decoding with a minimally augmented language model and then rescore the resulting lattices with a larger language model. We thereby obtain a significant error reduction for the low-resource Indic languages Kannada and Telugu. This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages where the baseline language model is not sufficient for generating inclusive lattices. We minimally augment the baseline language model with unigram counts of words that are present in a larger text corpus of the target language but absent from the baseline. The lattices generated after decoding with this minimally augmented baseline language model are more comprehensive for rescoring.

We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method. This is comparable to the 21.5% (Telugu) and 45.9% (Kannada) relative word error reduction obtained by decoding with a language model augmented with the full Wikipedia text, while our approach consumes only one-eighth of the memory. We demonstrate that our method is comparable with various text-selection-based language model augmentation techniques and is consistent across data sets of different sizes. Our approach is applicable for training speech recognition systems under low-resource conditions where speech data and compute resources are insufficient but a large text corpus is available in the target language. Our research addresses out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple yet computationally inexpensive.
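
A rough sketch of the "minimal augmentation" step as we read it from this abstract: collect the words that occur in the large target-language text corpus but are absent from the baseline language-model vocabulary, together with their unigram counts, so they can be added to the baseline LM before first-pass decoding. The romanized toy data and the helper name are ours; the actual system would operate on the full baseline LM and corpus.

```python
from collections import Counter

def minimal_augmentation(baseline_vocab, large_corpus_lines):
    """Unigram counts of words seen in the large corpus but missing
    from the baseline language-model vocabulary."""
    vocab = set(baseline_vocab)
    oov_counts = Counter()
    for line in large_corpus_lines:
        for word in line.split():
            if word not in vocab:
                oov_counts[word] += 1
    return oov_counts

# toy, romanized example; real input would be the baseline LM word list
# and the larger target-language text corpus (e.g. Wikipedia text)
baseline_vocab = ["<s>", "</s>", "namaskara", "ondu"]
corpus = ["namaskara bengaluru", "bengaluru kannada ondu"]
print(minimal_augmentation(baseline_vocab, corpus))
# Counter({'bengaluru': 2, 'kannada': 1}) -> added as unigrams before first-pass decoding
```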


Acknowledgements

We thank Dr K.V. Subramanian, Head, Center for Cloud Computing and Big Data, PES University, Bangalore, for all the support.

Author information

Authors and Affiliations

Department of CSE, P.E.S. University, Bengaluru, India

Savitha Murthy

Cloud Computing Innovation Council of India, Bengaluru, India

Dinkar Sitaram

Corresponding author

Correspondence to Savitha Murthy.

About this article

Murthy, S., Sitaram, D. Initial decoding with minimally augmented language model for improved lattice rescoring in low resource ASR. Sādhanā 49, 183 (2024). https://doi.org/10.1007/s12046-024-02520-0

Received: 01 January 2022

Revised: 01 August 2023

Accepted: 12 February 2024

Published: 21 May 2024

DOI: https://doi.org/10.1007/s12046-024-02520-0


  • Indic languages
  • Telugu and Kannada ASR
  • low resource
  • out of vocabulary
  • language model augmentation
  • automatic speech recognition

Sean Parker Institute for the Voice

Sean Parker Laryngologists Present on Artificial Intelligence, Vocal Injury and Return to Performance After Hemorrhage at American Laryngological Association Spring Meeting


The American Laryngological Association's Spring Meeting, held May 16-18 in Chicago, saw research teams from the Sean Parker Institute for the Voice present several papers on developing topics in the use of artificial intelligence in the field. These included investigations on adapting speech recognition systems to better understand deaf and hard-of-hearing speech, AI recognition of vocal fold pathologies on examination, and the development of a model to screen for health conditions using voice alone. The leader of the Sean Parker Institute's effort in AI, Anais Rameau, MD, MPhil, also moderated a panel on the topic of artificial intelligence. Sean Parker Fellow Christine Clark presented on differences between unilateral and bilateral vocal fold pseudocysts, lesions resulting from the physical stresses that voice use places on the vocal folds. Her investigation showed that both lesion types result from imperfect glottic closure: in unilateral lesions the closure is impaired by nerve damage, whereas in bilateral lesions it is the result of normal female laryngeal anatomy. Another investigation documented patients' overwhelmingly successful return to performance after vocal fold hemorrhage.


Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

Abstract: There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, so there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address these emerging issues, the DEEP-VOICE dataset is generated in this study, comprising real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Framing the task as a binary classification problem of whether the speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals significantly different distributions. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross-validation, it is found that the Extreme Gradient Boosting model can achieve an average classification accuracy of 99.3% and can classify speech in real time, at around 0.004 milliseconds given one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
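
As a hedged sketch of the evaluation setup this abstract describes (per-window audio features, a boosted-tree classifier, 10-fold cross-validation), the snippet below uses random features in place of the DEEP-VOICE data and scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost model named in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))     # placeholder: 26 temporal/spectral features per 1 s window
y = rng.integers(0, 2, size=500)   # placeholder labels: 0 = real speech, 1 = AI-generated

clf = GradientBoostingClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With real feature vectors extracted from one-second audio windows, the same cross-validation loop would reproduce the kind of per-fold accuracy comparison the study reports.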



SOURCES

  1. Speech Recognition Using Deep Neural Networks: A Systematic Review

    Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the past few years, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results when compared to others in a variety of ...

  2. A comprehensive survey on automatic speech recognition using neural

    Table 2 shows the 148 research papers shortlisted for review in this study. Primary studies containing major terms such as speech recognition along with Deep Neural Network, Recurrent Neural Network, Convolution Neural Network, Long Short-term Memory, Denoising, Neural Network, Deep Learning, Transfer Learning, and End-to-End ASR have been included.

  3. Trends and developments in automatic speech recognition research

    This paper discusses how automatic speech recognition systems are and could be designed, in order to best exploit the discriminative information encoded in human speech. This contrasts with many recent machine learning approaches that apply general recognition architectures to signals to identify, with little concern for the nature of the input.

  4. Deep Speech: Scaling up end-to-end speech recognition

    We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model ...

  5. Speech Recognition Using Deep Neural Networks: A Systematic Review

    ABSTRACT Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the ...

  6. A review on speaker recognition: Technology and challenges

    Although researchers have been working on speaker recognition in the last eight decades, advancements in technology, such as the Internet of Things (IoT), smart devices, voice assistants, smart homes, and humanoids, have made its usage nowadays trendy. This paper provides a comprehensive review of the literature on speaker recognition.

  7. SPEECH RECOGNITION SYSTEMS

    Speech Recognition is a technology with the help of which a machine can acknowledge the spoken words and phrases, which can further be used to generate text. Speech Recognition System works ...

  8. (PDF) VOICE RECOGNITION USING DEEP LEARNING

    Abstract. Deep learning has significantly improved results for visual-based speech recognition. However, the majority of the research works focus on improving only the training model's ...

  9. speech recognition Latest Research Papers

    Conventional automatic speech recognition (ASR) and emerging end-to-end (E2E) speech recognition have achieved promising results after being provided with sufficient resources. However, for low-resource languages, current ASR is still challenging. The Lhasa dialect is the most widespread Tibetan dialect and has a wealth of speakers ...

  10. Speech Recognition by Machine: A Review

    This paper presents a brief survey on Automatic Speech Recognition and discusses the major themes and advances made in the past 60 years of research, so as to ... 1.2 Basic Model of Speech Recognition: Research in speech processing and communication, for the most part, was motivated by people's desire to build mechanical models to emulate human ...

  11. PDF Robust Speech Recognition via Large-Scale Weak Supervision

    pre-training has been underappreciated so far for speech recognition. We achieve these results without the need for the self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work. To serve as a foundation for further research on robust speech recognition, we release inference code and ...

  12. Robust Speech Recognition via Large-Scale Weak Supervision

    We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine ...

  13. A Novel Voice Recognition System with Artificial Intelligence

    In this paper, we highlight a number of problems that arise while building speech recognition technology and propose several research and development directions we have undertaken to solve them. M. Al- focused in, Design of an Intelligent Home Assistant, published in 2006. An intelligent system for home which will control home appliances on a voice ...

  14. Speech emotion recognition using machine learning

    Speech emotion recognition (SER) as a Machine Learning (ML) problem continues to garner a significant amount of research interest, especially in the affective computing domain. This is due to its increasing potential, algorithmic advancements, and applications in real-world scenarios. Human speech contains para-linguistic information that can ...

  15. A voice-based real-time emotion detection technique using recurrent

    The advancements of the Internet of Things (IoT) and voice-based multimedia applications have resulted in the generation of big data consisting of patterns, trends and associations capturing and representing many features of human behaviour. The latent representations of many aspects and the basis of human behaviour is naturally embedded within the expression of emotions found in human speech ...

  16. (PDF) A New Human Voice Recognition System

    A New Human Voice Recognition System. Pala Mahesh Kumar. SAK Informatics Pvt. Ltd., Andhra Pradesh, India. Email: [email protected]. Abstract - In an effort to provide a more efficient ...

  17. Enhancing Air Traffic Control Planning with Automatic Speech Recognition

    The research dataset, consisting of 20 hours of transcribed planning teleconferences, forms the foundation for fine-tuning and validating the Whisper model. ... this paper presents a comprehensive exploration of the application of automatic speech recognition in Air Traffic Control System Command Center planning teleconferences, leveraging the ...

  18. Initial decoding with minimally augmented language model for ...

    There has been a lot of interest in Automatic Speech Recognition (ASR) for low-resource languages for more than a decade [1,2,3]. Low resource indicates a scarcity of any of the language resources required to train traditional ASR, namely resources such as pronunciation dictionaries, text data to train a Language Model (LM), or audio data with corresponding transcriptions.

  19. Sean Parker Laryngologists Present on Artificial Intelligence, Vocal

    The American Laryngological Association's Spring Meeting, held May 16-18 in Chicago, saw research teams from the Sean Parker Institute for the Voice present several papers on developing topics in the use of artificial intelligence in the field. These included investigations on the adaptation of speech recognition systems to better understand deaf & hard of hearing speech, AI

  20. Title: Real-time Detection of AI-Generated Speech for DeepFake Voice

    There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address ...

  21. (PDF) VOICE RECOGNITION SYSTEM: SPEECH-TO-TEXT

    Abstract: VOICE RECOGNITION SYSTEM: SPEECH-TO-TEXT is a software that lets the user control computer functions and dictate text by voice. The system consists of two components, first co ...

  22. The Application of Speech Recognition in Education and Teaching

    The Application of Speech Recognition in Education and Teaching. Jiahao Mao, Yanhan Wang, +2 authors. Yibo Ding. Published in Proceedings of the 4th… 2023. Education, Computer Science. Proceedings of the 4th International Conference on Modern Education and Information Management, ICMEIM 2023, September 8-10, 2023, Wuhan, China.