Projects

Speech Tokenizer & Pronunciation Correction for Llama 4 (2025)

The speech tokenizer is a fundamental module for an audio LLM to understand and generate speech. Through rigorous ablations and evaluations on the architecture, loss functions, and training data, I developed a new speech tokenizer for Llama 4, which beat the previous version in reconstruction, understanding, as well as robustness:

Mel-cepstral distortion (MCD) reduced by 36% relative;
Word error rate (WER) reduced by 8.5% relative;
Token consistency in different contexts increased from 67% to 92%.

Besides the tokenizer, I also worked on reducing mispronunciations of Llama 4. I designed a recipe of post-training direct preference optimization (DPO), which improved the pronunciation accuracy of hard words from 35% to 71%. I also mentored an intern to integrate phonetic representations into the pre-training for better pronunciation control.

As I worked on the Llama 4 project, I repeatedly realized the importance of robust evaluation. An evaluation suite with low-variance metrics is indispensable for choosing among candidate models, and for guiding the direction of model iteration.

Autoregressive Diffusion Acoustic Model for Next-Generation TTS (2024)

Diffusion models as acoustic models for TTS can produce much more natural-sounding speech, because they can better model the distribution of acoustic features through multiple denoising steps, and overcome the "oversmoothing" problem of traditional models. However, diffusion models usually run denoising on an entire utterance, making it incompatible with streaming inference.

In this project, I combined diffusion with autogressive modeling in a way similar to "rolling diffusion". During inference, a "transition window" is maintained, in which the signal changes gradually from clean to noisy. As this window moves forward, frames in the window get denoised conditioning on both neighboring noisy frames and clean frames in the past. Such an "autoregressive model" enjoys the high naturalness of diffusion models as well as the low latency of autoregressive models.

To further preserve the target voice's timbre and improve the naturalness, I employed techniques such as global conditioning, in-context learning, flow matching, and classifier-free guidance (CFG).

The autoregressive diffusion acoustic model was deployed as part of the "Next-Gen" TTS system, both on the TTS server and on Ray-Ban Meta smart glasses.

Multilingual & Codeswitching TTS (2022 - 2023)

I led the cross-functional effort to build a multilingual and codeswitching TTS system, which involved engineers, interns, linguists, and product managers. The goal was to make every supported voice speaker five languages (English, German, Spanish, French, Italian), and switch between languages seamlessly within in sentence.

I overhauled almost every module in the TTS pipeline. On the frontend side, I:

Optimized the word-level language ID module to detect codeswitching in the text, boosting the recall from 76% to 93%;
Designed a single neural network for joint text normalization, POS tagging, and homograph disambiguation, improving the accuracy and reducing the latency at the same time;
Built a multilingual G2P transformer to replace existing locale-specific joint sequence models (JSMs), improving the word accuracy by a maximum of 31% while reducing the total model size by 73%.

On the backend side, I:

Ran voice conversion to generate multilingual and codeswitching training data for all voices;
Retrained prosody and acoustic models for native-sounding speech while preserving timbre.

The resultant TTS system was deployed both on Meta's TTS server and on Ray-Ban Meta smart glasses.

Publication: [ICASSP24].

Acoustic Event Detection (2019 - 2021)

I had the fortune to continue to work on acoustic event detection (AED) after joining Meta. As the sole owner of the AED model at Meta, I refined the AED model I built at CMU, and productionized it in an industrial setting.

The most remarkable improvement to the model was achieved with teacher-student learning. Together with an intern, we curated an unlabeled dataset of acoustic events that was twice the size of Google's AudioSet. Using pseudo-labels generated with a teacher model, I was able to train a student model whose performance surpassed the teacher model. Combined with other techniques such as data augmentation, I was able to improve the mean average precision (MAP) from 0.354 to 0.430.

Upgrading the model in production was also quite an effort, as the model supported 40 direct clients and hundreds of indirect clients. I needed to ensure the upgrade did not interrupt any of the downstream workflows, and brought improvements to the key applications. By designing a mechanism to allow access to both versions of the model in the interim, and proactively coordinating experiments with the client teams to demonstrate improvements, I was able to complete the upgrade in about half a year.

Even today, the AED model is running on all Facebook and Instagram videos, and supporting critical applications such as video understanding, ads, and integrity.

Main publications: [Interspeech21a], [ICASSP22].

Other publications: [IJCNN21], [2110.03174], [Interspeech22], [ICASSP23].

Sound Event Detection with Weak Labeling (05/2015 - 10/2018)

Sound event detection (SED) is the task of classifying and localizing semantically meaningful units of sounds, such as car engine noise and dog barks, in audio streams. Because it is expensive to obtain strong labeling that specifies the onset and offset times of each event occurrence, I focused on how to train SED systems with weak labeling in which the temporal information is incomplete. I dealt with two types of weak labeling: (1) for presence/absence labeling which provides no temporal information at all, I built SED systems using the multiple instance learning (MIL) framework and compared many types of pooling functions; (2) for sequential labeling which specifies the order of event boundaries, I adapted the connectionist temporal classification (CTC) framework for speech recognition and used it for SED.

In the first year of this project, I also worked on multimedia event detection, which aims to understand what is the high-level event happening in a video or audio recording, such as a soccer game or a birthday party.

Main publications: [ICASSP16], [ICASSP17], [Interspeech17], [Interspeech18a], [ICASSP19a], [ICASSP19b].

Other publications: [ICMR16], [ICASSP18], [Interspeech18b], [ESWA19].

PhD thesis proposal: [CMU-proposal].

PhD thesis: [CMU-thesis].

Keyword Search in Low-Resource Languages (03/2012 - 12/2014)

I participated in the IARPA Babel project, whose goal was to build speech recognizers and keyword detection systems for low-resource languages such as Pashto and Tagalog. I rewrote the code to generate confusion networks (which can be regarded as indices for keyword detection) from lattices (which are the output of speech recognizers), making it shorter, faster and better than before. I also maintained the keyword detection toolkit, and built web-based error analysis tools for the Radical team, of which CMU was a part.

Publications: [Interspeech14a], [Interspeech14b], [SLT14], [ICASSP15].

Robust Open-Set Speaker Identification (10/2010 - 02/2012)

I built an open-set speaker identification system. To deal with noise and channel mismatches, I explored an extensive variety of acoustic features. I also compared several speaker modeling techniques, including GMM-UBM (Gaussian mixture models - universal background model), GMM-SVM (support vector machines using GMM supervectors), and JFA (joint factor analysis, a precursor to the i-vector technique which soon became the mainstream).

Technical report: [IROSIS].

Separating Singing Voice and Accompaniment from Monaural Audio (02/2010 - 06/2010)

I built a system to separate the singing voice and the accompaniment from real-world music files. It first extracts the melody using a hidden Markov model (HMM) and features based on harmonic summation, then separates the singing voice and accompaniment using non-negative matrix factorization (NMF). Evaluation on two public databases shows that the system reaches state-of-the-art performance. The separated accompaniment has a quality high enough to be used in singing performances.

This project is open-source on GitHub, and a demo of some separation results can be found here.

Publications: [THU-thesis], [ICASSP11a].

Bit Vector Indexing for Query by Humming (09/2007 - 06/2008)

Query by humming (QBH) is a form of music retrieval, where users can search for music by humming a piece of melody. It is supported by websites such as SoundHound. In QBH systems, music clips are represented by their "fingerprints", which are points in a high-dimensional space, and bit vector indices (BVI) can be used to speed up the search in this space. I built a bit vector indexing system which ran 60 times faster than a naïve one-by-one comparison.

Japanese Vocal Music Synthesis (01/2007 - 02/2007)

I built a simple system to synthesize vocal music in Japanese, by adjusting the pitch and duration of pre-recorded syllables according to the music score. I did this before even learning about the Fourier transform. You can listen to some synthesized songs below:

桜色の季節 (The Cherry-Colored Season), an original song by me;
春の呼び声 (The Call of Spring), an original song by me;
胸がドキドキ (A Pounding Heart), the first opening song of Detective Conan.

This project is open-source on GitHub.

Publications

[ICASSP24]	Wonjune Kang, Yun Wang, Shun Zhang, Arthur Hinsvark, Qing He, "Multi-task learning for front-end text processing in TTS", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10796-10800, Apr. 2024. (also available on arXiv)
[ICASSP23]	Yuanbo Hou, Yun Wang, Wenwu Wang, Dick Botteldooren, "GCT: Gated contextual transformer for sequential audio tagging", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023.
[2210.16045]	Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He, "Towards zero-shot text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders", arXiv:2210.16045, Oct. 2022.
[Interspeech22]	Yuanbo Hou, Zhaoyi Liu, Bo Kang, Yun Wang, Dick Botteldooren, "CT-SAT: Contextual transformer for sequential audio tagging", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 4147-4151, Sep. 2022. (also available on arXiv)
[ICASSP22]	Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf, "Conformer-based self-supervised learning for non-speech audio tasks", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8862-8866, May 2022. (also available on arXiv)
[2110.03174]	Dawei Liang, Yangyang Shi, Yun Wang, Nayan Singhal, Alex Xiao, Jonathan Shaw, Edison Thomaz, Ozlem Kalinli, Mike Seltzer, "Transferring voice knowledge for acoustic event detection: An empirical study", arXiv:2110.03174, Oct. 2021.
[Interspeech21a]	Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, and Christian Fuegen, "Do sound event representations generalize to other audio tasks? A case study in audio transfer learning", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1214-1218, Aug.-Sep. 2021. (also available on arXiv)
[Interspeech21b]	Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen, "A two-stage approach to speech bandwidth extension", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1689-1693, Aug.-Sep. 2021.
[IJCNN21]	Yuzhuo Liu, Hangting Chen, Yun Wang, and Pengyuan Zhang, "Power pooling: An adaptive pooling function for weakly labelled sound event detection", in Proceedings of the International Joint Conference on Neural Networks (IJCNN), Jul. 2021. (also available on arXiv)
[ICASSP21]	Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen, "A time-domain convolutional recurrent network for packet loss concealment", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7133-7137, Jun. 2021.
[SLT21]	Ashutosh Pandey, Chunxi Liu, Yun Wang, and Yatharth Saraf, "Dual application of speech enhancement for automatic speech recognition", in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 223-228, Jan. 2021. (also available on arXiv)
[ICASSP19a]	Yun Wang, Juncheng Li, and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31-35, May 2019. (also available on arXiv)
[ICASSP19b]	Yun Wang and Florian Metze, "Connectionist temporal localization for sound event detection with sequential labeling", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 745-749, May 2019. (also available on arXiv)
[ESWA19]	Pierre Laffitte, Yun Wang, David Sodoyer, and Laurent Girin, "Assessing the performances of different neural network architectures for the detection of screams and shouts in public transportation", in Expert Systems With Applications, vol. 117, pp. 29-41, Mar. 2019.
[CMU-thesis]	Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis, Carnegie Mellon University, Oct. 2018.
[Interspeech18a]	Yun Wang, Juncheng Li, and Florian Metze, "Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1339-1343, Sep. 2018. (also available on arXiv)
[Interspeech18b]	Shao-Yen Tseng, Juncheng Li, Yun Wang, Florian Metze, Joseph Szurley, and Samarjit Das, "Multiple instance deep learning for weakly supervised small-footprint audio event detection", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3279-3283, Sep. 2018.
[Interspeech18c]	Adrien Le Franc, Eric Riebling, Julien Karadayi, Yun Wang, Camila Scaff, Florian Metze, and Alejandrina Cristia, "The ACLEW DiViMe: An easy-to-use diarization tool", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1383-1387, Sep. 2018.
[ICASSP18]	Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, and Samarjit Das, "A light-weight multimodal framework for improved environmental audio tagging", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6832-6836, Apr. 2018.
[CMU-proposal]	Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis proposal, Carnegie Mellon University, Oct. 2017.
[Interspeech17]	Yun Wang and Florian Metze, "A transfer learning based feature extractor for polyphonic sound event detection using connectionist temporal classification", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3097-3101, Aug. 2017.
[ICASSP17]	Yun Wang and Florian Metze, "A first attempt at polyphonic sound event detection using connectionist temporal classification", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2986-2990, Mar. 2017.
[ICMR16]	Yun Wang and Florian Metze, "Recurrent support vector machines for audio-based multimedia event detection", in Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), pp. 265-269, Jun. 2016.
[ICASSP16]	Yun Wang, Leonardo Neves, and Florian Metze, "Audio-based multimedia event detection using deep recurrent neural networks", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2742-2746, Mar. 2016.
[ISMIR15]	Guangyu Xia, Yun Wang, Roger Dannenberg, and Geoffrey Gordon, "Spectral learning for expressive interactive ensemble music performance", in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 816-822, Oct. 2015.
[ICASSP15]	Florian Metze, Ankur Gandhe, Yajie Miao, Zaid Sheikh, Yun Wang, Di Xu, Hao Zhang, Jungsuk Kim, Ian Lane, Won Kyum Lee, Sebastian Stüker, and Markus Müller, "Semi-supervised training in low-resource ASR and KWS", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4699-4703, Apr. 2015.
[SLT14]	Di Xu, Yun Wang, and Florian Metze, "EM-based phoneme confusion matrix generation for low-resource spoken term detection", in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 424-429, Dec. 2014.
[Interspeech14a]	Yun Wang and Florian Metze, "An in-depth comparison of keyword specific thresholding and sum-to-one score normalization", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2474-2478, Sep. 2014.
[Interspeech14b]	Justin Chiu, Yun Wang, Jan Trmal, Daniel Povey, Guoguo Chen, and Alexander Rudnicky, "Combination of FST and CN search in spoken term detection", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2784-2788, Sep. 2014.
[IROSIS]	Qin Jin and Yun Wang, "Integrated robust open-set speaker identification system (IROSIS)", CMU technical report, May 2012.
[ICASSP11a]	Yun Wang and Zhijian Ou, "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-4, May 2011.
[ICASSP11b]	Angeliki Metallinou, Athanassios Katsamanis, Yun Wang, and Shrikanth Narayanan, "Tracking changes in continuous emotion states using body language and prosodic cues", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2288-2291, May 2011.
[THU-thesis]	Yun Wang, "Separating singing voice and accompaniment from monaural audio", bachelor's thesis, Tsinghua University, Jun. 2010. (in Chinese)

11/2018 - 10/2025	Research Scientist, Voice Modeling Team, Meta Superintelligence Lab (MSL)
01/2015 - 04/2015	Software Engineer Intern, Language Technology Team

08/2012 - 10/2018	PhD, Language Technologies Institute (LTI) (Advisor: Florian Metze)
08/2010 - 08/2012	Master of Science, Language Technologies Institute (LTI) (Advisor: Qin Jin)

Work Experience

Meta (formerly Facebook)

Education

Carnegie Mellon University

Tsinghua University