Yun Wang (Maigo)
Email · Facebook · Google Scholar · LinkedIn · GitHub · Zhihu

I am currently a research scientist on the speech team of Facebook AI Applied Research, where I maintain the sound event detection system.

I received my PhD from the Language Technologies Institute (LTI) of Carnegie Mellon University (CMU) in October 2018, where I worked with Prof. Florian Metze on sound event detection.

My research interests also include speech recognition and machine learning.

Work Experience

Facebook, Inc.

11/2018 ~ present    Research scientist, speech, Facebook AI Applied Research
01/2015 ~ 04/2015    Software engineer intern, Language Technology group

Education

Carnegie Mellon University

08/2012 ~ 10/2018    PhD student at the Language Technologies Institute (LTI)
08/2010 ~ 08/2012    Master's student at the Language Technologies Institute (LTI)

Tsinghua University

08/2006 ~ 07/2010    Undergraduate student in the Dept. of Electronic Engineering

Research Projects

Sound Event Detection with Weak Labeling (05/2015 - 10/2018)

Sound event detection (SED) is the task of classifying and localizing semantically meaningful units of sound, such as car engine noise and dog barks, in audio streams. Because it is expensive to obtain strong labeling that specifies the onset and offset times of each event occurrence, I focused on how to train SED systems with weak labeling, in which the temporal information is incomplete. I dealt with two types of weak labeling: (1) for presence/absence labeling, which provides no temporal information at all, I built SED systems using the multiple instance learning (MIL) framework and compared many types of pooling functions; (2) for sequential labeling, which specifies the order of event boundaries, I adapted the connectionist temporal classification (CTC) framework from speech recognition and used it for SED.
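
To make the MIL setting concrete, here is a minimal sketch (not code from the thesis) of how frame-level event probabilities can be pooled into a recording-level probability. The max and linear-softmax functions shown are two pooling functions of the kind compared in [ICASSP19a]; the array of frame probabilities is made up for illustration.

```python
import numpy as np

# Frame-level probabilities of one event class for a single recording (toy values).
frame_probs = np.array([0.05, 0.10, 0.90, 0.20])

def max_pooling(p):
    # The recording-level probability is driven entirely by the most confident frame.
    return np.max(p)

def linear_softmax_pooling(p):
    # Frames with larger probabilities receive proportionally larger weights.
    return np.sum(p ** 2) / np.sum(p)

print(max_pooling(frame_probs))             # 0.9
print(linear_softmax_pooling(frame_probs))  # ~0.69
```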

In the first year of this project, I also worked on multimedia event detection, which aims to identify the high-level event taking place in a video or audio recording, such as a soccer game or a birthday party.

Main publications: [ICASSP16], [ICASSP17], [Interspeech17], [Interspeech18a], [ICASSP19a], [ICASSP19b].

Other publications: [ICMR16], [ICASSP18], [Interspeech18b], [ESWA19].

PhD thesis proposal: [CMU-proposal].

PhD thesis: [CMU-thesis].

Keyword Search in Low-Resource Languages (03/2012 - 12/2014)

I participated in the IARPA Babel project, whose goal was to build speech recognizers and keyword detection systems for low-resource languages such as Pashto and Tagalog. I rewrote the code that generates confusion networks (which can be regarded as indices for keyword detection) from lattices (the output of speech recognizers), making it shorter, faster, and better than before. I also maintained the keyword detection toolkit, and built web-based error analysis tools for the Radical team, of which CMU was a part.

Publications: [Interspeech14a], [Interspeech14b], [SLT14], [ICASSP15].

Robust Open-Set Speaker Identification (10/2010 - 02/2012)

I built an open-set speaker identification system. To deal with noise and channel mismatches, I explored an extensive variety of acoustic features. I also compared several speaker modeling techniques, including GMM-UBM (Gaussian mixture models - universal background model), GMM-SVM (support vector machines using GMM supervectors), and JFA (joint factor analysis, a precursor to the i-vector technique, which soon became mainstream).
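
As an illustration of the GMM-UBM approach, the sketch below uses scikit-learn with randomly generated features standing in for real acoustic features: it trains a universal background model, MAP-adapts its means towards an enrollment speaker, and scores a test utterance with an average log-likelihood ratio. The feature dimensions, number of components, and relevance factor are illustrative assumptions, not the settings used in the project.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical feature matrices (frames x dims); a real system would use e.g. MFCCs.
rng = np.random.default_rng(0)
background_feats = rng.normal(size=(5000, 20))        # pooled data from many speakers
enroll_feats = rng.normal(loc=0.5, size=(300, 20))    # enrollment data for one speaker
test_feats = rng.normal(loc=0.5, size=(200, 20))      # one test utterance

# 1. Train the universal background model (UBM) on the background data.
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(background_feats)

# 2. MAP-adapt the UBM means towards the enrollment data (mean-only adaptation).
relevance = 16.0
resp = ubm.predict_proba(enroll_feats)                # frame-level responsibilities
n_c = resp.sum(axis=0) + 1e-10                        # soft frame counts per component
e_x = resp.T @ enroll_feats / n_c[:, None]            # per-component data means
alpha = (n_c / (n_c + relevance))[:, None]
speaker_gmm = copy.deepcopy(ubm)
speaker_gmm.means_ = alpha * e_x + (1.0 - alpha) * ubm.means_

# 3. Score the test utterance: average log-likelihood ratio of speaker model vs. UBM.
llr = np.mean(speaker_gmm.score_samples(test_feats) - ubm.score_samples(test_feats))
print(llr)  # higher means the test utterance better matches the enrolled speaker
```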

Technical report: [IROSIS].

Separating Singing Voice and Accompaniment from Monaural Audio (02/2010 - 06/2010)

I built a system to separate the singing voice and the accompaniment from real-world music files. It first extracts the melody using a hidden Markov model (HMM) and features based on harmonic summation, then separates the singing voice and accompaniment using non-negative matrix factorization (NMF). Evaluation on two public databases shows that the system reaches state-of-the-art performance. The separated accompaniment has a quality high enough to be used in singing performances.
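
The sketch below illustrates only the NMF soft-masking step, on a toy magnitude spectrogram and with scikit-learn's NMF. The random data, the number of components, and the way components are split between voice and accompaniment are illustrative assumptions; in the actual system, the split is guided by the melody extracted with the HMM.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy mixture magnitude spectrogram (freq bins x frames); a real one would come from an STFT.
rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(513, 200)))

nmf = NMF(n_components=20, init='random', random_state=0, max_iter=200)
W = nmf.fit_transform(V)              # spectral basis vectors (freq bins x components)
H = nmf.components_                   # activations (components x frames)

# Assume the first 10 components belong to the voice and the rest to the accompaniment
# (an arbitrary split here; the real system decides this using the extracted melody).
V_voice = W[:, :10] @ H[:10, :]
V_accomp = W[:, 10:] @ H[10:, :]

# Wiener-style soft mask: each source takes its share of the mixture at every bin.
eps = 1e-10
mask_voice = V_voice / (V_voice + V_accomp + eps)
voice_spec = mask_voice * V           # masked magnitude spectrogram of the singing voice
accomp_spec = (1.0 - mask_voice) * V  # and of the accompaniment
```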

This project is open-source on GitHub, and a demo of some separation results can be found here.

Publications: [THU-thesis], [ICASSP11a].

Bit Vector Indexing for Query by Humming (09/2007 - 06/2008)

Query by humming (QBH) is a form of music retrieval, where users can search for music by humming a piece of melody. It is supported by websites such as SoundHound. In QBH systems, music clips are represented by their "fingerprints", which are points in a high-dimensional space, and bit vector indices (BVI) can be used to speed up the search in this space. I built a bit vector indexing system which ran 60 times faster than a naïve one-by-one comparison.
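
As an illustration of the idea, below is one plausible bit-vector filtering scheme in Python (not necessarily the exact index used in the project): each fingerprint dimension is quantized into buckets, a bitmap records which songs fall into each bucket (with a one-bucket tolerance), and a query intersects bitmaps across dimensions so that only the few surviving candidates need an exact distance check. All sizes and the tolerance are made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_songs, n_dims, n_buckets = 10000, 16, 8
fingerprints = rng.random((n_songs, n_dims))          # hypothetical melody fingerprints

edges = np.linspace(0.0, 1.0, n_buckets + 1)[1:-1]    # bucket boundaries (shared by all dims)
buckets = np.digitize(fingerprints, edges)            # bucket id of each coordinate

# bitmaps[d, b] marks the songs whose d-th coordinate falls in bucket b or a neighbor,
# so that small humming errors do not knock the true song out of the candidate set.
bitmaps = np.zeros((n_dims, n_buckets, n_songs), dtype=bool)
for d in range(n_dims):
    for b in range(n_buckets):
        bitmaps[d, b] = np.abs(buckets[:, d] - b) <= 1

def search(query):
    q_buckets = np.digitize(query, edges)
    candidates = np.ones(n_songs, dtype=bool)
    for d in range(n_dims):
        candidates &= bitmaps[d, q_buckets[d]]        # bitwise AND across dimensions
    idx = np.flatnonzero(candidates)                  # only a few candidates survive
    dists = np.linalg.norm(fingerprints[idx] - query, axis=1)
    return idx[np.argmin(dists)]                      # exact check on the survivors only

print(search(fingerprints[42] + 0.01 * rng.standard_normal(n_dims)))  # most likely 42
```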

Japanese Vocal Music Synthesis (01/2007 - 02/2007)

I built a simple system to synthesize vocal music in Japanese by adjusting the pitch and duration of pre-recorded syllables according to the music score. I did this before even learning about the Fourier transform.

This project is open-source on GitHub.

Publications

[Interspeech21a] Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, and Christian Fuegen, "Do sound event representations generalize to other audio tasks? A case study in audio transfer learning", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1214-1218, Aug.-Sep. 2021. (also available on arXiv)
[Interspeech21b] Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen, "A two-stage approach to speech bandwidth extension", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1689-1693, Aug.-Sep. 2021.
[IJCNN21] Yuzhuo Liu, Hangting Chen, Yun Wang, and Pengyuan Zhang, "Power pooling: An adaptive pooling function for weakly labelled sound event detection", in Proceedings of the International Joint Conference on Neural Networks (IJCNN), Jul. 2021. (also available on arXiv)
[ICASSP21] Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen, "A time-domain convolutional recurrent network for packet loss concealment", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7133-7137, Jun. 2021.
[SLT21] Ashutosh Pandey, Chunxi Liu, Yun Wang, and Yatharth Saraf, "Dual application of speech enhancement for automatic speech recognition", in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 223-228, Jan. 2021. (also available on arXiv)
[ICASSP19a] Yun Wang, Juncheng Li, and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31-35, May 2019. (also available on arXiv)
[ICASSP19b] Yun Wang and Florian Metze, "Connectionist temporal localization for sound event detection with sequential labeling", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 745-749, May 2019. (also available on arXiv)
[ESWA19] Pierre Laffitte, Yun Wang, David Sodoyer, and Laurent Girin, "Assessing the performances of different neural network architectures for the detection of screams and shouts in public transportation", in Expert Systems With Applications, vol. 117, pp. 29-41, Mar. 2019.
[CMU-thesis] Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis, Carnegie Mellon University, Oct. 2018.
[Interspeech18a] Yun Wang, Juncheng Li, and Florian Metze, "Comparing the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence learning tasks", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1339-1343, Sep. 2018. (also available on arXiv)
[Interspeech18b] Shao-Yen Tseng, Juncheng Li, Yun Wang, Florian Metze, Joseph Szurley, and Samarjit Das, "Multiple instance deep learning for weakly supervised small-footprint audio event detection", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3279-3283, Sep. 2018.
[Interspeech18c] Adrien Le Franc, Eric Riebling, Julien Karadayi, Yun Wang, Camila Scaff, Florian Metze, and Alejandrina Cristia, "The ACLEW DiViMe: An easy-to-use diarization tool", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 1383-1387, Sep. 2018.
[ICASSP18] Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, and Samarjit Das, "A light-weight multimodal framework for improved environmental audio tagging", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6832-6836, Apr. 2018.
[CMU-proposal] Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis proposal, Carnegie Mellon University, Oct. 2017.
[Interspeech17] Yun Wang and Florian Metze, "A transfer learning based feature extractor for polyphonic sound event detection using connectionist temporal classification", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 3097-3101, Aug. 2017.
[ICASSP17] Yun Wang and Florian Metze, "A first attempt at polyphonic sound event detection using connectionist temporal classification", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2986-2990, Mar. 2017.
[ICMR16] Yun Wang and Florian Metze, "Recurrent support vector machines for audio-based multimedia event detection", in Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), pp. 265-269, Jun. 2016.
[ICASSP16] Yun Wang, Leonardo Neves, and Florian Metze, "Audio-based multimedia event detection using deep recurrent neural networks", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2742-2746, Mar. 2016.
[ISMIR15] Guangyu Xia, Yun Wang, Roger Dannenberg, and Geoffrey Gordon, "Spectral learning for expressive interactive ensemble music performance", in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 816-822, Oct. 2015.
[ICASSP15] Florian Metze, Ankur Gandhe, Yajie Miao, Zaid Sheikh, Yun Wang, Di Xu, Hao Zhang, Jungsuk Kim, Ian Lane, Won Kyum Lee, Sebastian Stüker, and Markus Müller, "Semi-supervised training in low-resource ASR and KWS", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4699-4703, Apr. 2015.
[SLT14] Di Xu, Yun Wang, and Florian Metze, "EM-based phoneme confusion matrix generation for low-resource spoken term detection", in Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 424-429, Dec. 2014.
[Interspeech14a] Yun Wang and Florian Metze, "An in-depth comparison of keyword specific thresholding and sum-to-one score normalization", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2474-2478, Sep. 2014.
[Interspeech14b] Justin Chiu, Yun Wang, Jan Trmal, Daniel Povey, Guoguo Chen, and Alexander Rudnicky, "Combination of FST and CN search in spoken term detection", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2784-2788, Sep. 2014.
[IROSIS] Qin Jin and Yun Wang, "Integrated robust open-set speaker identification system (IROSIS)", CMU technical report, May 2012.
[ICASSP11a] Yun Wang and Zhijian Ou, "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-4, May 2011.
[ICASSP11b] Angeliki Metallinou, Athanassios Katsamanis, Yun Wang, and Shrikanth Narayanan, "Tracking changes in continuous emotion states using body language and prosodic cues", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2288-2291, May 2011.
[THU-thesis] Yun Wang, "Separating singing voice and accompaniment from monaural audio", bachelor's thesis, Tsinghua University, Jun. 2010. (in Chinese)

Miscellaneous

I know seven languages: 汉语 (Mandarin Chinese), English, 日本語 (Japanese), 한국어 (Korean), español (Spanish), français (French), and tiếng Việt (Vietnamese). I can also sing in 粵語 (Cantonese).

I am proficient in the following programming languages: C, Java, Python, and MATLAB.

I am the author of the Android app MCPDict (漢字古今中外讀音查詢). With this app, you can look up the pronunciation of Chinese characters in Middle Chinese, various modern Chinese dialects, and other languages in East Asia. This app is open-source on GitHub; an iOS version and a web version have been developed by collaborators.

I am an active user of Zhihu (知乎, a Chinese Q&A website like Quora). I answer questions and write articles on languages and linguistics, math, and machine learning.