Sound event synthesis using WaveNet

This is a demonstration of sound event synthesis (SES) from event labels using the conditional WaveNet [1]. As the dataset, we used 10 different sound events (manual coffee grinder, cup clinking, alarm clock ringing, whistle, maracas, drum, electric shaver, trash box banging, tearing paper, bell ringing) contained in the RWCP-SSD (Real World Computing Partnership Sound Scene Database) [2].

You can download a zip file of original and synthesized sounds from here.
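For reference, the sketch below illustrates in PyTorch how an event label can serve as a global conditioning input to a WaveNet-style gated residual block, following the conditional formulation in [1]. This is a minimal illustrative sketch, not the implementation used to produce these samples; the class and parameter names (e.g. ConditionedResidualBlock, label_dim) are assumptions made here for clarity.

# Minimal sketch (assumed names, not the demo's actual code) of a dilated
# causal convolution block whose gated activation is conditioned on a
# one-hot sound event label, as in the conditional WaveNet of [1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedResidualBlock(nn.Module):
    """z = tanh(W_f * x + V_f h) * sigmoid(W_g * x + V_g h),
    where h is the one-hot event label broadcast over time."""

    def __init__(self, channels, label_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # causal left padding
        self.conv = nn.Conv1d(channels, 2 * channels,
                              kernel_size, dilation=dilation)
        self.cond = nn.Conv1d(label_dim, 2 * channels, 1)  # 1x1 conditioning
        self.res = nn.Conv1d(channels, channels, 1)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x, label):
        # x: (batch, channels, time); label: (batch, label_dim) one-hot
        h = label.unsqueeze(-1)                            # broadcast over time
        z = self.conv(F.pad(x, (self.pad, 0))) + self.cond(h)
        filt, gate = z.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)         # gated activation
        return x + self.res(z), self.skip(z)               # residual, skip

if __name__ == "__main__":
    block = ConditionedResidualBlock(channels=32, label_dim=10, dilation=4)
    audio = torch.randn(1, 32, 16000)                      # dummy features
    label = F.one_hot(torch.tensor([3]), num_classes=10).float()
    out, skip = block(audio, label)
    print(out.shape, skip.shape)                           # both (1, 32, 16000)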


・Manual coffee grinder

coffee_grinder_original.wav
Original sound


coffee_grinder_generated.wav
Synthesized sound

・Cup

cup_original.wav
Original sound


cup_generated.wav
Synthesized sound

・Clock

clock_original.wav
Original sound


clock_generated.wav
Synthesized sound

・Whistle

whistle_original.wav
Original sound


whistle_generated.wav
Synthesized sound

・Maracas

maracas_original.wav
Original sound


maracas_generated.wav
Synthesized sound

・Drum

drum_original.wav
Original sound


drum_generated.wav
Synthesized sound

・Shaver

shaver_original.wav
Original sound


shaver_generated.wav
Synthesized sound

・Trash box

trashbox_original.wav
Original sound


trashbox_generated.wav
Synthesized sound

・Tearing paper

tearing_original.wav
Original sound


tearing_generated.wav
Synthesized sound

・Bell

bell_original.wav
Original sound


bell_generated.wav
Synthesized sound

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv preprint arXiv:1609.03499, 2016.

[2] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, “Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-free Speech Recognition,” Proc. Language Resources and Evaluation Conference (LREC), pp. 965–968, 2000.