Visual Onoma-to-wave

We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. A visual onomatopoeia (the visual text of an onomatopoeia) contains rich information that is not present in the plain text, such as the duration of the sound, so using this representation is expected to enable the synthesis of more diverse sounds. The proposed method transfers visual concepts of the visual text and of the sound-source image to the synthesized sound. We also propose a data augmentation method that focuses on the repetition of onomatopoeias to improve the performance of our method, as sketched in the example below.
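To make these two ideas concrete, here is a minimal conceptual sketch in Python: it renders an onomatopoeia as an image (a visual onomatopoeia) and builds a repetition-augmented (image, waveform) training pair. All function names, the font path, and the exact augmentation rule are assumptions for illustration, not the repository's actual implementation.

```python
# Conceptual sketch only; names and logic are hypothetical and do NOT
# reflect the code in the visual-onoma-to-wave repository.
import numpy as np
from PIL import Image, ImageDraw, ImageFont


def render_visual_onomatopoeia(text: str, font_path: str, size: int = 64) -> Image.Image:
    """Render an onomatopoeia string as an image ('visual onomatopoeia')."""
    font = ImageFont.truetype(font_path, size)
    # Measure the text, then draw it in black on a white canvas.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("L", (right - left, bottom - top), color=255)
    ImageDraw.Draw(img).text((-left, -top), text, font=font, fill=0)
    return img


def augment_by_repetition(image: Image.Image, wav: np.ndarray, n: int):
    """Pair an n-times repeated visual onomatopoeia with the
    correspondingly repeated waveform (assumes a repeatable sound event)."""
    tiled_img = Image.new("L", (image.width * n, image.height), color=255)
    for i in range(n):
        tiled_img.paste(image, (i * image.width, 0))
    tiled_wav = np.tile(wav, n)
    return tiled_img, tiled_wav
```

The intuition behind such an augmentation is that a longer visual onomatopoeia (e.g., a repeated 「コン」) corresponds to a repeated sound event, so repeating both modalities together yields plausible new training pairs.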

Demo page: https://sarulab-speech.github.io/demo_visual-onoma-to-wave/

Paper: https://arxiv.org/abs/2210.09173

GitHub repository: https://github.com/sarulab-speech/visual-onoma-to-wave