Visual Onoma-to-wave

We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound-source images. A visual onomatopoeia (a visual rendering of onomatopoeic text) carries rich information that plain text lacks, such as sound duration suggested by the length of the rendered characters, so using this representation is expected to yield more diverse synthesized sounds. The proposed method transfers the visual concepts of the visual text and the sound-source image to the synthesized sound. We also propose a data augmentation method that exploits the repetition of onomatopoeias to improve the performance of our method.
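As a rough illustration of repetition-based augmentation (not the paper's actual implementation), one can generate training variants of an onomatopoeia by varying how many times its repeating unit occurs. The function name, the fixed unit length, and the repeat range below are all hypothetical choices for this sketch:

```python
def augment_by_repetition(onomatopoeia: str, unit_len: int, max_repeats: int = 4) -> list[str]:
    """Generate variants of an onomatopoeia by repeating its unit
    a different number of times, e.g. "konkon" -> "kon", "konkonkon".
    This is an illustrative sketch, not the method from the paper.
    """
    unit = onomatopoeia[:unit_len]  # assume the repeating unit is the first unit_len chars
    return [unit * n for n in range(1, max_repeats + 1)]

# Example: "konkon" with a 3-character unit "kon"
print(augment_by_repetition("konkon", 3))
# ['kon', 'konkon', 'konkonkon', 'konkonkonkon']
```

Pairing such textual variants with correspondingly shortened or lengthened audio segments is one plausible way to exploit the duration cue carried by repeated onomatopoeias.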

Demo page:


GitHub repository: