Class-aware Sounding Objects Localization via Audiovisual Correspondence


Humans can easily localize sounding objects and recognize their categories. A recent paper published on arXiv.org investigates how machine intelligence could also benefit from such audiovisual correspondence.




Image credit: Wikimedia Commons, Public Domain via Rawpixel

The researchers propose a two-stage step-by-step learning framework to pursue class-aware sounding objects localization, starting from single sound scenarios and then expanding to cocktail-party cases.


The correspondence between object visual representations and category knowledge is learned using only the alignment between audio and vision as supervision. The curriculum makes it possible to filter out silent objects in complex scenarios. Experiments show that the method solves the task in music scenes as well as in harder cases where the same object can produce different sounds. Furthermore, the object localization framework learned from audiovisual consistency can be applied to the object detection task.
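The coarse-grained audiovisual correspondence of the first stage can be pictured as a similarity map between a global audio embedding and per-position visual features: regions whose visual features align with the sound score high. The sketch below is illustrative only (plain NumPy, hypothetical function and variable names), not the authors' implementation:

```python
import numpy as np

def localization_map(visual_feats, audio_emb):
    """Coarse-grained audiovisual localization (illustrative sketch).

    visual_feats: (H, W, D) per-position visual features
    audio_emb:    (D,) global audio embedding
    Returns an (H, W) map of cosine similarities; high values mark
    regions whose visual features align with the sound.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return v @ a  # (H, W)

# Synthetic check: plant a "sounding object" feature at one grid cell.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
feats = rng.normal(size=(H, W, D)) * 0.1
obj = rng.normal(size=D)
feats[2, 3] = obj  # the sounding object's location
amap = localization_map(feats, obj)
peak = tuple(map(int, np.unravel_index(amap.argmax(), amap.shape)))
print(peak)  # the map peaks at the planted object location
```

In the actual framework both embeddings would come from trained audio and visual networks; the point here is only that alignment between the two modalities is enough to highlight the sounding region without category labels.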


Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance.
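The second stage's category-representation dictionary and silent-area suppression can be sketched in the same spirit: per-class maps come from comparing visual features against each category prototype, and classes the audio does not support are zeroed out. This is a simplified, hypothetical rendering (thresholded class probabilities stand in for the paper's audiovisual consistency objective), not the published model:

```python
import numpy as np

def class_aware_maps(visual_feats, obj_dict, audio_class_probs, thresh=0.5):
    """Class-aware localization with silent-object suppression (sketch).

    visual_feats:      (H, W, D) per-position visual features
    obj_dict:          (K, D) category-representation dictionary
    audio_class_probs: (K,) predicted distribution over sounding categories
    Returns (K, H, W) maps; classes the audio deems silent are zeroed.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    d = obj_dict / (np.linalg.norm(obj_dict, axis=-1, keepdims=True) + 1e-8)
    maps = np.einsum('hwd,kd->khw', v, d)            # per-class similarity maps
    sounding = (audio_class_probs >= thresh)[:, None, None]
    return maps * sounding                           # suppress silent classes

# Cocktail-party toy case: two categories, only class 0 is sounding.
rng = np.random.default_rng(1)
H, W, D, K = 4, 4, 8, 2
feats = rng.normal(size=(H, W, D)) * 0.1
protos = rng.normal(size=(K, D))
feats[1, 2] = protos[0]                              # class-0 object in frame
maps = class_aware_maps(feats, protos, np.array([0.9, 0.1]))
```

After suppression, the silent class's map is all zeros while the sounding class's map still peaks at its object, mirroring how the dictionary lets the model both name the sounding object and ignore visible-but-silent ones.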