Mumtaz Vauhkonen, Lead Distinguished Scientist and Head of Computer Vision for Cognitive AI in AI&D at Verizon, presents the “Unifying Computer Vision and Natural Language Understanding for Autonomous Systems” tutorial at the May 2022 Embedded Vision Summit.
As the applications of autonomous systems expand, many such systems need to perceive coherently through both vision and language. For example, some systems need to translate a visual scene into language. Others need to follow language-based instructions while operating in environments that they understand visually. Still others need to combine visual and language inputs to understand their environments.
In this talk, Vauhkonen introduces popular approaches to joint language-vision perception. She also presents a unique deep learning, rule-based approach that uses a universal language object model. This model derives rules and learns a universal language of object interaction and reasoning structure from a corpus, then applies them to visually detected objects. She shows that the approach works reliably for frequently occurring actions, and that this type of model can be localized for specific environments and can communicate with humans and other autonomous systems.
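As a rough illustration of the general idea of deriving interaction rules from a corpus and applying them to detected objects, here is a minimal Python sketch. It is not the model presented in the talk: simple frequency counting stands in for the learned deep learning model, and the toy corpus, the `learn_rules` and `describe` functions, and the `min_count` threshold are all hypothetical names introduced here for illustration. The threshold loosely mirrors the point that such rules are most reliable for frequently occurring actions.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy corpus of (subject, action, object) triples.
# A real system would extract these from a large text corpus.
CORPUS = [
    ("person", "rides", "bicycle"),
    ("person", "rides", "bicycle"),
    ("person", "rides", "horse"),
    ("person", "holds", "cup"),
    ("person", "holds", "cup"),
    ("person", "holds", "cup"),
    ("dog", "chases", "ball"),
    ("person", "throws", "ball"),
    ("person", "throws", "ball"),
    ("person", "kicks", "ball"),
]


def learn_rules(triples, min_count=2):
    """Keep the most frequent action for each (subject, object) pair.

    Pairs whose top action occurs fewer than min_count times are dropped,
    so only frequently observed interactions become rules.
    """
    counts = defaultdict(Counter)
    for subj, action, obj in triples:
        counts[(subj, obj)][action] += 1

    rules = {}
    for pair, actions in counts.items():
        action, n = actions.most_common(1)[0]
        if n >= min_count:
            rules[pair] = action
    return rules


def describe(detected_labels, rules):
    """Apply the learned rules to object labels from a (hypothetical) detector."""
    sentences = []
    unique_labels = sorted(set(detected_labels))
    for subj, obj in permutations(unique_labels, 2):
        action = rules.get((subj, obj))
        if action is not None:
            sentences.append(f"{subj} {action} {obj}")
    return sentences


if __name__ == "__main__":
    rules = learn_rules(CORPUS)
    # Labels as they might come from an object detector run on one image.
    print(describe(["person", "cup", "dog"], rules))
```

In this sketch the detector output is just a list of class labels. A fuller version would also use spatial relations between bounding boxes to decide which candidate interactions are plausible in a given scene.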
See here for a PDF of the slides.