Jitendra Malik, Arthur J. Chick Professor and Chair of the Department of Electrical Engineering and Computer Science at the University of California, Berkeley, presents the "Deep Visual Understanding from Deep Learning" tutorial at the May 2017 Embedded Vision Summit.
Deep learning and neural networks, coupled with high-performance computing, have led to remarkable advances in computer vision. For example, we now have good capabilities for detecting and localizing people and objects and for determining their 3D pose and layout in a scene. But we still fall well short of "visual understanding," a much larger problem.
Vision, for instance, helps guide manipulation and locomotion, which requires building dynamic models of the consequences of various actions. Further, we should not just detect people, objects, and actions but also link them together through what we call "visual semantic role labeling," essentially identifying subject-verb-object relationships. Finally, we should be able to make predictions: what will happen next in a video stream? In this talk, Professor Malik reviews progress in deep visual understanding, gives an overview of the state of the art, and offers a tantalizing glimpse of what the future holds.