Zizhao Zhang, Staff Research Software Engineer and Tech Lead for Cloud AI Research at Google, presents the “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding” tutorial at the May 2022 Embedded Vision Summit.
Hierarchical structures are popular in vision transformers (ViTs) for computer vision. In this talk, Zhang presents a novel idea: nesting canonical local transformers on non-overlapping image blocks and aggregating them hierarchically. This new design, named NesT, leads to a simpler architecture than existing hierarchical designs and requires only minor code changes relative to the original ViT.
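The nested design described above can be illustrated with a minimal NumPy sketch: self-attention is applied locally within each non-overlapping block, then neighbouring blocks are merged level by level. The block sizes, the identity Q/K/V projections, and the pooling-based aggregation step here are illustrative assumptions, not the paper's exact implementation (which also includes linear projections, MLPs, and normalization layers).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(blocks):
    """Plain single-head self-attention applied independently per block.
    blocks: (num_blocks, tokens_per_block, dim)"""
    q = k = v = blocks  # identity projections keep the sketch short
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(blocks.shape[-1])
    return softmax(scores) @ v

def aggregate(blocks):
    """Merge every 4 adjacent blocks into one, then halve the token count
    by average-pooling token pairs (a stand-in for the convolution +
    pooling block-aggregation used in the paper)."""
    n, t, d = blocks.shape
    merged = blocks.reshape(n // 4, 4 * t, d)
    return merged.reshape(n // 4, 2 * t, 2, d).mean(axis=2)

# 16 non-overlapping blocks of 16 tokens each, embedding dim 32
x = rng.normal(size=(16, 16, 32))
for _ in range(2):          # two hierarchy levels: 16 -> 4 -> 1 blocks
    x = local_attention(x)  # canonical transformer, local to each block
    x = aggregate(x)        # merge neighbouring blocks, pool tokens
print(x.shape)  # (1, 64, 32): one block covering the whole image
```

Because attention is confined to each block, its cost grows with the block size rather than the full image size, while the aggregation step lets information flow across blocks at higher levels.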
The benefits of the proposed, judiciously selected design are threefold:
- NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets,
- When extended to image generation, NesT yields a strong decoder that is 8X faster than previous transformer-based generators, and
- Decoupling the feature learning and abstraction processes via the nested hierarchy enables a novel method (named GradCAT) for visually interpreting the learned model.
A PDF of the slides is available.