"Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding," a Presentation from Google

Zizhao Zhang, Staff Research Software Engineer and Tech Lead for Cloud AI Research at Google, presents the “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding” tutorial at the May 2022 Embedded Vision Summit.

Hierarchical structures are popular in vision transformers (ViT). In this talk, Zhang presents a novel idea: nesting canonical local transformers on non-overlapping image blocks and aggregating them hierarchically. The resulting design, named NesT, has a simpler architecture than existing hierarchically structured designs and requires only minor code changes relative to the original ViT.
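The nesting idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code from the talk: the helper names (`blockify`, `unblockify`, `downsample2x`, `nested_hierarchy`) are hypothetical, the per-block transformer is omitted, and 2 x 2 mean pooling stands in for whatever aggregation the actual model uses. It assumes a square grid whose side is the block size times a power of two.

```python
def blockify(grid, block):
    """Split an H x W grid (list of rows) into non-overlapping block x block tiles.

    Returns the tiles in row-major order; each tile is itself a list of rows.
    """
    h, w = len(grid), len(grid[0])
    if h % block or w % block:
        raise ValueError("grid size must be divisible by block size")
    return [
        [row[c:c + block] for row in grid[r:r + block]]
        for r in range(0, h, block)
        for c in range(0, w, block)
    ]


def unblockify(tiles, tiles_per_row):
    """Inverse of blockify: stitch row-major tiles back into one grid."""
    block = len(tiles[0])
    grid = []
    for start in range(0, len(tiles), tiles_per_row):
        row_of_tiles = tiles[start:start + tiles_per_row]
        for i in range(block):
            grid.append([v for tile in row_of_tiles for v in tile[i]])
    return grid


def downsample2x(grid):
    """Placeholder aggregation (an assumption): 2 x 2 mean pooling over the grid."""
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


def nested_hierarchy(grid, block):
    """Return the number of blocks at each level until a single block remains."""
    counts = []
    while True:
        tiles = blockify(grid, block)
        counts.append(len(tiles))
        # A real model would run a local transformer on each tile independently here.
        if len(tiles) == 1:
            return counts
        grid = downsample2x(unblockify(tiles, len(grid[0]) // block))


# An 8 x 8 grid of scalar "tokens", split into 2 x 2 blocks.
grid = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
print(nested_hierarchy(grid, block=2))
```

In this sketch the block size stays fixed while each aggregation step halves the grid, so the number of blocks shrinks by a factor of four per level until one block covers the whole image.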

The benefits of this design are threefold:

NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets.
When its key ideas are extended to image generation, NesT yields a strong decoder that is 8X faster than previous transformer-based generators.
Decoupling feature learning from abstraction via the nested hierarchy enables a novel method, named GradCAT, for visually interpreting the learned model.

See here for a PDF of the slides.

