“Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding,” a Presentation from Google

Zizhao Zhang, Staff Research Software Engineer and Tech Lead for Cloud AI Research at Google, presents the “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding” tutorial at the May 2022 Embedded Vision Summit.

Hierarchical structures are popular in vision transformers (ViTs). In this talk, Zhang presents the idea of nesting canonical local transformers on non-overlapping image blocks and aggregating them hierarchically. The resulting design, named NesT, is simpler than existing hierarchical designs and requires only minor code changes relative to the original ViT.
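To make the block nesting concrete, here is a minimal sketch in plain Python of the two structural steps described above: splitting an image into non-overlapping blocks, and counting how the block grid shrinks as neighbouring blocks are aggregated. It is an illustration only, not the implementation from the talk. The function names `partition_into_blocks` and `hierarchy_levels` are hypothetical, the merge of 2x2 neighbouring blocks per level is an assumption, and the local transformer that would run inside each block is omitted.

```python
def partition_into_blocks(image, block_size):
    """Split a 2D grid (a list of rows) into non-overlapping square blocks.

    Returns a grid of blocks: blocks[i][j] is a list of block_size rows,
    each holding block_size values.
    """
    height, width = len(image), len(image[0])
    if height % block_size or width % block_size:
        raise ValueError("image sides must be divisible by block_size")
    return [
        [
            [row[c:c + block_size] for row in image[r:r + block_size]]
            for c in range(0, width, block_size)
        ]
        for r in range(0, height, block_size)
    ]


def hierarchy_levels(blocks_per_side):
    """List the block-grid side length at each level of the hierarchy.

    Assumption: every level merges each 2x2 group of neighbouring blocks
    into one, so the side length halves until a single block remains.
    """
    if blocks_per_side < 1 or blocks_per_side & (blocks_per_side - 1):
        raise ValueError("blocks_per_side must be a power of two")
    levels = [blocks_per_side]
    while levels[-1] > 1:
        levels.append(levels[-1] // 2)
    return levels


# A toy 4x4 "image" whose values are simply 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = partition_into_blocks(image, 2)
print(blocks[0][0])         # top-left block: [[0, 1], [4, 5]]
print(hierarchy_levels(4))  # [4, 2, 1]
```

In a real model each block would hold patch embeddings rather than raw values, and a transformer would process each block independently before the blocks are merged.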

The benefits of this design are threefold:

  1. NesT converges faster and requires much less training data to generalize well, on both ImageNet and small datasets.
  2. When its key ideas are extended to image generation, NesT yields a strong decoder that is 8X faster than previous transformer-based generators.
  3. The nested hierarchy decouples feature learning from feature abstraction, which enables a novel method, named GradCAT, for visually interpreting the learned model.

See here for a PDF of the slides.
