AI GLOSSARY

Vision Transformer

ViT

Neural Network Architectures

A transformer architecture adapted for image processing. An image is divided into fixed-size patches, and each patch is treated as a token, analogous to a word in a language model. Vision transformers showed that the transformer architecture, originally designed for language, can match or exceed convolutional neural networks on image tasks when trained on sufficiently large datasets, and they have become a leading architecture in computer vision.
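The patch-to-token step can be illustrated with a minimal sketch in plain Python. The function name `patchify`, the nested-list image layout (height x width x channels), and the toy 4x4 image are assumptions made for illustration, not part of any particular library. In a typical vision transformer, each flattened patch would then be linearly projected to an embedding and combined with a position embedding before entering the transformer layers; that part is omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns the patches in row-major order; each patch is a flat list of
    patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(image[row][col])  # append all channel values
            patches.append(token)
    return patches


# Hypothetical toy input: a 4x4 image with 3 channels, where every channel
# of pixel (r, c) holds the value r * 4 + c.
image = [[[r * 4 + c] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))  # prints "4 12"
```

Here a 4x4 image with 2x2 patches yields 4 tokens of 12 values each (2 x 2 pixels x 3 channels). The sequence length grows with image area divided by patch area, which is why patch size is a key cost trade-off in these models.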
