AI GLOSSARY
Vision-Language Model
VLM
Neural Network Architectures
A multimodal model that jointly processes and reasons over images and text, enabling capabilities such as visual question answering, image captioning, and document understanding. VLMs are trained to align visual and linguistic representations so that the model can connect what it sees with what it reads, a capability increasingly central to real-world AI applications.
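The core idea of aligning visual and linguistic representations can be sketched in a few lines: each modality is encoded into its own feature vector, projected into a shared embedding space, and compared by cosine similarity. The sketch below uses random matrices and random "features" as stand-ins for trained encoders; the dimensions and projection names are illustrative assumptions, not any particular model's architecture.

```python
import numpy as np

# Toy sketch of VLM alignment: project image and text features into a shared
# space and score them with cosine similarity. Projections and features are
# random stand-ins for illustration, not a trained model.

rng = np.random.default_rng(0)

DIM_IMG, DIM_TXT, DIM_SHARED = 512, 768, 256

# Hypothetical learned projection matrices (random here).
W_img = rng.normal(size=(DIM_IMG, DIM_SHARED))
W_txt = rng.normal(size=(DIM_TXT, DIM_SHARED))

def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z)

# Stand-in features for one image and two candidate captions.
image_feat = rng.normal(size=DIM_IMG)
caption_feats = [rng.normal(size=DIM_TXT) for _ in range(2)]

img_emb = embed(image_feat, W_img)
scores = [float(img_emb @ embed(c, W_txt)) for c in caption_feats]

# In a trained VLM, the caption matching the image would score highest;
# with random weights the scores are simply small values near zero.
best = int(np.argmax(scores))
print(best, [round(s, 3) for s in scores])
```

Contrastively trained models such as CLIP follow this pattern at scale: training pulls matching image-text pairs together in the shared space and pushes mismatched pairs apart, which is what makes the similarity score meaningful.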