
AI GLOSSARY

Vision-Language Model

VLM · Neural Network Architectures

A multimodal model that jointly processes and reasons about images and text, enabling capabilities such as visual question answering, image captioning, and document understanding. VLMs are trained to align visual and linguistic representations so that the model can connect what it sees with what it reads, a capability increasingly central to real-world AI applications.
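The alignment idea above can be sketched with a toy example. This is not a trained model: the "encoders" below are just random linear projections standing in for a real image encoder and text encoder (names like `W_img` and `embed` are illustrative, not from any library). The sketch shows the core mechanism used in contrastive vision-language training, CLIP-style: project both modalities into one shared embedding space, then score image-text pairs by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real encoders: a VLM pairs an image encoder and a
# text encoder whose outputs land in one shared embedding space.
D_IMG, D_TXT, D_SHARED = 16, 12, 8
W_img = rng.normal(size=(D_IMG, D_SHARED))  # hypothetical image projection
W_txt = rng.normal(size=(D_TXT, D_SHARED))  # hypothetical text projection

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

# Fake "features" for one image and two candidate captions.
image_feat = rng.normal(size=D_IMG)
caption_feats = rng.normal(size=(2, D_TXT))

img_z = embed(image_feat, W_img)
cap_z = np.stack([embed(c, W_txt) for c in caption_feats])

# Alignment scores: cosine similarity between the image and each caption.
# A trained VLM would assign the higher score to the matching caption.
scores = cap_z @ img_z
best = int(np.argmax(scores))
```

In a real VLM the projections are learned so that matching image-caption pairs score high and mismatched pairs score low; downstream tasks like captioning and visual question answering build on this shared space.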