Pixtral
Mistral's vision-language model available via Mistral API and as open weights. Supports multiple images per prompt, high-resolution understanding, and code extraction from screenshots.
Qwen-VL
Qwen Visual Language model series from Alibaba. Strong at multilingual visual understanding, document parsing, and chart reading. Available as open weights on HuggingFace. Runs via vLLM.