PaliGemma
Google's open-source multimodal model combining SigLIP vision encoder with Gemma LLM. Strong at document understanding, OCR, image captioning, and visual QA. Available via HuggingFace.
Qwen-VL
Qwen Visual Language model series from Alibaba. Strong at multilingual visual understanding, document parsing, and chart reading. Available as open weights on HuggingFace. Runs via vLLM.