google/gemma-3-12b-it

🧠 AI Modelgoogle

Google's open-source 12B multimodal model for image and text conversation.

Google's Gemma 3 12B IT (Instruction Tuned) is a state-of-the-art open-source multimodal model that processes both images and text, enabling conversational AI with visual understanding. With 12 billion parameters, it uses a transformer architecture with safetensors, and is trained on a diverse dataset. It supports the image-text-to-text pipeline, making it suitable for tasks like visual question answering, image captioning, and multimodal dialogue. The model is gated, meaning users must accept license terms before access, but is fully open-source. It references key research papers (e.g., arxiv:1905.07830, arxiv:1905.10044) and is part of Google's Gemma family, offering a balance of performance and efficiency for multimodal reasoning. Features include multilingual support, fine-tuning capabilities, and compatibility with the Hugging Face Transformers ecosystem.

💡Highlights

├─12B parameters, multimodal
├─Image-text-to-text pipeline
└─Open source from Google

🎯For

├─AI researchers
├─multimodal developers
└─open-source enthusiasts

🔗Links

└─Hugging Face Model Page