IDEA-Research/grounding-dino-base

🧠 AI Model · IDEA-Research

Detect objects described by a text prompt without task-specific training (zero-shot detection).

Grounding DINO (base version) is a transformer-based model that extends the DINO object detection framework with language grounding. It takes an image and a free-form text prompt (e.g., "a red car") and outputs bounding boxes, each labeled with the phrase it matches. The architecture pairs a text encoder (BERT) with a visual encoder (Swin Transformer) and connects them through cross-modality fusion layers, so that textual and visual features can inform each other. Because it is trained on large-scale datasets that combine detection and grounding data, it generalizes well to categories it has not seen during training. Its key innovations are a feature enhancer module, which fuses the two modalities, and a language-guided query selection mechanism, which picks the image features most relevant to the prompt as detection queries. With over 1.5 million downloads, it is widely used for flexible detection tasks in robotics, image search, and visual grounding.
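As a rough illustration of the image-plus-prompt workflow described above, the sketch below assumes the model is loaded through the Hugging Face transformers library (AutoProcessor and AutoModelForZeroShotObjectDetection), which this page does not itself state. The helper names build_prompt and detect, and the file name in the usage comment, are hypothetical. The prompt convention used here (lowercase phrases, each ending in a period) is an assumption based on common usage of this checkpoint, and detection thresholds are left at the library defaults.

```python
def build_prompt(labels):
    """Join candidate phrases into one lowercase, period-terminated prompt.

    Assumption: the checkpoint expects lowercase phrases that each end in a period.
    """
    cleaned = [label.strip().lower().rstrip(".") for label in labels if label.strip()]
    return " ".join(f"{label}." for label in cleaned)


def detect(image, labels, model_id="IDEA-Research/grounding-dino-base"):
    """Run zero-shot detection on a PIL image; returns the result dict for that image.

    Hypothetical helper, not part of any library.
    """
    # Heavy dependencies are imported lazily so build_prompt works without them.
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    inputs = processor(images=image, text=build_prompt(labels), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports size as (width, height); target_sizes expects (height, width).
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )
    return results[0]


# Usage (requires torch, transformers, Pillow, and a model download):
#   from PIL import Image
#   result = detect(Image.open("street.jpg"), ["a red car"])  # hypothetical file
#   for box, score in zip(result["boxes"], result["scores"]):
#       print([round(v, 1) for v in box.tolist()], round(score.item(), 3))
```

Passing several phrases in one call (for example ["a red car", "a traffic light"]) lets a single forward pass search for all of them, which is usually cheaper than running the model once per phrase.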

💡Highlights

├─Zero-shot detection from text prompts
├─Combines DINO detection with language grounding
└─1.5M+ downloads, Apache-2.0 license

🎯For

├─Computer vision researchers
├─AI engineers
└─Robotics developers

🔗Links

└─HuggingFace Model Page