IDEA-Research/grounding-dino-base
🧠 AI Model · IDEA-Research
Detect any object from a text prompt, with no task-specific training (zero-shot detection).
Grounding DINO (base version) is a transformer-based model that extends the DINO object detection framework with language grounding. It takes an image and a free-form text prompt (e.g., 'a red car') and outputs bounding boxes and labels for the objects that match the description. The architecture pairs a text encoder (BERT) with a visual encoder (Swin Transformer) and connects them through cross-modality fusion layers, allowing rich interaction between textual and visual features. Because it is trained on large-scale datasets that combine detection and grounding data, it generalizes well to categories it has not seen during training. Key innovations include a feature enhancer module and a language-guided query selection mechanism. With over 1.5 million downloads, it is widely used for flexible detection tasks in robotics, image search, and visual grounding.
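To make the image-plus-prompt workflow concrete, here is a minimal sketch of running the model through the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint can be downloaded; the helper names `build_prompt` and `detect` are hypothetical, introduced only for illustration. The prompt helper reflects the checkpoint's convention that text queries are lowercased and each phrase ends with a period. Post-processing is left at the library's default thresholds, so results may need tuning for a given use case.

```python
def build_prompt(labels):
    """Join class phrases into one prompt: lowercased, each ending with a period.

    Hypothetical helper; it only normalizes text and does not touch the model.
    """
    parts = [label.strip().lower().rstrip(".").strip() for label in labels]
    parts = [p for p in parts if p]  # drop empty entries
    return " ".join(p + "." for p in parts)


def detect(image_path, labels, model_id="IDEA-Research/grounding-dino-base"):
    """Run zero-shot detection on one image (sketch, default thresholds).

    Heavy dependencies are imported lazily so the prompt helper above
    can be used without them.
    """
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=build_prompt(labels), return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes expects (height, width); PIL's image.size is (width, height).
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )
    return results[0]  # dict containing "scores", "labels", and "boxes"
```

A call such as `detect("street.jpg", ["a red car", "a traffic light"])` (with a hypothetical image path) would return the detections for that single image, with boxes in pixel coordinates.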
💡 Highlights
- Zero-shot detection from text prompts
- Combines DINO detection with language grounding
- 1.5M+ downloads, Apache-2.0 license
🎯 For
- Computer vision researchers
- AI engineers
- Robotics developers