
demidovd98/sm-vit
📦 Open Source Project by demidovd98
A Vision Transformer architecture utilizing salient mask guidance to improve fine-grained image classification accuracy.
SM-ViT addresses fine-grained classification by adding a Salient Mask-Guided module to the Vision Transformer (ViT) architecture. A standard ViT has no explicit prior about which image patches matter, whereas SM-ViT uses a saliency-guided mechanism to highlight informative regions. By suppressing background patches and emphasizing salient features, the model distinguishes visually similar categories more accurately. The repository provides the official implementation in Python, including training scripts and model architecture definitions. It is aimed at researchers and developers working on computer vision tasks that require high-level feature extraction from complex, fine-grained datasets. The project demonstrates how spatial attention can be augmented with saliency priors to improve the robustness and interpretability of transformer-based vision models.
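To make the idea of saliency-guided attention concrete, here is a minimal sketch in plain NumPy. It is not the repository's code: the function names (`patch_saliency`, `guided_cls_attention`) and the `threshold` and `boost` parameters are hypothetical, and the additive boost is just one simple way to bias attention toward salient patches. It assumes a binary saliency mask is already available from some external saliency detector.

```python
import numpy as np


def patch_saliency(mask, patch_size, threshold=0.5):
    """Reduce an (H, W) saliency mask to a flat boolean vector, one entry per patch.

    A patch counts as salient when at least `threshold` of its pixels are salient.
    (Hypothetical helper; the threshold rule is an assumption.)
    """
    h, w = mask.shape
    if h % patch_size or w % patch_size:
        raise ValueError("mask size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split into a (gh, patch, gw, patch) grid and average inside each patch.
    coverage = mask.reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    return (coverage >= threshold).reshape(-1)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def guided_cls_attention(cls_logits, salient, boost=1.0):
    """Bias the class token's attention logits toward salient patches.

    cls_logits: (num_patches,) raw attention logits from the class token.
    salient:    (num_patches,) boolean patch-level saliency.
    boost:      additive bonus for salient patches (illustrative value).
    """
    guided = cls_logits + boost * salient.astype(cls_logits.dtype)
    return softmax(guided)


# Toy example: a 4x4 mask whose top-left 2x2 patch is salient.
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
salient = patch_saliency(mask, patch_size=2)
attn = guided_cls_attention(np.zeros(4), salient)
print(salient, attn)  # the salient patch receives the largest attention weight
```

With uniform logits, an unguided softmax would give every patch the same weight; the boost shifts attention mass toward the patch covered by the mask while keeping the weights normalized.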
💡 Highlights
- Saliency-guided patch selection
- Optimized for fine-grained tasks
- VISIGRAPP '23 architecture
🎯 For
- Computer Vision Researchers
- Machine Learning Engineers