csinva/interpretable-embeddings

📦 Open Source Project · csinva

Transform black-box text embeddings into human-interpretable features by leveraging LLMs to answer binary questions.

The interpretable-embeddings framework addresses the black-box nature of standard transformer-based embeddings. It uses LLMs as feature extractors to build an embedding space in which each dimension corresponds to a specific, human-readable concept or binary question about the input text. This lets researchers decompose opaque vector representations into interpretable components, which helps when debugging RAG systems, analyzing neural encoding models in neuroscience, and making model behavior more transparent. The repository provides Python tools for this approach, mapping arbitrary text inputs into a space defined by semantic queries. It is aimed at tasks that need interpretability while still drawing on the capabilities of large language models, as an alternative to traditional dense embedding methods.
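
The idea of one binary question per embedding dimension can be illustrated with a minimal sketch. This is not the repository's API: the names `QUESTIONS`, `keyword_answerer`, `qa_embed`, and `explain` are hypothetical, and a toy keyword matcher stands in for the LLM call so the example runs without any model or network access.

```python
from typing import Callable, List, Sequence

# Hypothetical yes/no questions; each one defines one embedding dimension.
QUESTIONS = [
    "Does the text mention a person?",
    "Does the text describe a location?",
    "Does the text express a negative emotion?",
]

# Toy keyword lists used only by the stand-in answerer below.
_KEYWORDS = {
    QUESTIONS[0]: ("she", "he", "doctor", "teacher"),
    QUESTIONS[1]: ("paris", "city", "park"),
    QUESTIONS[2]: ("sad", "angry", "afraid"),
}


def keyword_answerer(question: str, text: str) -> bool:
    """Stand-in for an LLM: answers 'yes' if any keyword appears in the text.

    In a real pipeline this function would prompt an LLM with the question
    and the text, then parse a yes/no answer (an assumption, not shown here).
    """
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return any(keyword in words for keyword in _KEYWORDS[question])


def qa_embed(
    text: str,
    questions: Sequence[str],
    answer: Callable[[str, str], bool],
) -> List[int]:
    """Map a text to a binary vector with one entry per question."""
    return [int(answer(question, text)) for question in questions]


def explain(embedding: Sequence[int], questions: Sequence[str]) -> List[str]:
    """Return the questions answered 'yes', i.e. the active dimensions."""
    return [q for q, value in zip(questions, embedding) if value == 1]


embedding = qa_embed("The doctor was sad in Paris.", QUESTIONS, keyword_answerer)
print(embedding)  # [1, 1, 1]
print(explain(qa_embed("A sunny park.", QUESTIONS, keyword_answerer), QUESTIONS))
# ['Does the text describe a location?']
```

Because every dimension is tied to a question, any nonzero entry can be read back as a plain-language statement about the text, which is what makes the representation inspectable.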

💡Highlights

├─NeurIPS 2024 accepted research
├─Human-interpretable embedding space
└─LLM-based binary feature extraction

🎯For

├─AI Researchers
├─Data Scientists
└─Computational Neuroscientists

🔗Links

└─GitHub Repository