
GramosoftAI/GcrawlAI
📦 Open Source Project by GramosoftAI
An open-source, distributed web crawler designed to convert complex websites into clean, LLM-ready Markdown data.
GcrawlAI is a Python-based data extraction pipeline that turns raw web content into input suitable for AI models. It uses Playwright for browser automation, so it can render dynamic, JavaScript-heavy pages that scrapers relying on static HTML often miss. Crawl jobs are distributed as Celery tasks, with Redis as the message broker, which lets crawling scale across multiple workers for high-throughput operation.
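The broker-and-worker split described above can be illustrated without the real dependencies. The sketch below is not GcrawlAI's code: it swaps in the standard library's `queue.Queue` for the Redis-backed broker and plain threads for Celery workers, and `fake_fetch` is a hypothetical stub standing in for a Playwright-rendered page fetch. All function names here are assumptions made for illustration.

```python
import queue
import threading
from typing import Callable, Dict, List


def crawl_distributed(
    urls: List[str],
    fetch: Callable[[str], str],
    workers: int = 3,
) -> Dict[str, str]:
    """Fan URLs out to worker threads through a shared queue and collect results."""
    tasks: "queue.Queue[str]" = queue.Queue()  # stand-in for the message broker
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        # Each worker pulls URLs until the queue is drained.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                html = fetch(url)
            except Exception as exc:  # record the failure instead of killing the worker
                html = f"ERROR: {exc}"
            with lock:
                results[url] = html
            tasks.task_done()

    # Enqueue everything before starting workers so get_nowait() sees all tasks.
    for url in urls:
        tasks.put(url)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url: str) -> str:
    """Hypothetical fetcher; a real one would render the page in a browser."""
    return f"<html><body><h1>{url}</h1></body></html>"
```

Passing the fetcher in as a callable keeps the queueing logic independent of how pages are rendered, which is the same separation a task queue gives a browser-based crawler.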
Key technical features include a dedicated 'stealth mode' intended to reduce the risk of bot detection and IP blocking, and a real-time WebSocket interface that reports crawl progress live. Extracted content is automatically cleaned and formatted as LLM-friendly Markdown, with boilerplate and noise stripped out. This makes GcrawlAI a good fit for RAG (Retrieval-Augmented Generation) pipelines, where the data loaded into vector databases should be consistent, readable, and token-efficient. The project is modular, so developers can integrate it into existing FastAPI or Streamlit workflows with little additional work.
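To make the "clean Markdown" step concrete, here is a minimal sketch of boilerplate stripping and Markdown conversion using only Python's standard-library `html.parser`. It is an assumption about the general technique, not GcrawlAI's actual cleaner: the class name, the set of skipped tags, and the handful of supported elements (headings, paragraphs, list items) are all chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose entire content is treated as boilerplate (assumed list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADINGS) | {"p", "li"}


class MarkdownExtractor(HTMLParser):
    """Collects headings, paragraphs, and list items as Markdown blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.buf = []        # text fragments of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.skip_depth = 0  # > 0 while inside a boilerplate tag

    def _flush(self) -> None:
        # Collapse runs of whitespace and emit the block if it has any text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buf = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to Markdown, dropping boilerplate regions."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return "\n\n".join(parser.blocks)
```

Dropping navigation, footers, and scripts before conversion is what keeps token counts down when the resulting Markdown is chunked and embedded for retrieval.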
💡 Highlights
- Distributed crawling with Celery
- Stealth mode for bot evasion
- Clean Markdown output for LLMs
🎯 For
- AI Engineers
- Data Scientists
- RAG Pipeline Developers