doggdad committed on
Commit 2e0232c · verified · 1 Parent(s): f229821

Update README.md

Files changed (1)
  1. README.md +39 -39
README.md CHANGED
@@ -1,51 +1,51 @@
- # Multimodal RAG with BridgeTower Model
-
- ## Description
- This repository contains the complete code and tutorials for implementing a multimodal retrieval-augmented generation (RAG) system capable of processing, storing, and retrieving video content. The system uses BridgeTower for multimodal embeddings, LanceDB as the vector store, and Pixtral as the conversation LLM.
-
- ## Installation
- To install the necessary dependencies, run the following command:
- ```bash
- pip install -r requirements.txt
- ```
-
- ## Tutorials
- 1. `mm_rag.ipynb`: Complete end-to-end implementation of the multimodal RAG system
- 2. `embedding_creation.ipynb`: Deep dive into generating multimodal embeddings using BridgeTower
- 3. `vector_store.ipynb`: Detailed guide on setting up and populating LanceDB for vector storage
- 4. `preprocessing_video.ipynb`: Comprehensive coverage of video preprocessing techniques, including:
-
- * Frame extraction
- * Transcript processing
- * Handling videos without transcripts
- * Transcript optimization strategies
-
- ## Required API Keys
- You'll need to set up the following API keys:
-
- `MISTRAL_API_KEY` for Pixtral model access
-
- ## Data
- The tutorial uses a sample video about a space expedition. You can replace it with any video of your choice, but make sure to:
-
- * Include a transcript file (.vtt format)
- * Or generate transcripts using Whisper
- * Or use vision language models for caption generation
-
- ## Contributing
- Contributions are welcome! Some areas for improvement include:
-
- * Adding chat history support
- * Prompt engineering refinements
- * Alternative retrieval strategies
- * Testing different VLMs and embedding models
-
- To contribute (for example, to try a different vision language model and compare performance):
- 1. Fork the repository.
- 2. Create a new branch (`git checkout -b feature-branch`).
- 3. Commit your changes (`git commit -am 'Add new feature'`).
- 4. Push to the branch (`git push origin feature-branch`).
- 5. Create a new Pull Request.
-
- ## License
- This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 
+ ---
+ title: Multimodal RAG Video Chat
+ emoji: 🎬
+ colorFrom: blue
+ colorTo: red
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # 🎬 Multimodal RAG Video Chat
 
+
+ An interactive application that lets you chat with YouTube videos using multimodal retrieval-augmented generation (RAG).
+
+ ## Features
+
+ - **Video Processing**: Automatically downloads and processes YouTube videos
+ - **Multimodal Embeddings**: Uses BridgeTower for joint text-image understanding
+ - **Vector Storage**: Stores embeddings in LanceDB for efficient retrieval
+ - **Visual Language Model**: Powered by Pixtral for intelligent responses
+ - **Interactive Interface**: Chat interface that displays the retrieved video frames
+
+ ## How to Use
+
+ 1. **Load Video**: Paste a YouTube URL and click "Process Video"
+ 2. **Chat**: Ask questions about the video content
+ 3. **View Results**: See relevant video frames alongside AI responses
+
+ ## Technology Stack
+
+ - **Frontend**: Gradio
+ - **Embeddings**: BridgeTower (multimodal)
+ - **Vector DB**: LanceDB
+ - **LLM**: Pixtral-12B (Mistral AI)
+ - **Video Processing**: OpenCV, pytube
+
+ ## Setup
+
+ You'll need a Mistral AI API key to use this application. Add it as a secret named `MISTRAL_API_KEY`.
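As a minimal, hypothetical sketch of how the app might check for that secret at startup (Hugging Face Spaces exposes repository secrets as environment variables; `have_mistral_key` is an illustrative name, not taken from the actual app.py):

```python
import os

def have_mistral_key(env=os.environ):
    # Hugging Face Spaces exposes repository secrets as environment variables.
    return bool(env.get("MISTRAL_API_KEY"))

# Fail fast with a clear message instead of erroring mid-conversation.
if not have_mistral_key():
    print("Warning: MISTRAL_API_KEY is not set; Pixtral calls will fail.")
```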
 
 
 
+
+ ## Architecture
+
+ The system follows a RAG (Retrieval-Augmented Generation) approach:
+ 1. Videos are processed into frames and transcripts
+ 2. Multimodal embeddings are created and stored
+ 3. User queries retrieve relevant video segments
+ 4. The visual language model generates contextual responses
+
+ Check out the [GitHub repository](https://github.com/daddyofadoggy/multimodal-rag) for more details.
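The four architecture steps above can be illustrated end to end with a toy sketch. Everything here is a stand-in: a bag-of-words vector replaces a BridgeTower embedding, a Python list replaces LanceDB, a template string replaces Pixtral, and the segment texts are invented for illustration.

```python
import numpy as np

# Tiny fixed vocabulary; real multimodal embeddings are dense and learned.
VOCAB = ["rocket", "launch", "pad", "astronauts", "boarding", "liftoff", "ascent"]

def embed(text):
    # Bag-of-words stand-in for a BridgeTower forward pass, L2-normalized.
    v = np.array([float(w in text.lower()) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v

# Step 1: the video has been processed into transcript segments
# (frame images are omitted from this sketch).
segments = ["Rocket on the launch pad", "Astronauts boarding", "Liftoff and ascent"]

# Step 2: embeddings are created and stored (list stands in for LanceDB).
store = [(embed(s), s) for s in segments]

# Step 3: a user query retrieves the most similar segment (cosine similarity).
def retrieve(query):
    q = embed(query)
    return max(store, key=lambda pair: float(pair[0] @ q))[1]

# Step 4: a (stubbed) visual language model answers using the retrieved context.
def answer(query):
    return f"Based on the segment '{retrieve(query)}': ..."

print(retrieve("When is liftoff?"))  # -> Liftoff and ascent
```

A real implementation would retrieve the top-k frame/transcript pairs and pass both the images and text to the model, but the retrieve-then-generate shape is the same.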