doggdad committed on
Commit 2e0232c · verified · 1 Parent(s): f229821

Update README.md

Files changed (1)
  1. README.md +39 -39
README.md CHANGED
@@ -1,51 +1,51 @@
- # Multimodal RAG with BridgeTower Model
-
- ## Description
- This repository contains the complete code and tutorials for implementing a multimodal retrieval-augmented generation (RAG) system capable of processing, storing, and retrieving video content. The system uses BridgeTower for multimodal embeddings, LanceDB as the vector store, and Pixtral as the conversation LLM.
-
- ## Installation
- To install the necessary dependencies, run the following command:
- ```bash
- pip install -r requirements.txt
- ```
-
- ## Tutorials
- 1. `mm_rag.ipynb`: Complete end-to-end implementation of the multimodal RAG system
- 2. `embedding_creation.ipynb`: Deep dive into generating multimodal embeddings using BridgeTower
- 3. `vector_store.ipynb`: Detailed guide on setting up and populating LanceDB for vector storage
- 4. `preprocessing_video.ipynb`: Comprehensive coverage of video preprocessing techniques, including:
-
- * Frame extraction
- * Transcript processing
- * Handling videos without transcripts
- * Transcript optimization strategies
-
- ## Required API Keys
- You'll need to set up the following API keys:
-
- `MISTRAL_API_KEY` for Pixtral model access
-
- ## Data
- The tutorial uses a sample video about a space expedition. You can replace it with any video of your choice, but make sure to:
-
- * Include a transcript file (.vtt format)
- * Or generate transcripts using Whisper
- * Or use vision language models for caption generation
-
- ## Contributing
- Contributions are welcome! Some areas for improvement include:
-
- * Adding chat history support
- * Prompt engineering refinements
- * Alternative retrieval strategies
- * Testing different VLMs and embedding models
-
- To contribute (for example, to try a different vision language model and compare performance):
- 1. Fork the repository.
- 2. Create a new branch (`git checkout -b feature-branch`).
- 3. Commit your changes (`git commit -am 'Add new feature'`).
- 4. Push to the branch (`git push origin feature-branch`).
- 5. Create a new Pull Request.
-
- ## License
- This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 
+ ---
+ title: Multimodal RAG Video Chat
+ emoji: 🎬
+ colorFrom: blue
+ colorTo: red
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # 🎬 Multimodal RAG Video Chat
 
+
+ An interactive application that lets you chat with YouTube videos using multimodal retrieval-augmented generation (RAG).
+
+ ## Features
+
+ - **Video Processing**: Automatically downloads and processes YouTube videos
+ - **Multimodal Embeddings**: Uses BridgeTower for joint text-image understanding
+ - **Vector Storage**: Stores embeddings in LanceDB for efficient retrieval
+ - **Visual Language Model**: Powered by Pixtral for intelligent responses
+ - **Interactive Interface**: Chat interface that displays the retrieved video frames
+
+ ## How to Use
+
+ 1. **Load Video**: Paste a YouTube URL and click "Process Video"
+ 2. **Chat**: Ask questions about the video content
+ 3. **View Results**: See relevant video frames alongside AI responses
+
+ ## Technology Stack
+
+ - **Frontend**: Gradio
+ - **Embeddings**: BridgeTower (multimodal)
+ - **Vector DB**: LanceDB
+ - **LLM**: Pixtral-12B (Mistral AI)
+ - **Video Processing**: OpenCV, pytube
+
+ ## Setup
+
+ You'll need a Mistral AI API key to use this application. Add it as a secret named `MISTRAL_API_KEY`.
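As a minimal, hypothetical sketch of how the app might check for that secret at startup (Hugging Face Spaces exposes repository secrets as environment variables; `have_mistral_key` is an illustrative name, not taken from the actual app.py):

```python
import os

def have_mistral_key(env=os.environ):
    # Hugging Face Spaces exposes repository secrets as environment variables.
    return bool(env.get("MISTRAL_API_KEY"))

# Fail fast with a clear message instead of erroring mid-conversation.
if not have_mistral_key():
    print("Warning: MISTRAL_API_KEY is not set; Pixtral calls will fail.")
```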
 
 
 
+
+ ## Architecture
+
+ The system follows a RAG (Retrieval-Augmented Generation) approach:
+ 1. Videos are processed into frames and transcripts
+ 2. Multimodal embeddings are created and stored
+ 3. User queries retrieve relevant video segments
+ 4. The visual language model generates contextual responses
+
+ Check out the [GitHub repository](https://github.com/daddyofadoggy/multimodal-rag) for more details.
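The four architecture steps above can be illustrated end to end with a toy sketch. Everything here is a stand-in: a bag-of-words vector replaces a BridgeTower embedding, a Python list replaces LanceDB, a template string replaces Pixtral, and the segment texts are invented for illustration.

```python
import numpy as np

# Tiny fixed vocabulary; real multimodal embeddings are dense and learned.
VOCAB = ["rocket", "launch", "pad", "astronauts", "boarding", "liftoff", "ascent"]

def embed(text):
    # Bag-of-words stand-in for a BridgeTower forward pass, L2-normalized.
    v = np.array([float(w in text.lower()) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v

# Step 1: the video has been processed into transcript segments
# (frame images are omitted from this sketch).
segments = ["Rocket on the launch pad", "Astronauts boarding", "Liftoff and ascent"]

# Step 2: embeddings are created and stored (list stands in for LanceDB).
store = [(embed(s), s) for s in segments]

# Step 3: a user query retrieves the most similar segment (cosine similarity).
def retrieve(query):
    q = embed(query)
    return max(store, key=lambda pair: float(pair[0] @ q))[1]

# Step 4: a (stubbed) visual language model answers using the retrieved context.
def answer(query):
    return f"Based on the segment '{retrieve(query)}': ..."

print(retrieve("When is liftoff?"))  # -> Liftoff and ascent
```

A real implementation would retrieve the top-k frame/transcript pairs and pass both the images and text to the model, but the retrieve-then-generate shape is the same.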