metadata
title: Multimodal RAG Video Chat
emoji: 🎬
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.44.0
app_file: src/app.py
pinned: false
license: apache-2.0
🎬 Multimodal RAG Video Chat
An interactive application that lets you chat with YouTube videos using multimodal retrieval-augmented generation (RAG).
Features
- Video Processing: Automatically downloads and processes YouTube videos
- Multimodal Embeddings: Uses BridgeTower for joint text-image understanding
- Vector Storage: Stores embeddings in LanceDB for efficient retrieval
- Visual Language Model: Powered by Pixtral for intelligent responses
- Interactive Interface: Chat interface that displays the retrieved video frames alongside each answer
How to Use
- Load Video: Paste a YouTube URL and click "Process Video"
- Chat: Ask questions about the video content
- View Results: See relevant video frames alongside AI responses
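Since the first step depends on a valid YouTube URL, it can help to check the link before processing. The snippet below is a minimal sketch of such a check, not the app's actual code: `extract_video_id` is a hypothetical helper, and it only handles the common `watch?v=` and `youtu.be` URL shapes.

```python
from urllib.parse import urlparse, parse_qs

# Hostnames treated as YouTube "watch" pages (an assumption; the app may accept more).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if the URL is not recognized.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS and parsed.path == "/watch":
        # Standard links carry the ID in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""

    return video_id or None
```

A URL that returns `None` here can be rejected with a friendly message instead of failing later during download.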
Technology Stack
- Frontend: Gradio
- Embeddings: BridgeTower (multimodal)
- Vector DB: LanceDB
- LLM: Pixtral-12B (Mistral AI)
- Video Processing: OpenCV, pytube
Setup
You'll need a Mistral AI API key to use this application. Add it as a secret named MISTRAL_API_KEY.
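Secrets on Hugging Face Spaces are exposed to the app as environment variables, so the key can be read at startup. The following is a minimal sketch of that lookup; `load_mistral_api_key` is a hypothetical helper name and is not taken from the app's source.

```python
import os


def load_mistral_api_key(env=os.environ):
    """Return the Mistral API key from the environment, or raise a clear error.

    Hypothetical helper; `env` is injectable so the lookup is easy to test.
    """
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set. Add it as a secret in the Space settings "
            "or export it in your shell before launching the app."
        )
    return key
```

Failing early with an explicit message tends to be easier to debug than an authentication error surfacing on the first chat request.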
Architecture
The system follows a RAG (Retrieval-Augmented Generation) approach:
- Videos are processed into frames and transcripts
- Multimodal embeddings are created and stored
- User queries retrieve relevant video segments
- Visual language model generates contextual responses
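The retrieval step above can be sketched with toy data. This is an illustration under stated assumptions, not the app's implementation: the three-dimensional vectors stand in for BridgeTower embeddings, the in-memory list stands in for a LanceDB table, and `Segment`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One video segment: a frame, its transcript text, and a joint embedding."""
    timestamp_s: float
    transcript: str
    frame_path: str
    embedding: list  # toy stand-in for a multimodal embedding vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, segments, k=2):
    """Return the k segments whose embeddings are most similar to the query."""
    ranked = sorted(
        segments,
        key=lambda s: cosine(query_embedding, s.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble the text part of the prompt; retrieved frames would be attached separately."""
    context = "\n".join(f"[{h.timestamp_s:.0f}s] {h.transcript}" for h in hits)
    return f"Context from the video:\n{context}\n\nQuestion: {question}"


# Toy index: in the real system these rows would live in the vector database.
segments = [
    Segment(0.0, "Welcome to the talk.", "frames/0.jpg", [1.0, 0.0, 0.0]),
    Segment(30.0, "Here is the rocket launch.", "frames/30.jpg", [0.0, 1.0, 0.0]),
    Segment(60.0, "The rocket reaches orbit.", "frames/60.jpg", [0.0, 0.8, 0.6]),
]

hits = retrieve([0.0, 1.0, 0.1], segments, k=2)
prompt = build_prompt("What happens to the rocket?", hits)
```

In this sketch the retrieved transcript lines form the text context, and the matching `frame_path` values are what a chat interface would display next to the model's answer.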
Check out the GitHub repository for more details.