---
title: Multimodal RAG Video Chat
emoji: 🎬
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.44.0
app_file: src/app.py
pinned: false
license: apache-2.0
---

# 🎬 Multimodal RAG Video Chat

An interactive application for chatting with YouTube videos using multimodal retrieval-augmented generation (RAG).

## Features

- **Video Processing**: Automatically downloads and processes YouTube videos
- **Multimodal Embeddings**: Uses BridgeTower for joint text-image understanding
- **Vector Storage**: Stores embeddings in LanceDB for efficient retrieval
- **Visual Language Model**: Powered by Pixtral for intelligent responses
- **Interactive Interface**: Chat interface that displays the retrieved video frames

## How to Use

1. **Load Video**: Paste a YouTube URL and click "Process Video"
2. **Chat**: Ask questions about the video content
3. **View Results**: See relevant video frames alongside AI responses

## Technology Stack

- **Frontend**: Gradio
- **Embeddings**: BridgeTower (multimodal)
- **Vector DB**: LanceDB
- **LLM**: Pixtral-12B (Mistral AI)
- **Video Processing**: OpenCV, pytube

## Setup

You'll need a Mistral AI API key to use this application. Add it in the Space settings as a secret named `MISTRAL_API_KEY`.
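
Hugging Face Spaces expose secrets to the running app as environment variables, so the key can be read at startup. The snippet below is a minimal sketch of that lookup; the helper name `get_mistral_api_key` is hypothetical and is not taken from this project's code.

```python
import os


def get_mistral_api_key(env=os.environ):
    """Return the Mistral API key, or raise a clear error if it is missing.

    `env` defaults to the process environment; passing a plain dict makes
    the function easy to test. (Hypothetical helper, not from src/app.py.)
    """
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set. Add it as a secret in the Space settings."
        )
    return key
```

Failing early with an explicit message is usually easier to debug than letting the first model request fail with an authentication error.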

## Architecture

The system follows a RAG approach:

1. Videos are processed into frames and transcripts
2. Multimodal embeddings are created and stored
3. User queries retrieve relevant video segments
4. The visual language model generates contextual responses
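
The retrieval step (step 3) and the prompt assembly that feeds step 4 can be illustrated with a small, self-contained sketch. It is not the application's code: the real app uses BridgeTower embeddings and LanceDB, whereas this example assumes hand-written three-dimensional vectors and an in-memory list, and the names `Segment`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    frame_path: str   # hypothetical path to an extracted frame
    transcript: str   # transcript text aligned with that frame
    embedding: list   # toy vector standing in for a multimodal embedding


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, segments, k=2):
    """Return the k segments most similar to the query embedding."""
    ranked = sorted(
        segments,
        key=lambda s: cosine_similarity(query_embedding, s.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Combine retrieved transcripts and the question into one text prompt."""
    context = "\n".join(f"- {s.transcript}" for s in retrieved)
    return f"Context from the video:\n{context}\n\nQuestion: {question}"


# Toy data: in the real pipeline these vectors would come from the embedding model.
segments = [
    Segment("frames/0001.jpg", "The speaker introduces the topic.", [1.0, 0.0, 0.0]),
    Segment("frames/0002.jpg", "A diagram of the pipeline is shown.", [0.0, 1.0, 0.0]),
    Segment("frames/0003.jpg", "Closing remarks and credits.", [0.0, 0.0, 1.0]),
]
query_embedding = [0.1, 0.9, 0.2]  # assumed embedding of the user's question

top = retrieve(query_embedding, segments, k=2)
prompt = build_prompt("What does the diagram show?", top)
```

In the full system, the frames referenced by the retrieved segments would be passed to the visual language model together with the prompt, and a vector database would replace the linear scan in `retrieve`.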

Check out the [GitHub repository](https://github.com/daddyofadoggy/multimodal-rag) for more details.