---
title: Multimodal RAG Video Chat
emoji: 🎬
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.44.0
app_file: src/app.py
pinned: false
license: apache-2.0
---

# 🎬 Multimodal RAG Video Chat

An interactive application that lets you chat with YouTube videos using multimodal retrieval-augmented generation (RAG): ask questions in natural language and get answers grounded in the video's frames and transcript.

## Features

- **Video Processing**: Automatically downloads and processes YouTube videos
- **Multimodal Embeddings**: Uses BridgeTower for joint text-image understanding (see the sketch after this list)
- **Vector Storage**: Stores embeddings in LanceDB for efficient retrieval
- **Visual Language Model**: Powered by Pixtral for intelligent responses
- **Interactive Interface**: Chat UI that displays the retrieved video frames alongside each answer
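
To illustrate what the embedding step looks like, here is a minimal sketch using a public BridgeTower checkpoint from the Hugging Face Hub. The checkpoint name, frame file, and caption are illustrative assumptions; the app's actual embedding code lives in `src/app.py`.

```python
# Minimal sketch: embed one video frame together with its transcript snippet.
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"  # public checkpoint
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

frame = Image.open("frame_0001.png")                # hypothetical frame file
caption = "a presenter pointing at a whiteboard"    # hypothetical transcript text

inputs = processor(images=frame, text=caption, return_tensors="pt")
outputs = model(**inputs)
vector = outputs.cross_embeds.squeeze(0)  # joint text-image embedding vector
```

The cross-modal embedding (rather than separate text and image vectors) is what lets frame-plus-transcript retrieval run against a single index.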

## How to Use

1. **Load Video**: Paste a YouTube URL and click "Process Video"
2. **Chat**: Ask questions about the video content (a vector search runs under the hood; see the sketch after this list)
3. **View Results**: See relevant video frames alongside the AI's responses
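
Under the hood, each chat turn is a vector search over the stored embeddings. Here is a toy LanceDB sketch of that store-and-search cycle; the table name, fields, and three-dimensional vectors are all illustrative:

```python
# Toy sketch: store per-frame embeddings, then retrieve the nearest segments.
import lancedb

db = lancedb.connect("data/lancedb")      # hypothetical on-disk location
table = db.create_table(
    "video_segments",                     # hypothetical table name
    data=[
        {"vector": [0.1, 0.9, 0.0], "frame": "frame_0001.png", "text": "intro"},
        {"vector": [0.8, 0.1, 0.2], "frame": "frame_0042.png", "text": "demo"},
    ],
    mode="overwrite",                     # replace the table on reruns
)

# A chat question is embedded the same way, then the closest segments come back.
hits = table.search([0.7, 0.2, 0.1]).limit(2).to_list()
for hit in hits:
    print(hit["frame"], hit["text"])
```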

## Technology Stack

- **Frontend**: Gradio
- **Embeddings**: BridgeTower (multimodal)
- **Vector DB**: LanceDB
- **LLM**: Pixtral-12B (Mistral AI)
- **Video Processing**: OpenCV, pytube (see the download/frame-extraction sketch below)
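
For a sense of the video-processing step, downloading a video and sampling frames with pytube and OpenCV might look like this sketch (the URL, filenames, and one-frame-per-second rate are assumptions, not the app's exact settings):

```python
# Sketch: download a YouTube video and save roughly one frame per second.
import cv2
from pytube import YouTube

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"   # example URL
stream = YouTube(url).streams.filter(
    progressive=True, file_extension="mp4"
).first()
stream.download(filename="video.mp4")

cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30                 # guard against 0 fps
count = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if count % int(fps) == 0:                         # keep ~1 frame/second
        cv2.imwrite(f"frame_{saved:04d}.png", frame)
        saved += 1
    count += 1
cap.release()
```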

## Setup

You'll need a Mistral AI API key to use this application. Add it to the Space as a secret named `MISTRAL_API_KEY`.
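
Spaces expose secrets to the running app as environment variables, so inside the code the key can be read and passed to the client like this (a sketch; the exact setup in `src/app.py` may differ):

```python
import os
from mistralai import Mistral

# The secret configured in the Space settings shows up as an env var.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
```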

## Architecture

The system follows a RAG (Retrieval-Augmented Generation) approach:

1. Videos are processed into frames and transcripts
2. Multimodal embeddings are created and stored
3. User queries retrieve relevant video segments
4. The visual language model generates contextual responses (sketched below)
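
To make step 4 concrete, here is a minimal sketch of asking Pixtral about a retrieved frame through the Mistral chat API. The model identifier follows Mistral's published Pixtral name; the base64 data-URL image and the prompt strings are assumptions about how the app assembles its request:

```python
# Sketch: send the user's question plus one retrieved frame to Pixtral.
import base64
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

with open("frame_0001.png", "rb") as f:               # a retrieved frame
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is happening at this point in the video?"},
            {"type": "image_url",
             "image_url": f"data:image/png;base64,{frame_b64}"},
        ],
    }],
)
print(response.choices[0].message.content)
```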

Check out the GitHub repository for more details.