---
title: Multimodal RAG Video Chat
emoji: 🎬
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.44.0
app_file: src/app.py
pinned: false
license: apache-2.0
---

# 🎬 Multimodal RAG Video Chat

An interactive application that lets you chat with YouTube videos using multimodal retrieval-augmented generation (RAG): ask questions in natural language and get answers grounded in the video's frames and transcript.

## Features

- **Video Processing**: Automatically downloads and processes YouTube videos
- **Multimodal Embeddings**: Uses BridgeTower for joint text-image understanding (see the sketch after this list)
- **Vector Storage**: Stores embeddings in LanceDB for efficient retrieval
- **Visual Language Model**: Powered by Pixtral for intelligent responses
- **Interactive Interface**: Chat UI that displays the retrieved video frames alongside each answer
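
To illustrate what the embedding step looks like, here is a minimal sketch using a public BridgeTower checkpoint from the Hugging Face Hub. The checkpoint name, frame file, and caption are illustrative assumptions; the app's actual embedding code lives in `src/app.py`.

```python
# Minimal sketch: embed one video frame together with its transcript snippet.
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"  # public checkpoint
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

frame = Image.open("frame_0001.png")                # hypothetical frame file
caption = "a presenter pointing at a whiteboard"    # hypothetical transcript text

inputs = processor(images=frame, text=caption, return_tensors="pt")
outputs = model(**inputs)
vector = outputs.cross_embeds.squeeze(0)  # joint text-image embedding vector
```

The cross-modal embedding (rather than separate text and image vectors) is what lets frame-plus-transcript retrieval run against a single index.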

## How to Use

1. **Load Video**: Paste a YouTube URL and click "Process Video"
2. **Chat**: Ask questions about the video content (a vector search runs under the hood; see the sketch after this list)
3. **View Results**: See relevant video frames alongside the AI's responses
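
Under the hood, each chat turn is a vector search over the stored embeddings. Here is a toy LanceDB sketch of that store-and-search cycle; the table name, fields, and three-dimensional vectors are all illustrative:

```python
# Toy sketch: store per-frame embeddings, then retrieve the nearest segments.
import lancedb

db = lancedb.connect("data/lancedb")      # hypothetical on-disk location
table = db.create_table(
    "video_segments",                     # hypothetical table name
    data=[
        {"vector": [0.1, 0.9, 0.0], "frame": "frame_0001.png", "text": "intro"},
        {"vector": [0.8, 0.1, 0.2], "frame": "frame_0042.png", "text": "demo"},
    ],
    mode="overwrite",                     # replace the table on reruns
)

# A chat question is embedded the same way, then the closest segments come back.
hits = table.search([0.7, 0.2, 0.1]).limit(2).to_list()
for hit in hits:
    print(hit["frame"], hit["text"])
```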

## Technology Stack

- **Frontend**: Gradio
- **Embeddings**: BridgeTower (multimodal)
- **Vector DB**: LanceDB
- **LLM**: Pixtral-12B (Mistral AI)
- **Video Processing**: OpenCV, pytube (see the download/frame-extraction sketch below)
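
For a sense of the video-processing step, downloading a video and sampling frames with pytube and OpenCV might look like this sketch (the URL, filenames, and one-frame-per-second rate are assumptions, not the app's exact settings):

```python
# Sketch: download a YouTube video and save roughly one frame per second.
import cv2
from pytube import YouTube

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"   # example URL
stream = YouTube(url).streams.filter(
    progressive=True, file_extension="mp4"
).first()
stream.download(filename="video.mp4")

cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30                 # guard against 0 fps
count = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if count % int(fps) == 0:                         # keep ~1 frame/second
        cv2.imwrite(f"frame_{saved:04d}.png", frame)
        saved += 1
    count += 1
cap.release()
```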

## Setup

You'll need a Mistral AI API key to use this application. Add it to the Space as a secret named `MISTRAL_API_KEY`.
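
Spaces expose secrets to the running app as environment variables, so inside the code the key can be read and passed to the client like this (a sketch; the exact setup in `src/app.py` may differ):

```python
import os
from mistralai import Mistral

# The secret configured in the Space settings shows up as an env var.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
```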

## Architecture

The system follows a RAG (Retrieval-Augmented Generation) approach:

1. Videos are processed into frames and transcripts
2. Multimodal embeddings are created and stored
3. User queries retrieve relevant video segments
4. The visual language model generates contextual responses (sketched below)
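
To make step 4 concrete, here is a minimal sketch of asking Pixtral about a retrieved frame through the Mistral chat API. The model identifier follows Mistral's published Pixtral name; the base64 data-URL image and the prompt strings are assumptions about how the app assembles its request:

```python
# Sketch: send the user's question plus one retrieved frame to Pixtral.
import base64
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

with open("frame_0001.png", "rb") as f:               # a retrieved frame
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is happening at this point in the video?"},
            {"type": "image_url",
             "image_url": f"data:image/png;base64,{frame_b64}"},
        ],
    }],
)
print(response.choices[0].message.content)
```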

Check out the GitHub repository for more details.