VisionPro / README.md
subhash4face's picture
Added social media post
4a0fae8 verified

A newer version of the Gradio SDK is available: 6.1.0

Upgrade
metadata
title: AI Assistant For Visually Impaired
emoji: πŸ’¬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - inference-api
license: mit
short_description: AI-Assistant-for-Visually-Impaired
tags:
  - mcp-in-action-track-consumer
  - building-mcp-track-consumer

Accessibility Voice Agent β€” MCP Tools

Track: mcp-in-action-track-consumer

Team: Team

Author: @subhash4face β€” Subhash Mankunnu

Author: Athira AR

Model Context Protocol (MCP) + Gradio 6 + HF Inference + ElevenLabs

A fully accessible, voice-driven AI assistant demonstrating how MCP tools can enable speech-to-text, image understanding, and text-to-speech workflows for low-vision and visually impaired users.

This project showcases a real-world use case of MCP tools working together inside an agent-style UI.


πŸ”„ Workflow Diagram β€” MCP Tools

Workflow Diagram

πŸš€ Demo Video

πŸ‘‰ https://youtu.be/af4Y89g2HPE

πŸš€ Social Media Post - LinkedIn

πŸ‘‰ https://www.linkedin.com/posts/subhashmankunnu_hugginface-share-7400924735989010432-a9sH?utm_source=share&utm_medium=member_desktop&rcm=ACoAAASVxnsB9ojyfy-Kef3IWvBPf4c3pUSOaWw


🌟 Key Features

πŸ”Š Text-to-Speech (TTS) via ElevenLabs

MCP Tool: speak_text

  • Converts any assistant message to natural speech
  • Returns base64 audio + WAV playback
  • Helps low-vision users receive spoken responses

🎀 Speech-to-Text (STT) via Whisper / Local fallback

MCP Tool: transcribe_audio

  • OpenAI Whisper STT or local fallback
  • Great for hands-free usage
  • Tool-call log shows backend + duration

πŸ–Ό Image Description via OpenAI / Gemini / HF Inference

MCP Tool: describe_image

  • Multimodal accessibility
  • Describes any uploaded image in plain language
  • Hugging Face Inference API used instead of local BLIP

🧩 Fully MCP-powered

Every capability is wrapped as an MCP tool, making this app a template for:

  • Agents
  • Assistive technologies
  • Multimodal accessibility apps
  • Voice-driven workflows
  • Cross-backend tool orchestration

πŸ’‘ Real Use Case: Accessibility

Designed for:

  • Low-vision users
  • Voice-interface users
  • Anyone needing automated image descriptions
  • Hands-free workflows
  • Assistive technology research

πŸ›  Tech Stack

  • MCP Server (Python)
  • Gradio 6
  • OpenAI Whisper (STT)
  • ElevenLabs (TTS)
  • Gemini Vision (optional)
  • Hugging Face Inference API (image captioning)
  • Python

🏁 How to Run Locally

pip install -r requirements.txt
python app.py