gradio>=4.0.0
requests>=2.31.0
python-multipart>=0.0.6
# pathlib, re, hashlib, zipfile, io, datetime, mimetypes, fnmatch, base64, and json
# are Python standard-library modules; they ship with the interpreter and need no entry here.

This Gradio application converts GitHub or Hugging Face repositories into text files suitable for LLM training. Its key features are listed below.

## Main Features:

1. **Multi-Platform Support**: Works with both GitHub and Hugging Face repositories
2. **Smart File Filtering**: Include/exclude patterns to process only relevant files
3. **Token Estimation**: Provides rough token counts for training planning
4. **Chunked Output**: Splits large repositories into manageable chunks
5. **Comment Removal**: Optional comment stripping for cleaner training data
6. **Binary File Detection**: Automatically skips binary files
7. **Language Detection**: Identifies programming languages for better organization
8. **Progress Tracking**: Real-time progress updates during processing
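
Binary detection, token estimation, and chunking can be illustrated with a minimal sketch. This is not the application's actual code: the function names, the NUL-byte heuristic, and the ratio of four characters per token are all assumptions chosen for illustration.

```python
def looks_binary(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic (assumed): a NUL byte in the first bytes marks a file as binary."""
    return b"\x00" in data[:sample_size]


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at line ends where possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        # A single line longer than max_chars is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append("".join(current))
                current, size = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when the next line would overflow the current one.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Breaking at line boundaries keeps statements intact inside a chunk. A real tokenizer would give more accurate counts than the character ratio, which is only useful for rough planning.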

## Advanced Options:

- File size limits to prevent processing huge files
- Pattern-based filtering (glob patterns supported)
- Chunk size customization
- Metadata inclusion
- Private repository support with tokens
- ZIP download option
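
As a rough illustration of how the size limit and glob filtering could combine, here is a sketch built on the standard-library `fnmatch` module. The function name `should_process`, its parameters, and the 1 MB default limit are hypothetical and not taken from the application.

```python
from fnmatch import fnmatchcase


def should_process(
    path: str,
    size_bytes: int,
    include: tuple[str, ...] = ("*",),
    exclude: tuple[str, ...] = (),
    max_size_bytes: int = 1_000_000,  # assumed default limit
) -> bool:
    """Return True if a file passes the size limit and the glob filters."""
    # Skip files above the size limit before looking at patterns.
    if size_bytes > max_size_bytes:
        return False
    # Exclude patterns take priority over include patterns.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(path, pattern) for pattern in include)
```

Note that `*` in `fnmatch` patterns also matches `/`, so `*.py` matches `src/app.py`. `fnmatchcase` is used here so that matching is case-sensitive on every platform.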

## Output Features:

- Repository metadata and statistics
- File headers with path, size, and language info
- Token and character counts
- Formatted, readable output structure
- Error handling and status messages
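
One way a per-file header with path, size, and language might look is sketched below. The header layout, the `EXTENSION_LANGUAGES` table, and both function names are assumptions made for illustration; the application's real output format is not shown in this document.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-language table; a real tool would cover far more types.
EXTENSION_LANGUAGES = {
    ".py": "Python",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".rs": "Rust",
    ".md": "Markdown",
}


def detect_language(path: str) -> str:
    """Guess the language from the file extension, falling back to 'Unknown'."""
    return EXTENSION_LANGUAGES.get(PurePosixPath(path).suffix.lower(), "Unknown")


def format_file_section(path: str, content: str) -> str:
    """Prefix file content with a header giving path, size, language, and character count."""
    separator = "=" * 60
    size_bytes = len(content.encode("utf-8"))
    header = (
        f"{separator}\n"
        f"File: {path}\n"
        f"Size: {size_bytes} bytes\n"
        f"Language: {detect_language(path)}\n"
        f"Characters: {len(content)}\n"
        f"{separator}\n"
    )
    # Ensure every section ends with a newline so sections do not run together.
    return header + content + ("" if content.endswith("\n") else "\n")
```

Calling `format_file_section("src/app.py", source_text)` for each accepted file and concatenating the results would yield one text document per repository or per chunk.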

The application is designed to handle repositories of various sizes and reports status messages and statistics about the processed content. It is intended for preparing code repositories for LLM fine-tuning or analysis.