gradio>=4.0.0
requests>=2.31.0
python-multipart>=0.0.6
# pathlib, re, hashlib, zipfile, io, datetime, mimetypes, fnmatch, base64, and json
# are part of the Python standard library and must not be listed as pip requirements.

This Gradio application converts GitHub or Hugging Face repositories into text files suitable for LLM training. Its key features are:

## 🚀 Main Features:

1. **Multi-Platform Support**: Works with both GitHub and Hugging Face repositories
2. **Smart File Filtering**: Include/exclude patterns to process only relevant files
3. **Token Estimation**: Provides rough token counts for training planning
4. **Chunked Output**: Splits large repositories into manageable chunks
5. **Comment Removal**: Optional comment stripping for cleaner training data
6. **Binary File Detection**: Automatically skips binary files
7. **Language Detection**: Identifies programming languages for better organization
8. **Progress Tracking**: Real-time progress updates during processing
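
Binary detection, token estimation, and chunking from the list above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names, the NUL-byte and UTF-8 heuristic, and the 4-characters-per-token ratio are all assumptions chosen for the example.

```python
import codecs


def looks_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Heuristic (assumed): binary if the sample has a NUL byte or is not valid UTF-8."""
    sample = data[:sample_size]
    if b"\x00" in sample:
        return True
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; the chars-per-token ratio is an assumed rule of thumb."""
    return int(len(text) / chars_per_token)


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split on line boundaries into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Splitting on line boundaries keeps each source line intact, at the cost of chunks that vary somewhat in size.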

## 🛠️ Advanced Options:

- File size limits to prevent processing huge files
- Pattern-based filtering (glob patterns supported)
- Chunk size customization
- Metadata inclusion
- Private repository support with tokens
- ZIP download option
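
As an illustration of how the size limit and glob-based filtering might combine, here is a small sketch built on the standard library's `fnmatch` module. The function name `should_process`, its parameters, and the 1 MB default are hypothetical; the application's real option names and defaults may differ.

```python
from fnmatch import fnmatchcase


def should_process(
    path: str,
    size_bytes: int,
    include: list[str],
    exclude: list[str],
    max_size_bytes: int = 1_000_000,  # assumed default, for illustration only
) -> bool:
    """Decide whether a file passes the size limit and the include/exclude patterns."""
    if size_bytes > max_size_bytes:
        return False
    # Exclude patterns take precedence over include patterns in this sketch.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # An empty include list means "include everything".
    return not include or any(fnmatchcase(path, pattern) for pattern in include)
```

Note that `fnmatch`-style patterns let `*` match path separators too, so `*.py` matches `src/app.py`. `fnmatchcase` is used here so matching behaves the same on every platform.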

## 📊 Output Features:

- Repository metadata and statistics
- File headers with path, size, and language info
- Token and character counts
- Formatted, readable output structure
- Error handling and status messages
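
A per-file header carrying path, size, and language could look something like the sketch below. The header layout, the extension-to-language table, and the helper names are assumptions made for this example, not the application's actual output format.

```python
from pathlib import PurePosixPath

# Hypothetical, deliberately small extension-to-language table.
LANGUAGES = {
    ".py": "Python",
    ".js": "JavaScript",
    ".ts": "TypeScript",
    ".md": "Markdown",
}


def detect_language(path: str) -> str:
    """Guess the language from the file extension; 'Unknown' if not in the table."""
    return LANGUAGES.get(PurePosixPath(path).suffix.lower(), "Unknown")


def format_file_section(path: str, content: str) -> str:
    """Prefix file content with a header giving path, UTF-8 size, and language."""
    size = len(content.encode("utf-8"))
    header = "\n".join(
        [
            "=" * 60,
            f"File: {path}",
            f"Size: {size} bytes",
            f"Language: {detect_language(path)}",
            "=" * 60,
        ]
    )
    return f"{header}\n{content}\n"
```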

The application is designed to handle repositories of various sizes and reports status and statistics about the processed content. It is intended for preparing code repositories for LLM fine-tuning or analysis.