GitHub Data Collection with Just + GitHub CLI
This document describes the GitHub CLI-based data collection recipes for analyzing pull request comments, reviews, and collaboration patterns in the KGraph-MCP project.
Prerequisites
- GitHub CLI installed and authenticated
- jq command-line JSON processor
- Access to the target repository
Available Recipes
1. Download All PR Comments
just gh-download-pr-comments
Purpose: Downloads all pull request comments and review comments from the repository.
Output:
- JSON file in data/github/pr_comments_YYYYMMDD_HHMMSS.json
- Includes general comments, review comments, and PR metadata
Features:
- Progress tracking for large repositories
- Handles pagination automatically
- Combines PR metadata with comments
- Summary statistics at completion
2. Download Single PR Comments
just gh-download-pr-comments-single 123
Purpose: Downloads all comments for a specific PR number.
Output:
- JSON file in data/github/pr_123_comments.json
- Includes general comments, review comments, reviews, and full PR details
Use Cases:
- Analyzing specific high-activity PRs
- Debugging collaboration issues
- Creating detailed PR reports
3. Download Comments Since Date
just gh-download-pr-comments-since "2024-01-01"
Purpose: Downloads comments from PRs updated since a specific date.
Date Formats Supported:
- YYYY-MM-DD (e.g., "2024-01-01")
- YYYY-MM-DDTHH:MM:SS (e.g., "2024-01-01T12:00:00")
Use Cases:
- Incremental data collection
- Analyzing recent activity
- Sprint-specific analysis
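If you build the date argument in a script, it can help to validate it before calling the recipe. The following is a minimal sketch, assuming the recipe accepts exactly the two layouts listed above; `parse_since` is a hypothetical helper and is not part of the justfile.

```python
from datetime import datetime

# The two layouts listed under "Date Formats Supported", tried in order.
SINCE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")

def parse_since(value: str) -> datetime:
    """Parse a 'since' argument in either supported layout, or raise ValueError."""
    for fmt in SINCE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {value!r}")

print(parse_since("2024-01-01"))
print(parse_since("2024-01-01T12:00:00"))
```

A value such as "01/01/2024" raises ValueError, so a wrapper script can fail early instead of starting a download with a bad filter.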
4. Export Comments to CSV
just gh-export-pr-comments-csv
Purpose: Converts the most recent JSON download to CSV format.
CSV Columns:
- PR_Number
- PR_Title
- PR_State
- Comment_Type (general/review)
- Comment_ID
- Author
- Created_At
- Updated_At
- Body
Use Cases:
- Excel/Google Sheets analysis
- Data visualization tools
- Statistical analysis
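To illustrate how the nested JSON maps onto the flat CSV columns, here is a minimal Python sketch. It assumes the JSON layout documented under Data Structure, where general comments use `author`/`createdAt` and review comments use `user`/`created_at`. The function names are hypothetical and this is not necessarily how the recipe itself is implemented.

```python
import csv
import io

COLUMNS = ["PR_Number", "PR_Title", "PR_State", "Comment_Type", "Comment_ID",
           "Author", "Created_At", "Updated_At", "Body"]

def flatten(prs):
    """Yield one row (as a dict) per general or review comment."""
    for pr in prs:
        base = {"PR_Number": pr["number"], "PR_Title": pr["title"],
                "PR_State": pr["state"]}
        for c in pr.get("comments", []):
            yield {**base, "Comment_Type": "general", "Comment_ID": c["id"],
                   "Author": c["author"]["login"], "Created_At": c["createdAt"],
                   "Updated_At": c["updatedAt"], "Body": c["body"]}
        for c in pr.get("review_comments", []):
            # Review comments use snake_case keys and "user" instead of "author".
            yield {**base, "Comment_Type": "review", "Comment_ID": c["id"],
                   "Author": c["user"]["login"], "Created_At": c["created_at"],
                   "Updated_At": c["updated_at"], "Body": c["body"]}

def to_csv(prs) -> str:
    """Render all comments as CSV text with the documented column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(flatten(prs))
    return buf.getvalue()

sample = [{
    "number": 123, "title": "Add new feature", "state": "merged",
    "comments": [{"id": 1234567, "author": {"login": "reviewer1"},
                  "body": "This looks great!",
                  "createdAt": "2024-01-01T11:00:00Z",
                  "updatedAt": "2024-01-01T11:00:00Z"}],
    "review_comments": [{"id": 2345678, "user": {"login": "reviewer2"},
                         "body": "Consider using const instead of let here.",
                         "created_at": "2024-01-01T12:00:00Z",
                         "updated_at": "2024-01-01T12:00:00Z"}],
}]
print(to_csv(sample))
```

The `csv` module quotes bodies that contain commas, quotes, or newlines, which matters because comment text is free-form.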
5. Search PR Comments
just gh-search-pr-comments "bug fix"
Purpose: Search through downloaded comments for specific text.
Features:
- Case-insensitive search
- Searches both general and review comments
- Shows context (PR number, title, author, date)
- Regex patterns supported
Use Cases:
- Finding discussions about specific topics
- Tracking decision-making processes
- Code review pattern analysis
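The same kind of search can be reproduced in Python when you need the matches as data rather than terminal output. This is a rough sketch, assuming the JSON layout documented under Data Structure; `search_comments` is a hypothetical name and does not describe the recipe's internals.

```python
import re

def search_comments(prs, pattern):
    """Return (pr_number, pr_title, author, created, body) for each matching comment."""
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitive, regex allowed
    hits = []
    for pr in prs:
        for c in pr.get("comments", []):
            if regex.search(c["body"]):
                hits.append((pr["number"], pr["title"], c["author"]["login"],
                             c["createdAt"], c["body"]))
        for c in pr.get("review_comments", []):
            if regex.search(c["body"]):
                hits.append((pr["number"], pr["title"], c["user"]["login"],
                             c["created_at"], c["body"]))
    return hits

sample = [{
    "number": 123, "title": "Add new feature",
    "comments": [{"author": {"login": "reviewer1"}, "body": "This looks great!",
                  "createdAt": "2024-01-01T11:00:00Z"}],
    "review_comments": [{"user": {"login": "reviewer2"},
                         "body": "Consider using const instead of let here.",
                         "created_at": "2024-01-01T12:00:00Z"}],
}]

for number, title, author, created, body in search_comments(sample, r"const|let"):
    print(f"PR #{number} ({title}) {author} {created}: {body}")
```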
6. Show PR Statistics
just gh-show-pr-stats
Purpose: Generate comprehensive statistics from downloaded data.
Statistics Included:
- Total PRs, comments, and review comments
- Top commenters and reviewers
- PRs with most activity
- Activity trends by month
- Participation metrics
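For custom reports, several of these figures can be recomputed from the downloaded JSON. The sketch below assumes the JSON layout documented under Data Structure and uses a hypothetical `summarize` function; the recipe's own output may be computed differently.

```python
from collections import Counter

def summarize(prs, top=5):
    """Count comments per author, per PR, and per month (YYYY-MM)."""
    commenters, per_pr, by_month = Counter(), Counter(), Counter()
    for pr in prs:
        for c in pr.get("comments", []):
            commenters[c["author"]["login"]] += 1
            by_month[c["createdAt"][:7]] += 1  # ISO 8601 prefix is the month
            per_pr[pr["number"]] += 1
        for c in pr.get("review_comments", []):
            commenters[c["user"]["login"]] += 1
            by_month[c["created_at"][:7]] += 1
            per_pr[pr["number"]] += 1
    return {
        "total_prs": len(prs),
        "top_commenters": commenters.most_common(top),
        "most_active_prs": per_pr.most_common(top),
        "by_month": sorted(by_month.items()),
    }

sample = [
    {"number": 123,
     "comments": [{"author": {"login": "reviewer1"},
                   "createdAt": "2024-01-01T11:00:00Z"}],
     "review_comments": [{"user": {"login": "reviewer2"},
                          "created_at": "2024-01-01T12:00:00Z"}]},
    {"number": 124,
     "comments": [{"author": {"login": "reviewer1"},
                   "createdAt": "2024-02-03T09:00:00Z"}],
     "review_comments": []},
]
print(summarize(sample))
```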
Data Structure
JSON Output Format
[
  {
    "number": 123,
    "title": "Add new feature",
    "state": "merged",
    "createdAt": "2024-01-01T10:00:00Z",
    "closedAt": "2024-01-02T15:30:00Z",
    "author": {
      "login": "developer1"
    },
    "comments": [
      {
        "id": 1234567,
        "author": {
          "login": "reviewer1"
        },
        "body": "This looks great!",
        "createdAt": "2024-01-01T11:00:00Z",
        "updatedAt": "2024-01-01T11:00:00Z"
      }
    ],
    "review_comments": [
      {
        "id": 2345678,
        "user": {
          "login": "reviewer2"
        },
        "body": "Consider using const instead of let here.",
        "created_at": "2024-01-01T12:00:00Z",
        "updated_at": "2024-01-01T12:00:00Z",
        "path": "src/component.js",
        "line": 42
      }
    ]
  }
]
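A download in this format can be loaded with the Python standard library. The sketch below assumes the pr_comments_YYYYMMDD_HHMMSS.json naming used in this document, so that sorting by file name also sorts by time. `load_latest` and `count_comments` are hypothetical helpers.

```python
import json
from pathlib import Path

def load_latest(directory="data/github"):
    """Load the newest full download found in `directory`."""
    # [0-9] skips the date-filtered pr_comments_since_* files.
    files = sorted(Path(directory).glob("pr_comments_[0-9]*.json"))
    if not files:
        raise FileNotFoundError(f"no pr_comments_*.json files in {directory}")
    with files[-1].open(encoding="utf-8") as fh:
        return json.load(fh)

def count_comments(prs):
    """Return (general, review) comment totals across all PRs."""
    general = sum(len(pr.get("comments", [])) for pr in prs)
    review = sum(len(pr.get("review_comments", [])) for pr in prs)
    return general, review

# Usage, once at least one download exists:
#   prs = load_latest()
#   print(count_comments(prs))
```

Note that general comments and review comments use different key styles (`author`/`createdAt` versus `user`/`created_at`), so code that mixes the two lists needs to handle both.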
Analysis Examples
Finding Code Review Patterns
# Download recent data
just gh-download-pr-comments-since "2024-01-01"
# Search for security-related discussions
just gh-search-pr-comments "security"
# Export to CSV for analysis
just gh-export-pr-comments-csv
# Generate statistics
just gh-show-pr-stats
Team Collaboration Analysis
# Download all historical data
just gh-download-pr-comments
# View top contributors
just gh-show-pr-stats
# Search for mentoring patterns
just gh-search-pr-comments "looks good to me"
just gh-search-pr-comments "consider"
just gh-search-pr-comments "suggestion"
Quality Assurance Tracking
# Search for quality-related discussions
just gh-search-pr-comments "test"
just gh-search-pr-comments "bug"
just gh-search-pr-comments "performance"
# Find PRs with extensive review activity
# -A 10 also prints the lines after the matching heading, not just the heading itself
just gh-show-pr-stats | grep -A 10 "PRs with Most Comments"
Automation and CI/CD
Scheduled Data Collection
Create a cron job or GitHub Action to collect data regularly:
# .github/workflows/data-collection.yml
name: Weekly Data Collection
on:
  schedule:
    - cron: '0 2 * * 1' # Every Monday at 2 AM UTC
jobs:
  collect-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup GitHub CLI
        run: |
          gh auth login --with-token <<< "${{ secrets.GITHUB_TOKEN }}"
      - name: Collect PR Comments
        run: |
          just gh-download-pr-comments-since "$(date -d '7 days ago' +%Y-%m-%d)"
      - name: Archive Data
        uses: actions/upload-artifact@v4
        with:
          name: pr-comments-data
          path: data/github/
Integration with Analytics Tools
The collected data can be integrated with various analytics tools:
- Jupyter Notebooks: Load JSON data for Python analysis
- R/RStudio: Import CSV for statistical analysis
- Tableau/Power BI: Connect to CSV for visualization
- Google Sheets: Import CSV for collaborative analysis
Privacy and Security
Data Sensitivity
- Comment data may contain sensitive information
- Review comments might include security discussions
- Author information is public but should be handled responsibly
Recommendations
- Access Control: Limit access to downloaded data files
- Data Retention: Implement retention policies for old data
- Anonymization: Consider anonymizing data for certain analyses
- Compliance: Ensure compliance with organizational data policies
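As one possible approach to the anonymization recommendation, logins can be replaced with salted hashes before data is shared. This is a sketch only, assuming the JSON layout documented under Data Structure. `pseudonymize` is a hypothetical helper, and the result is pseudonymized rather than fully anonymous, since comment bodies may still contain @mentions or names.

```python
import copy
import hashlib

def pseudonymize(prs, salt):
    """Return a deep copy with every login replaced by a salted SHA-256 prefix."""
    def mask(login):
        digest = hashlib.sha256((salt + login).encode("utf-8")).hexdigest()
        return "user_" + digest[:12]

    out = copy.deepcopy(prs)  # leave the original download untouched
    for pr in out:
        if pr.get("author"):
            pr["author"]["login"] = mask(pr["author"]["login"])
        for c in pr.get("comments", []):
            c["author"]["login"] = mask(c["author"]["login"])
        for c in pr.get("review_comments", []):
            c["user"]["login"] = mask(c["user"]["login"])
    return out

sample = [{
    "number": 123,
    "author": {"login": "developer1"},
    "comments": [{"author": {"login": "reviewer1"}, "body": "This looks great!"}],
    "review_comments": [{"user": {"login": "reviewer1"}, "body": "Consider const."}],
}]
masked = pseudonymize(sample, salt="keep-this-secret")
print(masked[0]["author"]["login"])
```

The same login always maps to the same pseudonym for a given salt, so per-author statistics still work on the masked data. Keep the salt private, otherwise known logins can be re-hashed and matched.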
Troubleshooting
Common Issues
Rate Limiting: GitHub API has rate limits
- Solution: Add delays between requests if needed
- Check rate limit status:
gh api rate_limit
Large Repositories: Many PRs may cause long download times
- Solution: Use date-based filtering
- Consider downloading in chunks
Authentication Issues: GitHub CLI not authenticated
- Solution:
gh auth login
- Verify access:
gh auth status
Missing Dependencies: jq not installed
- Ubuntu/Debian:
sudo apt install jq
- macOS:
brew install jq
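For the rate limiting case, a script can decide how long to pause from the output of gh api rate_limit. The sketch below assumes the response has been parsed into a dict and reads the `resources.core.remaining` and `resources.core.reset` fields of GitHub's rate limit response. `seconds_until_reset` is a hypothetical helper, and the example payload is trimmed to the fields it uses.

```python
import json
import time

def seconds_until_reset(payload, resource="core", now=None):
    """Return 0 if requests remain, else whole seconds until the limit resets."""
    info = payload["resources"][resource]
    if info["remaining"] > 0:
        return 0
    now = time.time() if now is None else now
    # "reset" is a Unix timestamp (seconds since the epoch).
    return max(0, int(info["reset"] - now))

example = json.loads(
    '{"resources": {"core": {"limit": 5000, "remaining": 0, "reset": 1700000600}}}'
)
print(seconds_until_reset(example, now=1700000000))  # prints 600
```

A wrapper could sleep for the returned number of seconds before retrying a download.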
Performance Optimization
- Use date filtering for incremental updates
- Download single PRs for focused analysis
- Export to CSV for better tool compatibility
- Consider compressed storage for large datasets
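For compressed storage, one option is to gzip the JSON downloads, since comment text usually compresses well. This is a minimal sketch using only the standard library. The helper names and the .json.gz suffix are assumptions, not something the recipes produce.

```python
import gzip
import json

def save_compressed(prs, path):
    """Write PR data as gzip-compressed JSON text."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(prs, fh)

def load_compressed(path):
    """Read PR data back from a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)

# Usage (paths are illustrative):
#   save_compressed(prs, "data/github/pr_comments_20240315_143022.json.gz")
#   prs = load_compressed("data/github/pr_comments_20240315_143022.json.gz")
```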
File Management
Directory Structure
data/
└── github/
    ├── pr_comments_20240315_143022.json    # Full download
    ├── pr_comments_20240315_143022.csv     # CSV export
    ├── pr_123_comments.json                # Single PR
    └── pr_comments_since_20240301_*.json   # Date-filtered
Cleanup
# Remove old data files (older than 30 days)
find data/github -name "*.json" -mtime +30 -delete
# Keep only the 5 most recent files
ls -t data/github/pr_comments_*.json | tail -n +6 | xargs rm -f
Integration with Development Workflow
These recipes can enhance your development workflow by:
- Code Review Analysis: Understanding review patterns and quality
- Team Performance: Measuring collaboration and engagement
- Knowledge Management: Tracking decisions and discussions
- Process Improvement: Identifying bottlenecks and inefficiencies
- Onboarding: Helping new team members understand project history
Together, these recipes turn PR comment and review history into data you can search, export, and summarize to study your development process and team collaboration patterns.