# GitHub Data Collection with Just + GitHub CLI This document describes the GitHub CLI-based data collection recipes for analyzing pull request comments, reviews, and collaboration patterns in the KGraph-MCP project. ## Prerequisites - [GitHub CLI](https://cli.github.com/) installed and authenticated - `jq` command-line JSON processor - Access to the target repository ## Available Recipes ### 1. Download All PR Comments ```bash just gh-download-pr-comments ``` **Purpose**: Downloads all pull request comments and review comments from the repository. **Output**: - JSON file in `data/github/pr_comments_YYYYMMDD_HHMMSS.json` - Includes general comments, review comments, and PR metadata **Features**: - Progress tracking for large repositories - Handles pagination automatically - Combines PR metadata with comments - Summary statistics at completion ### 2. Download Single PR Comments ```bash just gh-download-pr-comments-single 123 ``` **Purpose**: Downloads all comments for a specific PR number. **Output**: - JSON file in `data/github/pr_123_comments.json` - Includes general comments, review comments, reviews, and full PR details **Use Cases**: - Analyzing specific high-activity PRs - Debugging collaboration issues - Creating detailed PR reports ### 3. Download Comments Since Date ```bash just gh-download-pr-comments-since "2024-01-01" ``` **Purpose**: Downloads comments from PRs updated since a specific date. **Date Formats Supported**: - `YYYY-MM-DD` (e.g., "2024-01-01") - `YYYY-MM-DDTHH:MM:SS` (e.g., "2024-01-01T12:00:00") **Use Cases**: - Incremental data collection - Analyzing recent activity - Sprint-specific analysis ### 4. Export Comments to CSV ```bash just gh-export-pr-comments-csv ``` **Purpose**: Converts the most recent JSON download to CSV format. **CSV Columns**: - PR_Number - PR_Title - PR_State - Comment_Type (general/review) - Comment_ID - Author - Created_At - Updated_At - Body **Use Cases**: - Excel/Google Sheets analysis - Data visualization tools - Statistical analysis ### 5. Search PR Comments ```bash just gh-search-pr-comments "bug fix" ``` **Purpose**: Search through downloaded comments for specific text. **Features**: - Case-insensitive search - Searches both general and review comments - Shows context (PR number, title, author, date) - Regex patterns supported **Use Cases**: - Finding discussions about specific topics - Tracking decision-making processes - Code review pattern analysis ### 6. Show PR Statistics ```bash just gh-show-pr-stats ``` **Purpose**: Generate comprehensive statistics from downloaded data. **Statistics Included**: - Total PRs, comments, and review comments - Top commenters and reviewers - PRs with most activity - Activity trends by month - Participation metrics ## Data Structure ### JSON Output Format ```json [ { "number": 123, "title": "Add new feature", "state": "merged", "createdAt": "2024-01-01T10:00:00Z", "closedAt": "2024-01-02T15:30:00Z", "author": { "login": "developer1" }, "comments": [ { "id": 1234567, "author": { "login": "reviewer1" }, "body": "This looks great!", "createdAt": "2024-01-01T11:00:00Z", "updatedAt": "2024-01-01T11:00:00Z" } ], "review_comments": [ { "id": 2345678, "user": { "login": "reviewer2" }, "body": "Consider using const instead of let here.", "created_at": "2024-01-01T12:00:00Z", "updated_at": "2024-01-01T12:00:00Z", "path": "src/component.js", "line": 42 } ] } ] ``` ## Analysis Examples ### Finding Code Review Patterns ```bash # Download recent data just gh-download-pr-comments-since "2024-01-01" # Search for security-related discussions just gh-search-pr-comments "security" # Export to CSV for analysis just gh-export-pr-comments-csv # Generate statistics just gh-show-pr-stats ``` ### Team Collaboration Analysis ```bash # Download all historical data just gh-download-pr-comments # View top contributors just gh-show-pr-stats # Search for mentoring patterns just gh-search-pr-comments "looks good to me" just gh-search-pr-comments "consider" just gh-search-pr-comments "suggestion" ``` ### Quality Assurance Tracking ```bash # Search for quality-related discussions just gh-search-pr-comments "test" just gh-search-pr-comments "bug" just gh-search-pr-comments "performance" # Find PRs with extensive review activity just gh-show-pr-stats | grep "PRs with Most Comments" ``` ## Automation and CI/CD ### Scheduled Data Collection Create a cron job or GitHub Action to collect data regularly: ```yaml # .github/workflows/data-collection.yml name: Weekly Data Collection on: schedule: - cron: '0 2 * * 1' # Every Monday at 2 AM jobs: collect-data: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup GitHub CLI run: | gh auth login --with-token <<< "${{ secrets.GITHUB_TOKEN }}" - name: Collect PR Comments run: | just gh-download-pr-comments-since "$(date -d '7 days ago' +%Y-%m-%d)" - name: Archive Data uses: actions/upload-artifact@v4 with: name: pr-comments-data path: data/github/ ``` ### Integration with Analytics Tools The collected data can be integrated with various analytics tools: 1. **Jupyter Notebooks**: Load JSON data for Python analysis 2. **R/RStudio**: Import CSV for statistical analysis 3. **Tableau/Power BI**: Connect to CSV for visualization 4. **Google Sheets**: Import CSV for collaborative analysis ## Privacy and Security ### Data Sensitivity - Comment data may contain sensitive information - Review comments might include security discussions - Author information is public but should be handled responsibly ### Recommendations 1. **Access Control**: Limit access to downloaded data files 2. **Data Retention**: Implement retention policies for old data 3. **Anonymization**: Consider anonymizing data for certain analyses 4. **Compliance**: Ensure compliance with organizational data policies ## Troubleshooting ### Common Issues 1. **Rate Limiting**: GitHub API has rate limits - Solution: Add delays between requests if needed - Check rate limit status: `gh api rate_limit` 2. **Large Repositories**: Many PRs may cause long download times - Solution: Use date-based filtering - Consider downloading in chunks 3. **Authentication Issues**: GitHub CLI not authenticated - Solution: `gh auth login` - Verify access: `gh auth status` 4. **Missing Dependencies**: `jq` not installed - Ubuntu/Debian: `sudo apt install jq` - macOS: `brew install jq` ### Performance Optimization - Use date filtering for incremental updates - Download single PRs for focused analysis - Export to CSV for better tool compatibility - Consider compressed storage for large datasets ## File Management ### Directory Structure ``` data/ └── github/ ├── pr_comments_20240315_143022.json # Full download ├── pr_comments_20240315_143022.csv # CSV export ├── pr_123_comments.json # Single PR └── pr_comments_since_20240301_*.json # Date-filtered ``` ### Cleanup ```bash # Remove old data files (older than 30 days) find data/github -name "*.json" -mtime +30 -delete # Keep only the 5 most recent files ls -t data/github/pr_comments_*.json | tail -n +6 | xargs rm -f ``` ## Integration with Development Workflow These recipes can enhance your development workflow by: 1. **Code Review Analysis**: Understanding review patterns and quality 2. **Team Performance**: Measuring collaboration and engagement 3. **Knowledge Management**: Tracking decisions and discussions 4. **Process Improvement**: Identifying bottlenecks and inefficiencies 5. **Onboarding**: Helping new team members understand project history The GitHub data collection system provides powerful insights into your development process and team collaboration patterns.