| # GitHub Data Collection with Just + GitHub CLI | |
| This document describes the GitHub CLI-based data collection recipes for analyzing pull request comments, reviews, and collaboration patterns in the KGraph-MCP project. | |
| ## Prerequisites | |
| - [GitHub CLI](https://cli.github.com/) installed and authenticated | |
| - `jq` command-line JSON processor | |
| - Access to the target repository | |
| ## Available Recipes | |
| ### 1. Download All PR Comments | |
| ```bash | |
| just gh-download-pr-comments | |
| ``` | |
| **Purpose**: Downloads all pull request comments and review comments from the repository. | |
| **Output**: | |
| - JSON file in `data/github/pr_comments_YYYYMMDD_HHMMSS.json` | |
| - Includes general comments, review comments, and PR metadata | |
| **Features**: | |
| - Progress tracking for large repositories | |
| - Handles pagination automatically | |
| - Combines PR metadata with comments | |
| - Summary statistics at completion | |
| ### 2. Download Single PR Comments | |
| ```bash | |
| just gh-download-pr-comments-single 123 | |
| ``` | |
| **Purpose**: Downloads all comments for a specific PR number. | |
| **Output**: | |
| - JSON file in `data/github/pr_123_comments.json` | |
| - Includes general comments, review comments, reviews, and full PR details | |
| **Use Cases**: | |
| - Analyzing specific high-activity PRs | |
| - Debugging collaboration issues | |
| - Creating detailed PR reports | |
| ### 3. Download Comments Since Date | |
| ```bash | |
| just gh-download-pr-comments-since "2024-01-01" | |
| ``` | |
| **Purpose**: Downloads comments from PRs updated since a specific date. | |
| **Date Formats Supported**: | |
| - `YYYY-MM-DD` (e.g., "2024-01-01") | |
| - `YYYY-MM-DDTHH:MM:SS` (e.g., "2024-01-01T12:00:00") | |
| **Use Cases**: | |
| - Incremental data collection | |
| - Analyzing recent activity | |
| - Sprint-specific analysis | |
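The two accepted date formats can be checked before the recipe is invoked. The following is a minimal sketch, not the recipe's own logic: `parse_since` is a hypothetical helper, and treating the value as UTC (with date-only input meaning midnight) is an assumption the source does not confirm.

```python
from datetime import datetime, timezone

# The two formats listed above.
SUPPORTED_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d")


def parse_since(value: str) -> datetime:
    """Parse a 'since' argument given in either supported format.

    Assumption: the value is interpreted as UTC; the recipe may differ.
    """
    for fmt in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unsupported date format: {value!r}")


if __name__ == "__main__":
    print(parse_since("2024-01-01").isoformat())
    print(parse_since("2024-01-01T12:00:00").isoformat())
```

Rejecting anything else up front gives a clear error instead of a silently empty download.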
| ### 4. Export Comments to CSV | |
| ```bash | |
| just gh-export-pr-comments-csv | |
| ``` | |
| **Purpose**: Converts the most recent JSON download to CSV format. | |
| **CSV Columns**: | |
| - PR_Number | |
| - PR_Title | |
| - PR_State | |
| - Comment_Type (general/review) | |
| - Comment_ID | |
| - Author | |
| - Created_At | |
| - Updated_At | |
| - Body | |
| **Use Cases**: | |
| - Excel/Google Sheets analysis | |
| - Data visualization tools | |
| - Statistical analysis | |
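To make the column mapping concrete, here is a rough Python sketch of the same flattening. It is not the recipe's implementation; it assumes the JSON shape shown under "Data Structure", and `comment_rows`, `to_csv`, and `SAMPLE` are illustrative names.

```python
import csv
import io

# Column order taken from the list above.
COLUMNS = [
    "PR_Number", "PR_Title", "PR_State", "Comment_Type", "Comment_ID",
    "Author", "Created_At", "Updated_At", "Body",
]


def comment_rows(prs):
    """Yield one row per general or review comment."""
    for pr in prs:
        base = [pr["number"], pr["title"], pr["state"]]
        for c in pr.get("comments", []):
            yield base + ["general", c["id"], c["author"]["login"],
                          c["createdAt"], c["updatedAt"], c["body"]]
        for c in pr.get("review_comments", []):
            yield base + ["review", c["id"], c["user"]["login"],
                          c["created_at"], c["updated_at"], c["body"]]


def to_csv(prs) -> str:
    """Serialize all comments to CSV text; csv.writer quotes commas and newlines."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(comment_rows(prs))
    return buf.getvalue()


# Illustrative data only, shaped like the output of the download recipes.
SAMPLE = [{
    "number": 123, "title": "Add new feature", "state": "merged",
    "comments": [{"id": 1, "author": {"login": "reviewer1"},
                  "body": "Nice, thanks",
                  "createdAt": "2024-01-01T11:00:00Z",
                  "updatedAt": "2024-01-01T11:00:00Z"}],
    "review_comments": [{"id": 2, "user": {"login": "reviewer2"},
                         "body": "Consider using const here.",
                         "created_at": "2024-01-01T12:00:00Z",
                         "updated_at": "2024-01-01T12:00:00Z"}],
}]

if __name__ == "__main__":
    print(to_csv(SAMPLE), end="")
```

Comment bodies often contain commas and line breaks, so a CSV writer that quotes fields is safer than joining strings by hand.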
| ### 5. Search PR Comments | |
| ```bash | |
| just gh-search-pr-comments "bug fix" | |
| ``` | |
| **Purpose**: Search through downloaded comments for specific text. | |
| **Features**: | |
| - Case-insensitive search | |
| - Searches both general and review comments | |
| - Shows context (PR number, title, author, date) | |
| - Regex patterns supported | |
| **Use Cases**: | |
| - Finding discussions about specific topics | |
| - Tracking decision-making processes | |
| - Code review pattern analysis | |
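The search behavior described above can be approximated in a few lines of Python, which may help when the same filter is needed inside a notebook. This is a sketch under the assumption that the data follows the JSON shape shown under "Data Structure"; `search_comments` and `SAMPLE` are hypothetical names, not part of the recipes.

```python
import re

# (label, list key, author key, timestamp key) for the two comment shapes.
COMMENT_KINDS = (
    ("general", "comments", "author", "createdAt"),
    ("review", "review_comments", "user", "created_at"),
)


def search_comments(prs, pattern):
    """Return context for every comment whose body matches `pattern`.

    The pattern is treated as a case-insensitive regular expression.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for pr in prs:
        for kind, list_key, author_key, date_key in COMMENT_KINDS:
            for comment in pr.get(list_key, []):
                if regex.search(comment.get("body") or ""):
                    hits.append({
                        "pr": pr["number"],
                        "title": pr["title"],
                        "type": kind,
                        "author": comment[author_key]["login"],
                        "date": comment[date_key],
                    })
    return hits


# Illustrative data only.
SAMPLE = [{
    "number": 123, "title": "Add new feature",
    "comments": [{"author": {"login": "reviewer1"},
                  "body": "This bug fix looks great!",
                  "createdAt": "2024-01-01T11:00:00Z"}],
    "review_comments": [{"user": {"login": "reviewer2"},
                         "body": "Is this secure?",
                         "created_at": "2024-01-01T12:00:00Z"}],
}]

if __name__ == "__main__":
    for hit in search_comments(SAMPLE, r"secur(e|ity)"):
        print(hit)
```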
| ### 6. Show PR Statistics | |
| ```bash | |
| just gh-show-pr-stats | |
| ``` | |
| **Purpose**: Generate comprehensive statistics from downloaded data. | |
| **Statistics Included**: | |
| - Total PRs, comments, and review comments | |
| - Top commenters and reviewers | |
| - PRs with most activity | |
| - Activity trends by month | |
| - Participation metrics | |
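For custom reports, several of these figures can be recomputed directly from a download. The sketch below is an approximation, not the recipe's code: it assumes the JSON shape shown under "Data Structure", and `summarize` and `SAMPLE` are hypothetical names.

```python
from collections import Counter


def summarize(prs, top=5):
    """Compute a few of the statistics listed above from downloaded PR data."""
    commenters = Counter()  # comments per author, both comment types
    per_pr = Counter()      # comments per PR number
    per_month = Counter()   # comments per "YYYY-MM"
    general = review = 0
    for pr in prs:
        for c in pr.get("comments", []):
            general += 1
            commenters[c["author"]["login"]] += 1
            per_pr[pr["number"]] += 1
            per_month[c["createdAt"][:7]] += 1
        for c in pr.get("review_comments", []):
            review += 1
            commenters[c["user"]["login"]] += 1
            per_pr[pr["number"]] += 1
            per_month[c["created_at"][:7]] += 1
    return {
        "total_prs": len(prs),
        "total_comments": general,
        "total_review_comments": review,
        "top_commenters": commenters.most_common(top),
        "most_active_prs": per_pr.most_common(top),
        "by_month": sorted(per_month.items()),
    }


# Illustrative data only.
SAMPLE = [
    {"number": 1,
     "comments": [
         {"author": {"login": "alice"}, "createdAt": "2024-01-05T10:00:00Z"},
         {"author": {"login": "alice"}, "createdAt": "2024-01-06T10:00:00Z"}],
     "review_comments": [
         {"user": {"login": "bob"}, "created_at": "2024-02-01T09:00:00Z"}]},
    {"number": 2,
     "comments": [
         {"author": {"login": "alice"}, "createdAt": "2024-02-02T10:00:00Z"}],
     "review_comments": []},
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```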
| ## Data Structure | |
| ### JSON Output Format | |
| ```json | |
| [ | |
|   { | |
|     "number": 123, | |
|     "title": "Add new feature", | |
|     "state": "merged", | |
|     "createdAt": "2024-01-01T10:00:00Z", | |
|     "closedAt": "2024-01-02T15:30:00Z", | |
|     "author": { | |
|       "login": "developer1" | |
|     }, | |
|     "comments": [ | |
|       { | |
|         "id": 1234567, | |
|         "author": { | |
|           "login": "reviewer1" | |
|         }, | |
|         "body": "This looks great!", | |
|         "createdAt": "2024-01-01T11:00:00Z", | |
|         "updatedAt": "2024-01-01T11:00:00Z" | |
|       } | |
|     ], | |
|     "review_comments": [ | |
|       { | |
|         "id": 2345678, | |
|         "user": { | |
|           "login": "reviewer2" | |
|         }, | |
|         "body": "Consider using const instead of let here.", | |
|         "created_at": "2024-01-01T12:00:00Z", | |
|         "updated_at": "2024-01-01T12:00:00Z", | |
|         "path": "src/component.js", | |
|         "line": 42 | |
|       } | |
|     ] | |
|   } | |
| ] | |
| ``` | |
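Note that the two comment lists use different field names in the sample above: general comments carry `author`, `createdAt`, and `updatedAt`, while review comments carry `user`, `created_at`, and `updated_at`. One way to hide that difference from analysis code is to flatten both into a single record type. The sketch below does this with a dataclass; `Comment` and `flatten` are hypothetical names, and the embedded JSON is the sample from this section.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    pr_number: int
    kind: str                   # "general" or "review"
    comment_id: int
    author: str
    body: str
    created_at: str
    updated_at: str
    path: Optional[str] = None  # review comments only
    line: Optional[int] = None  # review comments only


def flatten(pr: dict) -> list:
    """Convert one PR entry into a list of uniform Comment records."""
    out = []
    for c in pr.get("comments", []):
        out.append(Comment(pr["number"], "general", c["id"],
                           c["author"]["login"], c["body"],
                           c["createdAt"], c["updatedAt"]))
    for c in pr.get("review_comments", []):
        out.append(Comment(pr["number"], "review", c["id"],
                           c["user"]["login"], c["body"],
                           c["created_at"], c["updated_at"],
                           c.get("path"), c.get("line")))
    return out


SAMPLE_JSON = """
[{"number": 123, "title": "Add new feature", "state": "merged",
  "comments": [{"id": 1234567, "author": {"login": "reviewer1"},
                "body": "This looks great!",
                "createdAt": "2024-01-01T11:00:00Z",
                "updatedAt": "2024-01-01T11:00:00Z"}],
  "review_comments": [{"id": 2345678, "user": {"login": "reviewer2"},
                       "body": "Consider using const instead of let here.",
                       "created_at": "2024-01-01T12:00:00Z",
                       "updated_at": "2024-01-01T12:00:00Z",
                       "path": "src/component.js", "line": 42}]}]
"""

if __name__ == "__main__":
    for pr in json.loads(SAMPLE_JSON):
        for comment in flatten(pr):
            print(comment)
```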
| ## Analysis Examples | |
| ### Finding Code Review Patterns | |
| ```bash | |
| # Download recent data | |
| just gh-download-pr-comments-since "2024-01-01" | |
| # Search for security-related discussions | |
| just gh-search-pr-comments "security" | |
| # Export to CSV for analysis | |
| just gh-export-pr-comments-csv | |
| # Generate statistics | |
| just gh-show-pr-stats | |
| ``` | |
| ### Team Collaboration Analysis | |
| ```bash | |
| # Download all historical data | |
| just gh-download-pr-comments | |
| # View top contributors | |
| just gh-show-pr-stats | |
| # Search for mentoring patterns | |
| just gh-search-pr-comments "looks good to me" | |
| just gh-search-pr-comments "consider" | |
| just gh-search-pr-comments "suggestion" | |
| ``` | |
| ### Quality Assurance Tracking | |
| ```bash | |
| # Search for quality-related discussions | |
| just gh-search-pr-comments "test" | |
| just gh-search-pr-comments "bug" | |
| just gh-search-pr-comments "performance" | |
| # Find PRs with extensive review activity (print the heading and the 10 lines that follow it) | |
| just gh-show-pr-stats | grep -A 10 "PRs with Most Comments" | |
| ``` | |
| ## Automation and CI/CD | |
| ### Scheduled Data Collection | |
| Use a cron job or a GitHub Actions workflow to collect data on a schedule: | |
| ```yaml | |
| # .github/workflows/data-collection.yml | |
| name: Weekly Data Collection | |
| on: | |
|   schedule: | |
|     - cron: '0 2 * * 1' # Every Monday at 02:00 UTC | |
| jobs: | |
|   collect-data: | |
|     runs-on: ubuntu-latest | |
|     env: | |
|       # The preinstalled gh CLI reads GH_TOKEN, so no login step is needed | |
|       GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} | |
|     steps: | |
|       - uses: actions/checkout@v4 | |
|       # just is not preinstalled on GitHub-hosted runners | |
|       - name: Install just | |
|         uses: extractions/setup-just@v2 | |
|       - name: Collect PR Comments | |
|         run: | | |
|           just gh-download-pr-comments-since "$(date -d '7 days ago' +%Y-%m-%d)" | |
|       - name: Archive Data | |
|         uses: actions/upload-artifact@v4 | |
|         with: | |
|           name: pr-comments-data | |
|           path: data/github/ | |
| ``` | |
| ### Integration with Analytics Tools | |
| The collected data can be integrated with various analytics tools: | |
| 1. **Jupyter Notebooks**: Load JSON data for Python analysis | |
| 2. **R/RStudio**: Import CSV for statistical analysis | |
| 3. **Tableau/Power BI**: Connect to CSV for visualization | |
| 4. **Google Sheets**: Import CSV for collaborative analysis | |
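For the notebook case, a small loader that picks the newest full download is often the first cell. The sketch below assumes the file naming shown under "Directory Structure" and the default `data/github` location; `latest_full_download` and `load_prs` are hypothetical helpers.

```python
import json
import re
from pathlib import Path

# Matches full downloads such as pr_comments_20240315_143022.json, but not
# date-filtered files, whose names start with pr_comments_since_.
FULL_DOWNLOAD = re.compile(r"pr_comments_\d{8}_\d{6}\.json")


def latest_full_download(data_dir="data/github"):
    """Return the newest full download by filename timestamp, or None."""
    candidates = [p for p in Path(data_dir).glob("pr_comments_*.json")
                  if FULL_DOWNLOAD.fullmatch(p.name)]
    # YYYYMMDD_HHMMSS timestamps sort correctly as plain strings.
    return max(candidates, key=lambda p: p.name, default=None)


def load_prs(path):
    """Load one download file into a list of PR dictionaries."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    newest = latest_full_download()
    if newest is None:
        print("no full download found in data/github")
    else:
        print(f"{newest}: {len(load_prs(newest))} PRs")
```

The explicit pattern matters because a plain `pr_comments_*.json` glob also matches the date-filtered files.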
| ## Privacy and Security | |
| ### Data Sensitivity | |
| - Comment data may contain sensitive information | |
| - Review comments might include security discussions | |
| - Author information is visible to anyone with access to the repository (and to everyone for public repositories), but it should still be handled responsibly | |
| ### Recommendations | |
| 1. **Access Control**: Limit access to downloaded data files | |
| 2. **Data Retention**: Implement retention policies for old data | |
| 3. **Anonymization**: Consider anonymizing data for certain analyses | |
| 4. **Compliance**: Ensure compliance with organizational data policies | |
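As one possible approach to the anonymization recommendation, logins can be replaced with stable pseudonyms before data is shared. The sketch below uses a keyed hash (HMAC-SHA-256) so the same login always maps to the same pseudonym while the mapping cannot be rebuilt without the secret. It assumes the JSON shape shown under "Data Structure"; `pseudonym` and `anonymize` are hypothetical names, and comment bodies are left untouched, so `@mentions` and names inside the text would still need separate handling.

```python
import copy
import hashlib
import hmac


def pseudonym(login: str, secret: bytes) -> str:
    """Map a login to a stable pseudonym using HMAC-SHA-256."""
    digest = hmac.new(secret, login.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def anonymize(prs, secret: bytes):
    """Return a deep copy of `prs` with every author/user login replaced."""
    result = copy.deepcopy(prs)
    for pr in result:
        people = [pr.get("author")]
        people += [c.get("author") for c in pr.get("comments", [])]
        people += [c.get("user") for c in pr.get("review_comments", [])]
        for person in people:
            if person and "login" in person:
                person["login"] = pseudonym(person["login"], secret)
    return result


# Illustrative data only.
SAMPLE = [{
    "number": 123,
    "author": {"login": "developer1"},
    "comments": [{"author": {"login": "reviewer1"}, "body": "This looks great!"}],
    "review_comments": [{"user": {"login": "developer1"}, "body": "Done."}],
}]

if __name__ == "__main__":
    print(anonymize(SAMPLE, secret=b"replace-with-a-private-key"))
```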
| ## Troubleshooting | |
| ### Common Issues | |
| 1. **Rate Limiting**: The GitHub REST API allows 5,000 requests per hour for an authenticated user (1,000 per hour per repository when using `GITHUB_TOKEN` in GitHub Actions) | |
|    - Solution: Add delays between requests if needed | |
|    - Check rate limit status: `gh api rate_limit` | |
| 2. **Large Repositories**: Many PRs may cause long download times | |
|    - Solution: Use date-based filtering | |
|    - Consider downloading in chunks | |
| 3. **Authentication Issues**: GitHub CLI not authenticated | |
|    - Solution: `gh auth login` | |
|    - Verify access: `gh auth status` | |
| 4. **Missing Dependencies**: `jq` not installed | |
|    - Ubuntu/Debian: `sudo apt install jq` | |
|    - macOS: `brew install jq` | |
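For the rate-limiting case, the response of `gh api rate_limit` can be used to decide how long to pause before a large download. The following is a sketch rather than part of the recipes: `fetch_rate_limit` and `seconds_to_wait` are hypothetical helpers, and the threshold of 100 remaining requests is an arbitrary assumption. It reads the `resources.core.remaining` and `resources.core.reset` fields of the response, where `reset` is a Unix timestamp.

```python
import json
import subprocess
import time
from typing import Optional


def fetch_rate_limit() -> dict:
    """Run `gh api rate_limit` and return the parsed JSON response."""
    done = subprocess.run(["gh", "api", "rate_limit"],
                          capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def seconds_to_wait(payload: dict, min_remaining: int = 100,
                    now: Optional[float] = None) -> float:
    """Return how long to sleep before issuing more core API requests.

    Returns 0.0 while at least `min_remaining` requests are left; otherwise
    the time until the quota resets.
    """
    core = payload["resources"]["core"]
    if core["remaining"] >= min_remaining:
        return 0.0
    current = time.time() if now is None else now
    return max(0.0, float(core["reset"] - current))


if __name__ == "__main__":
    # Illustrative payload; call fetch_rate_limit() for live values.
    example = {"resources": {"core": {"limit": 5000, "remaining": 10,
                                      "reset": 1_700_000_060}}}
    print(seconds_to_wait(example, now=1_700_000_000))
```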
| ### Performance Optimization | |
| - Use date filtering for incremental updates | |
| - Download single PRs for focused analysis | |
| - Export to CSV for better tool compatibility | |
| - Consider compressed storage for large datasets | |
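Compressed storage can be as simple as writing the JSON through gzip. The sketch below uses only the Python standard library; `save_compressed` and `load_compressed` are hypothetical helpers, and the `.json.gz` suffix is an assumed convention, not something the recipes produce.

```python
import gzip
import json


def save_compressed(prs, path):
    """Write PR data as gzip-compressed JSON (e.g. to a .json.gz file)."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(prs, fh)


def load_compressed(path):
    """Read PR data back from a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    import os
    import tempfile

    data = [{"number": 123, "comments": [], "review_comments": []}]
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "pr_comments_example.json.gz")
        save_compressed(data, target)
        print(load_compressed(target) == data)
```

Files saved this way no longer end in `.json`, so any cleanup command that matches `*.json` would need its pattern adjusted to cover them.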
| ## File Management | |
| ### Directory Structure | |
| ``` | |
| data/ | |
| └── github/ | |
|     ├── pr_comments_20240315_143022.json # Full download | |
|     ├── pr_comments_20240315_143022.csv # CSV export | |
|     ├── pr_123_comments.json # Single PR | |
|     └── pr_comments_since_20240301_*.json # Date-filtered | |
| ``` | |
| ### Cleanup | |
| ```bash | |
| # Remove JSON data files older than 30 days (CSV exports are not matched by this pattern) | |
| find data/github -name "*.json" -mtime +30 -delete | |
| # Keep only the 5 most recent files | |
| ls -t data/github/pr_comments_*.json | tail -n +6 | xargs rm -f | |
| ``` | |
| ## Integration with Development Workflow | |
| These recipes can enhance your development workflow by: | |
| 1. **Code Review Analysis**: Understanding review patterns and quality | |
| 2. **Team Performance**: Measuring collaboration and engagement | |
| 3. **Knowledge Management**: Tracking decisions and discussions | |
| 4. **Process Improvement**: Identifying bottlenecks and inefficiencies | |
| 5. **Onboarding**: Helping new team members understand project history | |
| Together, these recipes turn pull request comment and review history into data that can be searched, exported, and summarized. | |