GitHub Data Collection with Just + GitHub CLI

This document describes the GitHub CLI-based data collection recipes for analyzing pull request comments, reviews, and collaboration patterns in the KGraph-MCP project.

Prerequisites

  • GitHub CLI installed and authenticated
  • jq command-line JSON processor
  • Access to the target repository

Available Recipes

1. Download All PR Comments

just gh-download-pr-comments

Purpose: Downloads all pull request comments and review comments from the repository.

Output:

  • JSON file in data/github/pr_comments_YYYYMMDD_HHMMSS.json
  • Includes general comments, review comments, and PR metadata

Features:

  • Progress tracking for large repositories
  • Handles pagination automatically
  • Combines PR metadata with comments
  • Summary statistics at completion

2. Download Single PR Comments

just gh-download-pr-comments-single 123

Purpose: Downloads all comments for a specific PR number.

Output:

  • JSON file in data/github/pr_123_comments.json
  • Includes general comments, review comments, reviews, and full PR details

Use Cases:

  • Analyzing specific high-activity PRs
  • Debugging collaboration issues
  • Creating detailed PR reports

3. Download Comments Since Date

just gh-download-pr-comments-since "2024-01-01"

Purpose: Downloads comments from PRs updated since a specific date.

Date Formats Supported:

  • YYYY-MM-DD (e.g., "2024-01-01")
  • YYYY-MM-DDTHH:MM:SS (e.g., "2024-01-01T12:00:00")

Use Cases:

  • Incremental data collection
  • Analyzing recent activity
  • Sprint-specific analysis
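
The two date formats listed above can be checked before a long download starts. The following is a minimal sketch, not the recipe's actual implementation: the function names parse_since and updated_since are hypothetical, and treating a bare date as midnight UTC is an assumption, since the source does not state a time zone.

```python
from datetime import datetime, timezone

# The two formats documented above, most specific first.
SUPPORTED_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d")


def parse_since(value: str) -> datetime:
    """Parse a 'since' argument given in either documented format.

    The result is tagged as UTC. This is an assumption: the guide does
    not say which time zone the recipe applies to its argument.
    """
    for fmt in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {value!r}")


def updated_since(timestamp: str, since: datetime) -> bool:
    """Return True if a timestamp such as '2024-01-02T15:30:00Z' is at or after `since`."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return moment.replace(tzinfo=timezone.utc) >= since
```

For example, updated_since("2024-01-02T15:30:00Z", parse_since("2024-01-01")) is true, while a value such as "01/01/2024" raises ValueError instead of silently producing an empty download.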

4. Export Comments to CSV

just gh-export-pr-comments-csv

Purpose: Converts the most recent JSON download to CSV format.

CSV Columns:

  • PR_Number
  • PR_Title
  • PR_State
  • Comment_Type (general/review)
  • Comment_ID
  • Author
  • Created_At
  • Updated_At
  • Body

Use Cases:

  • Excel/Google Sheets analysis
  • Data visualization tools
  • Statistical analysis
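
To illustrate how the nested JSON could map onto the CSV columns listed above, here is a hedged sketch in Python. It assumes the JSON layout shown under "JSON Output Format" in this guide; the function names comment_rows and to_csv are hypothetical, and the real recipe may build its rows differently.

```python
import csv
import io

# Column order as documented for the CSV export.
COLUMNS = ["PR_Number", "PR_Title", "PR_State", "Comment_Type", "Comment_ID",
           "Author", "Created_At", "Updated_At", "Body"]


def comment_rows(prs):
    """Yield one row dict per comment, covering both comment types."""
    for pr in prs:
        base = {"PR_Number": pr["number"], "PR_Title": pr["title"],
                "PR_State": pr["state"]}
        # General comments use author / createdAt / updatedAt.
        for c in pr.get("comments", []):
            yield {**base, "Comment_Type": "general", "Comment_ID": c["id"],
                   "Author": c["author"]["login"], "Created_At": c["createdAt"],
                   "Updated_At": c["updatedAt"], "Body": c["body"]}
        # Review comments use user / created_at / updated_at.
        for c in pr.get("review_comments", []):
            yield {**base, "Comment_Type": "review", "Comment_ID": c["id"],
                   "Author": c["user"]["login"], "Created_At": c["created_at"],
                   "Updated_At": c["updated_at"], "Body": c["body"]}


def to_csv(prs) -> str:
    """Render all comments as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(comment_rows(prs))
    return buffer.getvalue()
```

Using the csv module rather than joining strings by hand matters here because comment bodies routinely contain commas, quotes, and newlines, which the writer quotes correctly.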

5. Search PR Comments

just gh-search-pr-comments "bug fix"

Purpose: Search through downloaded comments for specific text.

Features:

  • Case-insensitive search
  • Searches both general and review comments
  • Shows context (PR number, title, author, date)
  • Regex patterns supported

Use Cases:

  • Finding discussions about specific topics
  • Tracking decision-making processes
  • Code review pattern analysis
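
The search behaviour described above (case-insensitive, regex-capable, covering both comment types, reporting PR context) can be approximated in a few lines of Python when working with the downloaded JSON directly. This is a sketch under the assumption that the file follows the layout in "JSON Output Format"; search_comments is a hypothetical name, not part of the recipes.

```python
import re


def search_comments(prs, pattern):
    """Return one context record per comment whose body matches `pattern`.

    `pattern` is compiled as a regular expression and matched
    case-insensitively against general and review comments alike.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for pr in prs:
        # The two comment types name their author and date fields differently.
        general = [("general", c["author"]["login"], c["createdAt"], c.get("body"))
                   for c in pr.get("comments", [])]
        review = [("review", c["user"]["login"], c["created_at"], c.get("body"))
                  for c in pr.get("review_comments", [])]
        for kind, author, created, body in general + review:
            if regex.search(body or ""):
                hits.append({"pr": pr["number"], "title": pr["title"],
                             "type": kind, "author": author,
                             "date": created, "body": body})
    return hits
```

A call such as search_comments(prs, r"security|vulnerab") returns a list of dictionaries that can be printed, counted, or passed on to further analysis.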

6. Show PR Statistics

just gh-show-pr-stats

Purpose: Generate comprehensive statistics from downloaded data.

Statistics Included:

  • Total PRs, comments, and review comments
  • Top commenters and reviewers
  • PRs with most activity
  • Activity trends by month
  • Participation metrics
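
For custom reports beyond what the recipe prints, similar figures can be derived from the downloaded JSON. The sketch below is one possible approach, assuming the layout in "JSON Output Format"; summarize and its result keys are hypothetical names, and the recipe's own definitions of these metrics may differ.

```python
from collections import Counter


def summarize(prs, top=5):
    """Compute basic activity statistics from a list of PR records."""
    commenters = Counter()   # comments per author login
    per_pr = Counter()       # comments per PR number
    by_month = Counter()     # comments per 'YYYY-MM'
    general_total = review_total = 0
    for pr in prs:
        for c in pr.get("comments", []):
            general_total += 1
            commenters[c["author"]["login"]] += 1
            per_pr[pr["number"]] += 1
            by_month[c["createdAt"][:7]] += 1
        for c in pr.get("review_comments", []):
            review_total += 1
            commenters[c["user"]["login"]] += 1
            per_pr[pr["number"]] += 1
            by_month[c["created_at"][:7]] += 1
    return {
        "total_prs": len(prs),
        "total_comments": general_total,
        "total_review_comments": review_total,
        "top_commenters": commenters.most_common(top),
        "most_active_prs": per_pr.most_common(top),
        "comments_by_month": sorted(by_month.items()),
    }
```

Slicing the first seven characters of an ISO 8601 timestamp gives the year and month, which is enough for a monthly trend without parsing full dates.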

Data Structure

JSON Output Format

[
  {
    "number": 123,
    "title": "Add new feature",
    "state": "merged",
    "createdAt": "2024-01-01T10:00:00Z",
    "closedAt": "2024-01-02T15:30:00Z",
    "author": {
      "login": "developer1"
    },
    "comments": [
      {
        "id": 1234567,
        "author": {
          "login": "reviewer1"
        },
        "body": "This looks great!",
        "createdAt": "2024-01-01T11:00:00Z",
        "updatedAt": "2024-01-01T11:00:00Z"
      }
    ],
    "review_comments": [
      {
        "id": 2345678,
        "user": {
          "login": "reviewer2"
        },
        "body": "Consider using const instead of let here.",
        "created_at": "2024-01-01T12:00:00Z",
        "updated_at": "2024-01-01T12:00:00Z",
        "path": "src/component.js",
        "line": 42
      }
    ]
  }
]

Note that the two comment arrays use different field names in this example: entries in comments carry author, createdAt, and updatedAt, while entries in review_comments carry user, created_at, and updated_at. Analysis code that reads both arrays needs to handle both spellings.

Analysis Examples

Finding Code Review Patterns

# Download recent data
just gh-download-pr-comments-since "2024-01-01"

# Search for security-related discussions
just gh-search-pr-comments "security"

# Export to CSV for analysis
just gh-export-pr-comments-csv

# Generate statistics
just gh-show-pr-stats

Team Collaboration Analysis

# Download all historical data
just gh-download-pr-comments

# View top contributors
just gh-show-pr-stats

# Search for mentoring patterns
just gh-search-pr-comments "looks good to me"
just gh-search-pr-comments "consider"
just gh-search-pr-comments "suggestion"

Quality Assurance Tracking

# Search for quality-related discussions
just gh-search-pr-comments "test"
just gh-search-pr-comments "bug"
just gh-search-pr-comments "performance"

# Find PRs with extensive review activity
just gh-show-pr-stats | grep -A 10 "PRs with Most Comments"

Automation and CI/CD

Scheduled Data Collection

Create a cron job or GitHub Action to collect data regularly:

# .github/workflows/data-collection.yml
name: Weekly Data Collection

on:
  schedule:
    - cron: '0 2 * * 1'  # Every Monday at 02:00 UTC

jobs:
  collect-data:
    runs-on: ubuntu-latest
    steps:
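      - uses: actions/checkout@v4
      # just is not preinstalled on GitHub-hosted runners; this community
      # action is one way to install it (any other installer works too)
      - name: Install just
        uses: extractions/setup-just@v2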
      - name: Setup GitHub CLI
        run: |
          gh auth login --with-token <<< "${{ secrets.GITHUB_TOKEN }}"
      - name: Collect PR Comments
        run: |
          just gh-download-pr-comments-since "$(date -d '7 days ago' +%Y-%m-%d)"
      - name: Archive Data
        uses: actions/upload-artifact@v4
        with:
          name: pr-comments-data
          path: data/github/

Integration with Analytics Tools

The collected data can be integrated with various analytics tools:

  1. Jupyter Notebooks: Load JSON data for Python analysis
  2. R/RStudio: Import CSV for statistical analysis
  3. Tableau/Power BI: Connect to CSV for visualization
  4. Google Sheets: Import CSV for collaborative analysis
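
As a starting point for the notebook route, the following sketch locates the newest full download and loads it. It assumes the file naming shown under "Directory Structure" (pr_comments_YYYYMMDD_HHMMSS.json in data/github); latest_download and load_prs are hypothetical helper names.

```python
import json
from pathlib import Path


def latest_download(data_dir="data/github"):
    """Return the newest full-download JSON file, or None if none exists.

    The YYYYMMDD_HHMMSS timestamp in the file name sorts chronologically
    as plain text, so the last name in sorted order is the most recent.
    The [0-9] in the pattern skips date-filtered 'since' files.
    """
    candidates = sorted(Path(data_dir).glob("pr_comments_[0-9]*.json"))
    return candidates[-1] if candidates else None


def load_prs(path):
    """Load a downloaded file into a list of PR dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

In a notebook cell, prs = load_prs(latest_download()) then gives a plain list of dictionaries that can be inspected directly or handed to a data-frame library.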

Privacy and Security

Data Sensitivity

  • Comment data may contain sensitive information
  • Review comments might include security discussions
  • Author logins are visible to anyone on public repositories, but should still be handled responsibly

Recommendations

  1. Access Control: Limit access to downloaded data files
  2. Data Retention: Implement retention policies for old data
  3. Anonymization: Consider anonymizing data for certain analyses
  4. Compliance: Ensure compliance with organizational data policies
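
One way to approach the anonymization recommendation is to replace logins with keyed pseudonyms, so the same person maps to the same label across a dataset without exposing the original name. This is a minimal sketch, assuming the layout in "JSON Output Format"; pseudonym and anonymize are hypothetical names, and the secret key is something the analyst would have to generate and store separately.

```python
import copy
import hashlib
import hmac


def pseudonym(login: str, secret: bytes) -> str:
    """Map a login to a stable label using HMAC-SHA256 with a secret key."""
    digest = hmac.new(secret, login.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user-{digest[:12]}"


def anonymize(prs, secret: bytes):
    """Return a deep copy of `prs` with every login replaced by a pseudonym.

    Only the structured login fields are rewritten. Comment bodies are
    left untouched, so @mentions inside them remain readable.
    """
    result = copy.deepcopy(prs)
    for pr in result:
        people = [pr.get("author")]
        people += [c.get("author") for c in pr.get("comments", [])]
        people += [c.get("user") for c in pr.get("review_comments", [])]
        for person in people:
            if person and "login" in person:
                person["login"] = pseudonym(person["login"], secret)
    return result
```

A keyed hash is used instead of a plain hash because GitHub logins are public and easy to enumerate; without the key, anyone could hash a list of known logins and reverse the mapping.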

Troubleshooting

Common Issues

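  1. Rate Limiting: The GitHub REST API caps authenticated users at 5,000 requests per hour, and a full download of a large repository can approach that quota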

    • Solution: Add delays between requests if needed
    • Check rate limit status: gh api rate_limit
  2. Large Repositories: Many PRs may cause long download times

    • Solution: Use date-based filtering
    • Consider downloading in chunks
  3. Authentication Issues: GitHub CLI not authenticated

    • Solution: gh auth login
    • Verify access: gh auth status
  4. Missing Dependencies: jq not installed

    • Ubuntu/Debian: sudo apt install jq
    • macOS: brew install jq
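
For the rate-limiting case, a script that wraps the recipes could decide how long to pause from the output of gh api rate_limit. The sketch below only interprets that JSON (the resources.core object with its remaining and reset fields, where reset is a Unix timestamp); seconds_until_reset and the min_remaining threshold are hypothetical, and how the pause is wired into a download loop is left open.

```python
import json
import time
from typing import Optional


def seconds_until_reset(rate_limit_json: str, min_remaining: int = 100,
                        now: Optional[float] = None) -> float:
    """Return how long to wait before issuing more core API requests.

    Returns 0.0 while at least `min_remaining` requests are left in the
    current window; otherwise returns the seconds until the window resets.
    """
    core = json.loads(rate_limit_json)["resources"]["core"]
    if core["remaining"] >= min_remaining:
        return 0.0
    current = time.time() if now is None else now
    return max(0.0, float(core["reset"] - current))
```

The text passed in would typically be the captured standard output of gh api rate_limit, and the returned value can be handed to time.sleep before the next batch of requests.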

Performance Optimization

  • Use date filtering for incremental updates
  • Download single PRs for focused analysis
  • Export to CSV for better tool compatibility
  • Consider compressed storage for large datasets
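
For the compressed-storage suggestion, gzip works well on comment data because JSON keys and boilerplate repeat heavily. The following is a small sketch using only the Python standard library; save_compressed and load_compressed are hypothetical helper names, and the .json.gz extension is a convention rather than something the recipes produce.

```python
import gzip
import json


def save_compressed(prs, path):
    """Write PR data as gzip-compressed JSON (for example 'data.json.gz')."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(prs, handle)


def load_compressed(path):
    """Read PR data back from a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Files written this way can still be inspected from the shell with tools that read gzip streams, so compressing an archive does not lock the data into Python.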

File Management

Directory Structure

data/
└── github/
    ├── pr_comments_20240315_143022.json    # Full download
    ├── pr_comments_20240315_143022.csv     # CSV export
    ├── pr_123_comments.json                # Single PR
    └── pr_comments_since_20240301_*.json   # Date-filtered

Cleanup

# Remove JSON data files older than 30 days (CSV exports are not matched)
find data/github -name "*.json" -mtime +30 -delete

# Keep only the 5 most recent files
ls -t data/github/pr_comments_*.json | tail -n +6 | xargs rm -f

Integration with Development Workflow

These recipes can enhance your development workflow by:

  1. Code Review Analysis: Understanding review patterns and quality
  2. Team Performance: Measuring collaboration and engagement
  3. Knowledge Management: Tracking decisions and discussions
  4. Process Improvement: Identifying bottlenecks and inefficiencies
  5. Onboarding: Helping new team members understand project history

Together, these recipes turn pull request discussion history into data that can be searched, exported, and measured.