kgraph-mcp-agent-platform / docs /developer-guide /github-data-collection.md
BasalGanglia's picture
πŸ† Multi-Track Hackathon Submission
1f2d50a verified
# GitHub Data Collection with Just + GitHub CLI
This document describes the GitHub CLI-based data collection recipes for analyzing pull request comments, reviews, and collaboration patterns in the KGraph-MCP project.
## Prerequisites
- [GitHub CLI](https://cli.github.com/) installed and authenticated
- `jq` command-line JSON processor
- Access to the target repository
## Available Recipes
### 1. Download All PR Comments
```bash
just gh-download-pr-comments
```
**Purpose**: Downloads all pull request comments and review comments from the repository.
**Output**:
- JSON file in `data/github/pr_comments_YYYYMMDD_HHMMSS.json`
- Includes general comments, review comments, and PR metadata
**Features**:
- Progress tracking for large repositories
- Handles pagination automatically
- Combines PR metadata with comments
- Summary statistics at completion
### 2. Download Single PR Comments
```bash
just gh-download-pr-comments-single 123
```
**Purpose**: Downloads all comments for a specific PR number.
**Output**:
- JSON file in `data/github/pr_123_comments.json`
- Includes general comments, review comments, reviews, and full PR details
**Use Cases**:
- Analyzing specific high-activity PRs
- Debugging collaboration issues
- Creating detailed PR reports
### 3. Download Comments Since Date
```bash
just gh-download-pr-comments-since "2024-01-01"
```
**Purpose**: Downloads comments from PRs updated since a specific date.
**Date Formats Supported**:
- `YYYY-MM-DD` (e.g., "2024-01-01")
- `YYYY-MM-DDTHH:MM:SS` (e.g., "2024-01-01T12:00:00")
**Use Cases**:
- Incremental data collection
- Analyzing recent activity
- Sprint-specific analysis
### 4. Export Comments to CSV
```bash
just gh-export-pr-comments-csv
```
**Purpose**: Converts the most recent JSON download to CSV format.
**CSV Columns**:
- PR_Number
- PR_Title
- PR_State
- Comment_Type (general/review)
- Comment_ID
- Author
- Created_At
- Updated_At
- Body
**Use Cases**:
- Excel/Google Sheets analysis
- Data visualization tools
- Statistical analysis
### 5. Search PR Comments
```bash
just gh-search-pr-comments "bug fix"
```
**Purpose**: Search through downloaded comments for specific text.
**Features**:
- Case-insensitive search
- Searches both general and review comments
- Shows context (PR number, title, author, date)
- Regex patterns supported
**Use Cases**:
- Finding discussions about specific topics
- Tracking decision-making processes
- Code review pattern analysis
### 6. Show PR Statistics
```bash
just gh-show-pr-stats
```
**Purpose**: Generate comprehensive statistics from downloaded data.
**Statistics Included**:
- Total PRs, comments, and review comments
- Top commenters and reviewers
- PRs with most activity
- Activity trends by month
- Participation metrics
## Data Structure
### JSON Output Format
```json
[
{
"number": 123,
"title": "Add new feature",
"state": "merged",
"createdAt": "2024-01-01T10:00:00Z",
"closedAt": "2024-01-02T15:30:00Z",
"author": {
"login": "developer1"
},
"comments": [
{
"id": 1234567,
"author": {
"login": "reviewer1"
},
"body": "This looks great!",
"createdAt": "2024-01-01T11:00:00Z",
"updatedAt": "2024-01-01T11:00:00Z"
}
],
"review_comments": [
{
"id": 2345678,
"user": {
"login": "reviewer2"
},
"body": "Consider using const instead of let here.",
"created_at": "2024-01-01T12:00:00Z",
"updated_at": "2024-01-01T12:00:00Z",
"path": "src/component.js",
"line": 42
}
]
}
]
```
## Analysis Examples
### Finding Code Review Patterns
```bash
# Download recent data
just gh-download-pr-comments-since "2024-01-01"
# Search for security-related discussions
just gh-search-pr-comments "security"
# Export to CSV for analysis
just gh-export-pr-comments-csv
# Generate statistics
just gh-show-pr-stats
```
### Team Collaboration Analysis
```bash
# Download all historical data
just gh-download-pr-comments
# View top contributors
just gh-show-pr-stats
# Search for mentoring patterns
just gh-search-pr-comments "looks good to me"
just gh-search-pr-comments "consider"
just gh-search-pr-comments "suggestion"
```
### Quality Assurance Tracking
```bash
# Search for quality-related discussions
just gh-search-pr-comments "test"
just gh-search-pr-comments "bug"
just gh-search-pr-comments "performance"
# Find PRs with extensive review activity
just gh-show-pr-stats | grep "PRs with Most Comments"
```
## Automation and CI/CD
### Scheduled Data Collection
Create a cron job or GitHub Action to collect data regularly:
```yaml
# .github/workflows/data-collection.yml
name: Weekly Data Collection
on:
schedule:
- cron: '0 2 * * 1' # Every Monday at 2 AM
jobs:
collect-data:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup GitHub CLI
run: |
gh auth login --with-token <<< "${{ secrets.GITHUB_TOKEN }}"
- name: Collect PR Comments
run: |
just gh-download-pr-comments-since "$(date -d '7 days ago' +%Y-%m-%d)"
- name: Archive Data
uses: actions/upload-artifact@v4
with:
name: pr-comments-data
path: data/github/
```
### Integration with Analytics Tools
The collected data can be integrated with various analytics tools:
1. **Jupyter Notebooks**: Load JSON data for Python analysis
2. **R/RStudio**: Import CSV for statistical analysis
3. **Tableau/Power BI**: Connect to CSV for visualization
4. **Google Sheets**: Import CSV for collaborative analysis
## Privacy and Security
### Data Sensitivity
- Comment data may contain sensitive information
- Review comments might include security discussions
- Author information is public but should be handled responsibly
### Recommendations
1. **Access Control**: Limit access to downloaded data files
2. **Data Retention**: Implement retention policies for old data
3. **Anonymization**: Consider anonymizing data for certain analyses
4. **Compliance**: Ensure compliance with organizational data policies
## Troubleshooting
### Common Issues
1. **Rate Limiting**: GitHub API has rate limits
- Solution: Add delays between requests if needed
- Check rate limit status: `gh api rate_limit`
2. **Large Repositories**: Many PRs may cause long download times
- Solution: Use date-based filtering
- Consider downloading in chunks
3. **Authentication Issues**: GitHub CLI not authenticated
- Solution: `gh auth login`
- Verify access: `gh auth status`
4. **Missing Dependencies**: `jq` not installed
- Ubuntu/Debian: `sudo apt install jq`
- macOS: `brew install jq`
### Performance Optimization
- Use date filtering for incremental updates
- Download single PRs for focused analysis
- Export to CSV for better tool compatibility
- Consider compressed storage for large datasets
## File Management
### Directory Structure
```
data/
└── github/
β”œβ”€β”€ pr_comments_20240315_143022.json # Full download
β”œβ”€β”€ pr_comments_20240315_143022.csv # CSV export
β”œβ”€β”€ pr_123_comments.json # Single PR
└── pr_comments_since_20240301_*.json # Date-filtered
```
### Cleanup
```bash
# Remove old data files (older than 30 days)
find data/github -name "*.json" -mtime +30 -delete
# Keep only the 5 most recent files
ls -t data/github/pr_comments_*.json | tail -n +6 | xargs rm -f
```
## Integration with Development Workflow
These recipes can enhance your development workflow by:
1. **Code Review Analysis**: Understanding review patterns and quality
2. **Team Performance**: Measuring collaboration and engagement
3. **Knowledge Management**: Tracking decisions and discussions
4. **Process Improvement**: Identifying bottlenecks and inefficiencies
5. **Onboarding**: Helping new team members understand project history
The GitHub data collection system provides powerful insights into your development process and team collaboration patterns.