# IQuest-Loop-Instruct GGUF Conversion Summary
**Date**: 2026-01-07
**Model**: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
**Achievement**: World's first IQuest-Loop-Instruct GGUF conversion
## Files Created
| File | Size | Format | SHA256 | Completion Time |
|------|------|--------|--------|----------------|
| IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | 75GB | F16 | `b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289` | 2m 6s |
| IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf | 23GB | Q4_K_M | `b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3` | 2m 23s |
| IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf | 27GB | Q5_K_M | `a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba` | 1m 41s |
| IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf | 40GB | Q8_0 | `a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06` | 53s |
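Beyond `sha256sum`, the checksums in the table can be verified programmatically. A minimal sketch (the function name `sha256_of` is ours, not part of any tool mentioned here) that streams the file so even the 75GB F16 artifact never has to fit in memory:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Compare the returned digest against the SHA256 column above before loading a file.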
## Technical Implementation
### Architecture Support
Created `IQuestLoopCoderModel` class in llama.cpp's `convert_hf_to_gguf.py`:
- Inherits from `LlamaModel` (compatible architecture base)
- Maps 160 loop-specific `gate_projections` tensors to GGUF format
- Preserves loop parameters in metadata:
  - `llama.loop.num`: 2
  - `llama.loop.window_size`: 64
### Tensor Mapping
**Gate Projections** (160 tensors total):
- Source: `model.gate_projections.{0-79}.{weight|bias}`
- Shape: `[128, 40]` weight + `[40]` bias per layer
- Target: `blk.{layer}.loop_gate.{weight|bias}`
- Quantization: falls back to q5_0/q5_1 in the Q4_K_M/Q5_K_M builds (these tensors are too small for the K-quant block layout)
**Standard Tensors** (721 tensors):
- Uses LlamaModel's standard tensor mapping
- Attention: Q, K, V, Output projections
- FFN: Gate, Up, Down projections
- Normalization: Attention & FFN RMS norms
## Conversion Statistics
- **Total Tensors**: 883
  - Standard Llama: 721
  - Loop gates: 160 (80 layers × 2 per layer)
  - Embeddings: 2
- **Vocabulary Size**: 76,800 tokens
- **Context Length**: 131,072 tokens
- **Hidden Layers**: 80
- **Attention Heads**: 40 (8 KV heads)
- **Hidden Size**: 5,120
- **FFN Size**: 27,648
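The tensor counts above can be cross-checked with a quick accounting (all per-category figures are taken from this summary):

```python
num_layers = 80
loop_gates = num_layers * 2   # one gate weight + one gate bias per layer
standard_llama = 721          # attention, FFN, and norm tensors
embeddings = 2                # embedding tensors, per the count above
total = standard_llama + loop_gates + embeddings
```

This reproduces the 160 loop-gate tensors and the 883-tensor total reported above.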
## Current Status
### What Works ✅
1. **Conversion**: Successfully converts HuggingFace → GGUF F16
2. **Quantization**: All standard quantization levels work (Q4_K_M, Q5_K_M, Q8_0, etc.)
3. **Metadata**: Loop parameters correctly stored in GGUF metadata
4. **Tensor Preservation**: All 883 tensors including loop gates successfully converted
5. **Ollama Import**: Ollama accepts and imports the GGUF file
### What Needs Work 🔧
1. **Runtime Support**: llama.cpp runtime needs loop attention mechanism implementation
2. **Inference**: Model loads but cannot run inference yet (loop gates not used)
3. **Testing**: Need to validate loop attention behavior matches original PyTorch
## Implementation Details
### Modified Files
**`/tmp/convert_hf_to_gguf.py`** (lines 2695-2733):
```python
@ModelBase.register("IQuestLoopCoderForCausalLM")
class IQuestLoopCoderModel(LlamaModel):
    """IQuest Loop Coder model with recurrent loop attention mechanism."""
    model_arch = gguf.MODEL_ARCH.LLAMA

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loop_num = self.hparams.get("loop_num", 2)
        self.loop_window_size = self.hparams.get("loop_window_size", 64)

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_uint32("llama.loop.num", self.loop_num)
        self.gguf_writer.add_uint32("llama.loop.window_size", self.loop_window_size)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
        # Map model.gate_projections.{layer}.{weight|bias} -> blk.{layer}.loop_gate.{...};
        # everything else falls through to LlamaModel's standard mapping.
        if "gate_projections" in name:
            parts = name.split(".")
            if len(parts) >= 4 and parts[1] == "gate_projections":
                layer_num = parts[2]
                param_type = parts[3]
                new_name = f"blk.{layer_num}.loop_gate.{param_type}"
                return [(new_name, data_torch)]
        return super().modify_tensors(data_torch, name, bid)
```
## Next Steps for Community
### For llama.cpp Maintainers
1. **Implement Loop Attention Runtime**:
   - Read `llama.loop.num` and `llama.loop.window_size` from GGUF metadata
   - Load the `blk.{layer}.loop_gate.{weight|bias}` tensors
   - Implement the recurrent loop attention mechanism in CPU/CUDA kernels
   - Reference: original implementation at IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
2. **Add Unit Tests**:
   - Verify tensor loading
   - Validate loop parameter reading
   - Test against the PyTorch reference implementation
3. **Documentation**:
   - Add the Loop architecture to the supported models list
   - Document loop parameter usage
   - Provide conversion examples
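As a purely hypothetical illustration of where the gate tensors could plug in (the actual loop-attention math is defined only by the IQuestLab reference implementation, and both function names here are ours), a runtime might project features through a per-layer gate, matching the `[128, 40]` weight / `[40]` bias shapes, and blend a looped pass with the base attention output:

```python
import math

def loop_gate(features, weight, bias):
    """Project a feature vector to one gate per head in (0, 1).

    weight: [in_dim][num_heads] list-of-lists, bias: [num_heads] --
    analogous to the [128, 40] weight / [40] bias stored per layer.
    """
    num_heads = len(bias)
    gates = []
    for h in range(num_heads):
        z = bias[h] + sum(features[i] * weight[i][h] for i in range(len(features)))
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid
    return gates

def blend(base_out, loop_out, gates):
    """Gated per-head blend of the base output and the looped pass."""
    return [g * l + (1.0 - g) * b for b, l, g in zip(base_out, loop_out, gates)]
```

This is only a sketch of the general gated-blend pattern; the kernel work item above requires matching the reference model's exact formulation.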
### For Model Users
1. **Wait for Runtime Support**: These GGUFs will work once llama.cpp implements loop attention
2. **Use Regular Variant**: For immediate use, IQuest-Coder (non-Loop) is fully supported
3. **Contribute**: Help implement loop attention in llama.cpp runtime
## Performance Expectations (Once Runtime Supports Loop)
Based on quantization levels:
- **Q4_K_M (23GB)**: Recommended for most deployments, 30% of original size
- **Q5_K_M (27GB)**: Better quality, 35% of original size
- **Q8_0 (40GB)**: Excellent quality, 53% of original size, minimal loss
- **F16 (75GB)**: Full precision reference
## Docker Build System
**Image**: `avarok/dgx-spark-complete:latest`
**Base**: `dgx-vllm:cutlass-nvfp4-v15`
**Includes**:
- vLLM v15 with IQuest Loop Coder support
- llama.cpp with CUDA support
- Conversion scripts (convert_to_gguf.sh, quantize.sh)
- Optimized for NVIDIA GB10 (SM 12.1)
## References
- **Original Model**: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
- **llama.cpp Issue**: #18517 - Request for Loop-Instruct support
- **PR Inspiration**: #18524 - Regular IQuestCoder support
- **Debugging Journey**: /workspace/builds/DEBUGGING_JOURNEY.md
## Credits
- **Hardware**: Dual NVIDIA DGX Spark with GB10 GPUs
- **Model**: IQuestLab team for Loop architecture innovation
- **Tools**: llama.cpp (ggerganov), vLLM team
- **First GGUF**: This conversion is the first Loop-Instruct variant in GGUF format
## Verification
SHA256 checksums provided for all files. Verify before use:
```bash
sha256sum IQuest-Coder-V1-40B-Loop-Instruct-*.gguf
```
---
**Status**: Conversion successful, runtime support pending
**Date**: 2026-01-07
**Next**: Submit PR to llama.cpp with implementation + publish to HuggingFace