# IQuest-Loop-Instruct GGUF Conversion Summary

**Date**: 2026-01-07
**Model**: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
**Achievement**: World's first IQuest-Loop-Instruct GGUF conversion

## Files Created

| File | Size | Format | SHA256 | Completion Time |
|------|------|--------|--------|----------------|
| IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | 75GB | F16 | `b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289` | 2m 6s |
| IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf | 23GB | Q4_K_M | `b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3` | 2m 23s |
| IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf | 27GB | Q5_K_M | `a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba` | 1m 41s |
| IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf | 40GB | Q8_0 | `a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06` | 53s |
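
The files above follow the standard llama.cpp conversion pipeline; a sketch of the commands (paths are illustrative, and the conversion script must include the `IQuestLoopCoderModel` patch described below):

```bash
# 1. HuggingFace -> GGUF F16, using llama.cpp's convert_hf_to_gguf.py
python convert_hf_to_gguf.py /models/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf --outtype f16

# 2. Quantize with the llama-quantize binary from the llama.cpp build
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M
```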

## Technical Implementation

### Architecture Support

Created `IQuestLoopCoderModel` class in llama.cpp's `convert_hf_to_gguf.py`:
- Inherits from `LlamaModel` (compatible architecture base)
- Maps 160 loop-specific `gate_projections` tensors to GGUF format
- Preserves loop parameters in metadata:
  - `llama.loop.num`: 2
  - `llama.loop.window_size`: 64

### Tensor Mapping

**Gate Projections** (160 tensors total):
- Source: `model.gate_projections.{0-79}.{weight|bias}`
- Shape: `[128, 40]` weight + `[40]` bias per layer
- Target: `blk.{layer}.loop_gate.{weight|bias}`
- Quantization: llama.cpp falls back to q5_0/q5_1 for these tensors under Q4_K_M/Q5_K_M (the 40-wide rows are too small for the K-quant block size)

**Standard Tensors** (721 tensors):
- Uses LlamaModel's standard tensor mapping
- Attention: Q, K, V, Output projections
- FFN: Gate, Up, Down projections
- Normalization: Attention & FFN RMS norms

## Conversion Statistics

- **Total Tensors**: 883
  - Standard Llama: 721
  - Loop Gates: 160 (80 layers × 2 per layer)
  - Embeddings: 2
- **Vocabulary Size**: 76,800 tokens
- **Context Length**: 131,072 tokens
- **Hidden Layers**: 80
- **Attention Heads**: 40 (8 KV heads)
- **Hidden Size**: 5,120
- **FFN Size**: 27,648
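
As a sanity check, the tensor counts above add up (the 9-tensors-per-layer breakdown and the single final norm are an assumed decomposition of the 721 standard tensors):

```python
n_layers = 80
per_layer = 4 + 3 + 2                # attn Q/K/V/O, FFN gate/up/down, two RMS norms
standard = n_layers * per_layer + 1  # plus one final output norm
loop_gates = n_layers * 2            # one gate weight + one gate bias per layer
embeddings = 2                       # token embedding + output head
total = standard + loop_gates + embeddings

assert standard == 721
assert loop_gates == 160
assert total == 883
```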

## Current Status

### What Works ✅

1. **Conversion**: Successfully converts HuggingFace → GGUF F16
2. **Quantization**: All standard quantization levels work (Q4_K_M, Q5_K_M, Q8_0, etc.)
3. **Metadata**: Loop parameters correctly stored in GGUF metadata
4. **Tensor Preservation**: All 883 tensors including loop gates successfully converted
5. **Ollama Import**: Ollama accepts and imports the GGUF file

### What Needs Work 🔧

1. **Runtime Support**: the llama.cpp runtime still lacks an implementation of the loop attention mechanism
2. **Inference**: the model loads but cannot run inference yet (loop gates are ignored)
3. **Testing**: loop attention behavior still needs validation against the original PyTorch implementation

## Implementation Details

### Modified Files

**`/tmp/convert_hf_to_gguf.py`** (lines 2695-2733):
```python
@ModelBase.register("IQuestLoopCoderForCausalLM")
class IQuestLoopCoderModel(LlamaModel):
    """IQuest Loop Coder model with recurrent loop attention mechanism."""
    model_arch = gguf.MODEL_ARCH.LLAMA

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loop_num = self.hparams.get('loop_num', 2)
        self.loop_window_size = self.hparams.get('loop_window_size', 64)

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_uint32("llama.loop.num", self.loop_num)
        self.gguf_writer.add_uint32("llama.loop.window_size", self.loop_window_size)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
        if "gate_projections" in name:
            parts = name.split('.')
            if len(parts) >= 4 and parts[1] == "gate_projections":
                layer_num = parts[2]
                param_type = parts[3]
                new_name = f"blk.{layer_num}.loop_gate.{param_type}"
                return [(new_name, data_torch)]
        return super().modify_tensors(data_torch, name, bid)
```

## Next Steps for Community

### For llama.cpp Maintainers

1. **Implement Loop Attention Runtime**:
   - Read `llama.loop.num` and `llama.loop.window_size` from GGUF metadata
   - Load `blk.{layer}.loop_gate.{weight|bias}` tensors
   - Implement recurrent loop attention mechanism in CUDA/CPU kernels
   - Reference: Original implementation at IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct

2. **Add Unit Tests**:
   - Verify tensor loading
   - Validate loop parameter reading
   - Test against PyTorch reference implementation

3. **Documentation**:
   - Add Loop architecture to supported models list
   - Document loop parameter usage
   - Provide conversion examples
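
A minimal smoke test along the lines of the unit tests suggested above could look like this (a sketch assuming the `gguf` Python package that ships with llama.cpp; `check_loop_gguf` and `loop_gate_names` are hypothetical helpers):

```python
def loop_gate_names(tensor_names, n_layers: int = 80) -> list[str]:
    """Filter loop-gate tensor names; expect 2 per layer (weight + bias)."""
    return [n for n in tensor_names if ".loop_gate." in n]

def check_loop_gguf(path: str, n_layers: int = 80) -> None:
    """Smoke-test a converted Loop GGUF: loop metadata and gate tensors present."""
    from gguf import GGUFReader  # pip install gguf
    reader = GGUFReader(path)
    for key in ("llama.loop.num", "llama.loop.window_size"):
        assert key in reader.fields, f"missing metadata key: {key}"
    gates = loop_gate_names((t.name for t in reader.tensors), n_layers)
    assert len(gates) == 2 * n_layers, f"expected {2 * n_layers} gate tensors, got {len(gates)}"
```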

### For Model Users

1. **Wait for Runtime Support**: These GGUFs will work once llama.cpp implements loop attention
2. **Use Regular Variant**: For immediate use, IQuest-Coder (non-Loop) is fully supported
3. **Contribute**: Help implement loop attention in llama.cpp runtime

## Performance Expectations (Once Runtime Supports Loop)

Based on quantization levels:

- **Q4_K_M (23GB)**: Recommended for most deployments, 30% of original size
- **Q5_K_M (27GB)**: Better quality, 36% of original size
- **Q8_0 (40GB)**: Excellent quality, 53% of original size, minimal loss
- **F16 (75GB)**: Full precision reference

## Docker Build System

**Image**: `avarok/dgx-spark-complete:latest`
**Base**: `dgx-vllm:cutlass-nvfp4-v15`
**Includes**:
- vLLM v15 with IQuest Loop Coder support
- llama.cpp with CUDA support
- Conversion scripts (convert_to_gguf.sh, quantize.sh)
- Optimized for NVIDIA GB10 (SM 12.1)

## References

- **Original Model**: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
- **llama.cpp Issue**: #18517 - Request for Loop-Instruct support
- **PR Inspiration**: #18524 - Regular IQuestCoder support
- **Debugging Journey**: /workspace/builds/DEBUGGING_JOURNEY.md

## Credits

- **Hardware**: Dual NVIDIA DGX Spark with GB10 GPUs
- **Model**: IQuestLab team for Loop architecture innovation
- **Tools**: llama.cpp (ggerganov), vLLM team
- **First GGUF**: This conversion is the first Loop-Instruct variant in GGUF format

## Verification

SHA256 checksums provided for all files. Verify before use:
```bash
sha256sum IQuest-Coder-V1-40B-Loop-Instruct-*.gguf
```

---

**Status**: Conversion successful, runtime support pending
**Date**: 2026-01-07
**Next**: Submit PR to llama.cpp with implementation + publish to HuggingFace