
IQuest-Loop-Instruct GGUF Conversion Summary

Date: 2026-01-07
Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
Achievement: World's first IQuest-Loop-Instruct GGUF conversion

Files Created

| File | Size | Format | SHA256 | Completion Time |
|------|------|--------|--------|-----------------|
| IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | 75GB | F16 | b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289 | 2m 6s |
| IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf | 23GB | Q4_K_M | b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3 | 2m 23s |
| IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf | 27GB | Q5_K_M | a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba | 1m 41s |
| IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf | 40GB | Q8_0 | a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06 | 53s |

Technical Implementation

Architecture Support

Created IQuestLoopCoderModel class in llama.cpp's convert_hf_to_gguf.py:

  • Inherits from LlamaModel (compatible architecture base)
  • Maps 160 loop-specific gate_projections tensors to GGUF format
  • Preserves loop parameters in metadata:
    • llama.loop.num: 2
    • llama.loop.window_size: 64

Tensor Mapping

Gate Projections (160 tensors total):

  • Source: model.gate_projections.{0-79}.{weight|bias}
  • Shape: [128, 40] weight + [40] bias per layer
  • Target: blk.{layer}.loop_gate.{weight|bias}
  • Quantization: falls back to q5_0/q5_1 under Q4_K_M/Q5_K_M (the gate tensors are too small for k-quant block sizes)

Standard Tensors (721 tensors):

  • Uses LlamaModel's standard tensor mapping
  • Attention: Q, K, V, Output projections
  • FFN: Gate, Up, Down projections
  • Normalization: Attention & FFN RMS norms

Conversion Statistics

  • Total Tensors: 883
    • Standard Llama: 721
    • Loop Gates: 160 (80 layers × 2 per layer)
    • Embeddings: 2
  • Vocabulary Size: 76,800 tokens
  • Context Length: 131,072 tokens
  • Hidden Layers: 80
  • Attention Heads: 40 (8 KV heads)
  • Hidden Size: 5,120
  • FFN Size: 27,648

Current Status

What Works ✅

  1. Conversion: Successfully converts HuggingFace → GGUF F16
  2. Quantization: All standard quantization levels work (Q4_K_M, Q5_K_M, Q8_0, etc.)
  3. Metadata: Loop parameters correctly stored in GGUF metadata
  4. Tensor Preservation: All 883 tensors including loop gates successfully converted
  5. Ollama Import: Ollama accepts and imports the GGUF file (example commands after this list)
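
As a concrete example, quantization and Ollama import follow the standard llama.cpp and Ollama workflows. This is a minimal sketch using the file names from the table above; the Modelfile and the local model name iquest-loop-instruct are placeholders, and the quantize binary is named quantize in older llama.cpp builds:

# Quantize the F16 conversion down to Q4_K_M with llama.cpp's quantize tool
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M

# Import the quantized GGUF into Ollama via a minimal Modelfile
echo 'FROM ./IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf' > Modelfile
ollama create iquest-loop-instruct -f Modelfile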

What Needs Work 🔧

  1. Runtime Support: the llama.cpp runtime still needs an implementation of the loop attention mechanism
  2. Inference: the model loads, but cannot run inference yet (the loop gates are not used)
  3. Testing: loop attention behavior needs to be validated against the original PyTorch implementation

Implementation Details

Modified Files

/tmp/convert_hf_to_gguf.py (lines 2695-2733):

@ModelBase.register("IQuestLoopCoderForCausalLM")
class IQuestLoopCoderModel(LlamaModel):
    """IQuest Loop Coder model with recurrent loop attention mechanism."""
    model_arch = gguf.MODEL_ARCH.LLAMA

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loop_num = self.hparams.get('loop_num', 2)
        self.loop_window_size = self.hparams.get('loop_window_size', 64)

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_uint32("llama.loop.num", self.loop_num)
        self.gguf_writer.add_uint32("llama.loop.window_size", self.loop_window_size)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
        # Remap loop gate tensors: model.gate_projections.{layer}.{weight|bias}
        # -> blk.{layer}.loop_gate.{weight|bias}
        if "gate_projections" in name:
            parts = name.split('.')
            if len(parts) >= 4 and parts[1] == "gate_projections":
                layer_num = parts[2]
                param_type = parts[3]
                new_name = f"blk.{layer_num}.loop_gate.{param_type}"
                return [(new_name, data_torch)]
        # All other tensors use LlamaModel's standard mapping
        return super().modify_tensors(data_torch, name, bid)
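
With the class registered via @ModelBase.register, the conversion itself is a standard invocation of the modified converter. A representative command (the checkpoint path is a placeholder):

# Convert the HuggingFace checkpoint to GGUF F16 using the modified converter
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf --outtype f16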

Next Steps for Community

For llama.cpp Maintainers

  1. Implement Loop Attention Runtime:

    • Read llama.loop.num and llama.loop.window_size from GGUF metadata (these keys can be inspected with the command after this list)
    • Load blk.{layer}.loop_gate.{weight|bias} tensors
    • Implement recurrent loop attention mechanism in CUDA/CPU kernels
    • Reference: Original implementation at IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
  2. Add Unit Tests:

    • Verify tensor loading
    • Validate loop parameter reading
    • Test against PyTorch reference implementation
  3. Documentation:

    • Add Loop architecture to supported models list
    • Document loop parameter usage
    • Provide conversion examples
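
As a starting point, the loop-specific metadata and tensors can be confirmed with the gguf-dump tool from llama.cpp's gguf-py package (the script name and location vary between versions; the pip-installed entry point is assumed here):

# Dump GGUF metadata and tensor names, filtering for loop-related entries
gguf-dump IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | grep -i loop

The output should include llama.loop.num, llama.loop.window_size, and the blk.{layer}.loop_gate.{weight|bias} tensors described above.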

For Model Users

  1. Wait for Runtime Support: These GGUFs will work once llama.cpp implements loop attention
  2. Use Regular Variant: For immediate use, IQuest-Coder (non-Loop) is fully supported
  3. Contribute: Help implement loop attention in llama.cpp runtime

Performance Expectations (Once Runtime Supports Loop)

Based on quantization levels:

  • Q4_K_M (23GB): Recommended for most deployments, 30% of original size
  • Q5_K_M (27GB): Better quality, 35% of original size
  • Q8_0 (40GB): Excellent quality, 53% of original size, minimal loss
  • F16 (75GB): Full precision reference

Docker Build System

Image: avarok/dgx-spark-complete:latest
Base: dgx-vllm:cutlass-nvfp4-v15
Includes:

  • vLLM v15 with IQuest Loop Coder support
  • llama.cpp with CUDA support
  • Conversion scripts (convert_to_gguf.sh, quantize.sh)
  • Optimized for NVIDIA GB10 (SM 12.1)
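
A minimal way to launch the image for conversion work, assuming the NVIDIA Container Toolkit is installed and bash is available as the container command (the volume mount is illustrative):

# Start an interactive shell with GPU access and a host directory for model files
docker run --gpus all -it -v "$PWD/models:/models" avarok/dgx-spark-complete:latest bash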


Credits

  • Hardware: Dual NVIDIA DGX Spark with GB10 GPUs
  • Model: IQuestLab team for Loop architecture innovation
  • Tools: llama.cpp (ggerganov), vLLM team
  • First GGUF: This conversion is the first Loop-Instruct variant in GGUF format

Verification

SHA256 checksums provided for all files. Verify before use:

sha256sum IQuest-Coder-V1-40B-Loop-Instruct-*.gguf

Status: Conversion successful, runtime support pending
Date: 2026-01-07
Next: Submit PR to llama.cpp with implementation + publish to HuggingFace