
IQuest-Loop-Instruct GGUF Conversion Summary

Date: 2026-01-07
Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
Achievement: World's first IQuest-Loop-Instruct GGUF conversion

Files Created

| File | Size | Format | SHA256 | Completion Time |
|------|------|--------|--------|-----------------|
| IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | 75GB | F16 | b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289 | 2m 6s |
| IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf | 23GB | Q4_K_M | b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3 | 2m 23s |
| IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf | 27GB | Q5_K_M | a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba | 1m 41s |
| IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf | 40GB | Q8_0 | a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06 | 53s |

Technical Implementation

Architecture Support

Created IQuestLoopCoderModel class in llama.cpp's convert_hf_to_gguf.py:

  • Inherits from LlamaModel (compatible architecture base)
  • Maps 160 loop-specific gate_projections tensors to GGUF format
  • Preserves loop parameters in metadata:
    • llama.loop.num: 2
    • llama.loop.window_size: 64

Tensor Mapping

Gate Projections (160 tensors total):

  • Source: model.gate_projections.{0-79}.{weight|bias}
  • Shape: [128, 40] weight + [40] bias per layer
  • Target: blk.{layer}.loop_gate.{weight|bias}
  • Quantization: falls back to q5_0/q5_1 under Q4_K_M/Q5_K_M (the gate tensors are too small for k-quant block sizes)

Standard Tensors (721 tensors):

  • Uses LlamaModel's standard tensor mapping
  • Attention: Q, K, V, Output projections
  • FFN: Gate, Up, Down projections
  • Normalization: Attention & FFN RMS norms

Conversion Statistics

  • Total Tensors: 883
    • Standard Llama: 721
    • Loop Gates: 160 (80 layers × 2 per layer)
    • Embeddings: 2
  • Vocabulary Size: 76,800 tokens
  • Context Length: 131,072 tokens
  • Hidden Layers: 80
  • Attention Heads: 40 (8 KV heads)
  • Hidden Size: 5,120
  • FFN Size: 27,648

Current Status

What Works ✅

  1. Conversion: Successfully converts HuggingFace → GGUF F16
  2. Quantization: All standard quantization levels work (Q4_K_M, Q5_K_M, Q8_0, etc.)
  3. Metadata: Loop parameters correctly stored in GGUF metadata
  4. Tensor Preservation: All 883 tensors including loop gates successfully converted
  5. Ollama Import: Ollama accepts and imports the GGUF file (example commands after this list)
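
As a concrete example, quantization and Ollama import follow the standard llama.cpp and Ollama workflows. This is a minimal sketch using the file names from the table above; the Modelfile and the local model name iquest-loop-instruct are placeholders, and the quantize binary is named quantize in older llama.cpp builds:

# Quantize the F16 conversion down to Q4_K_M with llama.cpp's quantize tool
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M

# Import the quantized GGUF into Ollama via a minimal Modelfile
echo 'FROM ./IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf' > Modelfile
ollama create iquest-loop-instruct -f Modelfile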

What Needs Work 🔧

  1. Runtime Support: the llama.cpp runtime still needs an implementation of the loop attention mechanism
  2. Inference: the model loads, but cannot run inference yet (the loop gates are not used)
  3. Testing: loop attention behavior needs to be validated against the original PyTorch implementation

Implementation Details

Modified Files

/tmp/convert_hf_to_gguf.py (lines 2695-2733):

@ModelBase.register("IQuestLoopCoderForCausalLM")
class IQuestLoopCoderModel(LlamaModel):
    """IQuest Loop Coder model with recurrent loop attention mechanism."""
    model_arch = gguf.MODEL_ARCH.LLAMA

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loop_num = self.hparams.get('loop_num', 2)
        self.loop_window_size = self.hparams.get('loop_window_size', 64)

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_uint32("llama.loop.num", self.loop_num)
        self.gguf_writer.add_uint32("llama.loop.window_size", self.loop_window_size)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
        # Remap loop gate tensors: model.gate_projections.{layer}.{weight|bias}
        # -> blk.{layer}.loop_gate.{weight|bias}
        if "gate_projections" in name:
            parts = name.split('.')
            if len(parts) >= 4 and parts[1] == "gate_projections":
                layer_num = parts[2]
                param_type = parts[3]
                new_name = f"blk.{layer_num}.loop_gate.{param_type}"
                return [(new_name, data_torch)]
        # All other tensors use LlamaModel's standard mapping
        return super().modify_tensors(data_torch, name, bid)
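
With the class registered via @ModelBase.register, the conversion itself is a standard invocation of the modified converter. A representative command (the checkpoint path is a placeholder):

# Convert the HuggingFace checkpoint to GGUF F16 using the modified converter
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf --outtype f16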

Next Steps for Community

For llama.cpp Maintainers

  1. Implement Loop Attention Runtime:

    • Read llama.loop.num and llama.loop.window_size from GGUF metadata (these keys can be inspected with the command after this list)
    • Load blk.{layer}.loop_gate.{weight|bias} tensors
    • Implement recurrent loop attention mechanism in CUDA/CPU kernels
    • Reference: Original implementation at IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
  2. Add Unit Tests:

    • Verify tensor loading
    • Validate loop parameter reading
    • Test against PyTorch reference implementation
  3. Documentation:

    • Add Loop architecture to supported models list
    • Document loop parameter usage
    • Provide conversion examples
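
As a starting point, the loop-specific metadata and tensors can be confirmed with the gguf-dump tool from llama.cpp's gguf-py package (the script name and location vary between versions; the pip-installed entry point is assumed here):

# Dump GGUF metadata and tensor names, filtering for loop-related entries
gguf-dump IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | grep -i loop

The output should include llama.loop.num, llama.loop.window_size, and the blk.{layer}.loop_gate.{weight|bias} tensors described above.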

For Model Users

  1. Wait for Runtime Support: These GGUFs will work once llama.cpp implements loop attention
  2. Use Regular Variant: For immediate use, IQuest-Coder (non-Loop) is fully supported
  3. Contribute: Help implement loop attention in llama.cpp runtime

Performance Expectations (Once Runtime Supports Loop)

Based on quantization levels:

  • Q4_K_M (23GB): Recommended for most deployments, 30% of original size
  • Q5_K_M (27GB): Better quality, 35% of original size
  • Q8_0 (40GB): Excellent quality, 53% of original size, minimal loss
  • F16 (75GB): Full precision reference

Docker Build System

Image: avarok/dgx-spark-complete:latest
Base: dgx-vllm:cutlass-nvfp4-v15
Includes:

  • vLLM v15 with IQuest Loop Coder support
  • llama.cpp with CUDA support
  • Conversion scripts (convert_to_gguf.sh, quantize.sh)
  • Optimized for NVIDIA GB10 (SM 12.1)
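
A minimal way to launch the image for conversion work, assuming the NVIDIA Container Toolkit is installed and bash is available as the container command (the volume mount is illustrative):

# Start an interactive shell with GPU access and a host directory for model files
docker run --gpus all -it -v "$PWD/models:/models" avarok/dgx-spark-complete:latest bash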


Credits

  • Hardware: Dual NVIDIA DGX Spark with GB10 GPUs
  • Model: IQuestLab team for Loop architecture innovation
  • Tools: llama.cpp (ggerganov), vLLM team
  • First GGUF: This conversion is the first Loop-Instruct variant in GGUF format

Verification

SHA256 checksums provided for all files. Verify before use:

sha256sum IQuest-Coder-V1-40B-Loop-Instruct-*.gguf

Status: Conversion successful, runtime support pending
Date: 2026-01-07
Next: Submit PR to llama.cpp with implementation + publish to HuggingFace