all-MiniLM-L6-v2-code-search-512
Version: v1.0
Release Date: 2026-01-04
Base Model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Overview
all-MiniLM-L6-v2-code-search-512 is a lightweight, high-accuracy sentence-transformers model fine-tuned specifically for semantic code search and code embeddings. It maps natural language queries and source code into a shared embedding space, enabling fast and accurate retrieval of relevant code snippets across multiple programming languages.
On CodeSearchNet, this model achieves 90.3% Accuracy@1, significantly outperforming the base all-MiniLM-L6-v2 model while keeping the same 33M-parameter footprint.
Key Features
- Fine-tuned on 1.29M+ code–documentation pairs from CodeSearchNet
- Supports a 512-token context length (2× the base model's 256)
- Multi-language code support (Python, Java, JavaScript, PHP, Ruby, Go)
- 33M parameters for fast inference
- Ships with max_seq_length=512 pre-configured
Intended Use
Recommended use cases:
- Semantic code search
- Natural language to code retrieval
- Code–documentation matching
- Code similarity and clustering (see the clustering sketch at the end of this section)
- Developer tools and IDE integrations
Supported languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Go
- Other languages included in CodeSearchNet
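For the code similarity and clustering use case, the embeddings work with any off-the-shelf clustering algorithm. Below is a minimal sketch using scikit-learn's KMeans; the snippets and the choice of two clusters are illustrative assumptions, not part of this model's tooling.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('isuruwijesiri/all-MiniLM-L6-v2-code-search-512')
snippets = [
    "def add(a, b): return a + b",
    "def subtract(a, b): return a - b",
    "class User:\n    def __init__(self, name):\n        self.name = name",
    "class Admin(User): pass",
]
embeddings = model.encode(snippets)
# Two clusters: arithmetic helpers vs. class definitions (illustrative choice)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]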
Quick Start
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('isuruwijesiri/all-MiniLM-L6-v2-code-search-512')
# Encode code
code_embedding = model.encode("def hello_world():\n print('Hello, World!')")
# Encode natural language description
query_embedding = model.encode("function that prints hello world")
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([query_embedding], [code_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")
Quick Start - JS
Use this model in JavaScript/TypeScript with Transformers.js. It works in both Node.js and browsers.
Installation:
npm install @xenova/transformers
Usage:
import { pipeline } from '@xenova/transformers';
// Load the model (auto-downloads and caches)
const extractor = await pipeline('feature-extraction', 'isuruwijesiri/all-MiniLM-L6-v2-code-search-512', {
quantized: false // false = model.onnx (86MB), true = model_quantized.onnx (22MB)
});
// Generate embeddings
const output = await extractor('def add(a, b): return a + b', {
pooling: 'mean',
normalize: true
});
// Use the embedding
const embedding = Array.from(output.data);
console.log(embedding); // [384] dimensional vector
Code Search Example
from sentence_transformers import SentenceTransformer, util
# Load model
model = SentenceTransformer('isuruwijesiri/all-MiniLM-L6-v2-code-search-512')
# Your code snippets
code_snippets = [
"def calculate_sum(a, b):\n return a + b",
"def find_max(numbers):\n return max(numbers)",
"class User:\n def __init__(self, name):\n self.name = name"
]
# Natural language query
query = "function to add two numbers"
# Encode
query_emb = model.encode(query, convert_to_tensor=True)
code_embs = model.encode(code_snippets, convert_to_tensor=True)
# Find most similar
similarities = util.cos_sim(query_emb, code_embs)[0]
best_match_idx = similarities.argmax().item()
print(f"Best match: {code_snippets[best_match_idx]}")
print(f"Similarity: {similarities[best_match_idx]:.4f}")
Code Search Example - JS
import { pipeline } from '@xenova/transformers';
const extractor = await pipeline('feature-extraction', 'isuruwijesiri/all-MiniLM-L6-v2-code-search-512', {
quantized: false // false = model.onnx (86MB), true = model_quantized.onnx (22MB)
});
// Your code snippets
const codeSnippets = [
"def add(a, b): return a + b",
"def multiply(x, y): return x * y",
"class User: pass"
];
// Search query
const query = "function to add two numbers";
// Get embeddings
const queryEmb = await extractor(query, { pooling: 'mean', normalize: true });
const codeEmbs = await Promise.all(
codeSnippets.map(code => extractor(code, { pooling: 'mean', normalize: true }))
);
// Calculate similarities (dot product of normalized vectors = cosine similarity)
const similarities = codeEmbs.map(codeEmb =>
Array.from(queryEmb.data).reduce((sum, val, i) =>
sum + val * codeEmb.data[i], 0
)
);
console.log(similarities); // [0.87, 0.23, 0.15] - first one is most similar!
Performance
Evaluated on the CodeSearchNet validation set (2,000 samples).
| Metric | Score |
|---|---|
| MRR@10 | 0.9259 |
| Accuracy@1 | 0.9030 |
| Accuracy@3 | 0.9440 |
| Accuracy@5 | 0.9540 |
| Accuracy@10 | 0.9705 |
| Recall@1 | 0.9030 |
| Recall@5 | 0.9540 |
| Recall@10 | 0.9705 |
| NDCG@10 | 0.9367 |
| MAP@100 | 0.9269 |
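For readers unfamiliar with these ranking metrics: Accuracy@k (which coincides with Recall@k here, since each query has a single relevant snippet) is the fraction of queries whose correct snippet appears in the top k, and MRR@10 averages the reciprocal rank of the first correct hit within the top 10, scoring 0 when it ranks lower. A small worked example, independent of the model:
def mrr_at_k(ranks, k=10):
    # ranks: 1-based rank of the correct snippet per query; None if not retrieved
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def accuracy_at_k(ranks, k):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

ranks = [1, 3, None]            # three example queries
print(mrr_at_k(ranks))          # (1 + 1/3 + 0) / 3 ≈ 0.444
print(accuracy_at_k(ranks, 3))  # 2 of 3 queries hit in the top 3 ≈ 0.667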
This model significantly outperforms the base all-MiniLM-L6-v2 on code search tasks:
| Metric | Base Model | Fine-tuned | Improvement |
|---|---|---|---|
| MRR@10 | 0.7759 | 0.9259 | +0.1500 (+19.3%) |
| Accuracy@1 | 0.7175 | 0.9030 | +0.1855 (+25.9%) |
| Accuracy@5 | 0.8555 | 0.9540 | +0.0985 (+11.5%) |
| Recall@10 | 0.8840 | 0.9705 | +0.0865 (+9.8%) |
| NDCG@10 | 0.8022 | 0.9367 | +0.1344 (+16.8%) |
Context Length Configuration
This model was trained with a maximum sequence length of 512 tokens, and max_seq_length is already set to 512 by default, so no manual configuration is required. You can lower it to trade a little accuracy for speed:
model.max_seq_length = 256 # Faster inference for short code
model.max_seq_length = 512 # Default, best accuracy (recommended)
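Because anything beyond max_seq_length is silently truncated, it can be worth checking token counts before encoding long files. A minimal check through the model's tokenizer (standard sentence-transformers attributes; the snippet is a placeholder):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('isuruwijesiri/all-MiniLM-L6-v2-code-search-512')
snippet = "def add(a, b): return a + b"  # substitute your own code
n_tokens = len(model.tokenizer.tokenize(snippet))
if n_tokens > model.max_seq_length:
    print(f"Input will be truncated: {n_tokens} tokens > {model.max_seq_length}")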
Training Details
- Dataset: CodeSearchNet
- Training samples: 1,294,017
- Evaluation samples: 2,000
- Epochs: 3
- Batch size: 192
- Loss function: MultipleNegativesRankingLoss
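The exact training script is not published; a sentence-transformers setup matching these hyperparameters would look roughly like the sketch below, with the single pair shown standing in for the full CodeSearchNet dataset.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model.max_seq_length = 512

# CodeSearchNet pairs of (docstring/query, code); with
# MultipleNegativesRankingLoss, the other pairs in the same batch
# act as in-batch negatives
train_examples = [
    InputExample(texts=["function to add two numbers",
                        "def add(a, b): return a + b"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=192)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)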
Limitations
- Inputs longer than 512 tokens are truncated
- Not suitable for code generation or completion
- Optimized for semantic similarity, not syntactic correctness
- Performance may vary across programming languages
Citation
If you use this model, please cite:
@misc{all_MiniLM_L6_v2_code_search_512,
author = {isuruwijesiri},
title = {all-MiniLM-L6-v2-code-search-512},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/isuruwijesiri/all-MiniLM-L6-v2-code-search-512}
}
License: Apache 2.0