# LocateAnything-3B
LocateAnything is a vision-language model for visual perception tasks including object detection, phrase grounding, text grounding, scene text detection, document layout analysis, GUI grounding, and pointing. It combines a MoonViT vision encoder with a Qwen2.5-3B language model and supports Multi-Token Prediction (MTP) for accelerated inference.
## Model Architecture
| Component | Details |
|---|---|
| Vision Encoder | MoonViT-SO-400M (27 layers, hidden_size=1152, patch_size=14) |
| Language Model | Qwen2.5-3B-Instruct (36 layers, hidden_size=2048) |
| Connector | 2-layer MLP with pixel-shuffle (4x downsample → project to LLM dim) |
| Precision | bfloat16 |
| Max Image Tokens | 4096 |
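
As a back-of-the-envelope check of the token budget, the number of LLM-side image tokens after patchification and the 4x pixel-shuffle downsample can be estimated as follows. This is a rough sketch based on the table above; `approx_image_tokens` is an illustrative helper, and the exact resizing logic lives in `image_processing_eagle3vl.py` and may differ:

```python
import math

def approx_image_tokens(height: int, width: int, patch_size: int = 14, downsample: int = 4) -> int:
    """Estimate LLM-side image tokens: patch-grid size divided by the pixel-shuffle factor."""
    grid_h = math.ceil(height / patch_size)
    grid_w = math.ceil(width / patch_size)
    return (grid_h * grid_w) // downsample

# A 1148x1148 image yields an 82x82 patch grid -> 6724 patches -> ~1681 tokens,
# comfortably under the 4096-token cap.
```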
## File Overview
| File | Description |
|---|---|
| `modeling_eagle3vl.py` | Main model class `Eagle3VLForConditionalGeneration` with the forward pass and MTP/AR generation loop |
| `configuration_eagle3vl.py` | `Eagle3VLConfig` combining vision and text configs, plus special token IDs |
| `modeling_vit.py` | MoonViT vision encoder with 2D RoPE, flash attention, and patch merger |
| `modeling_qwen2.py` | Modified Qwen2 language model with MTP-aware attention masking |
| `generate_utils.py` | Token sampling, bounding-box decoding (top-k weighted average), and pattern-matching utilities |
| `processing_eagle3vl.py` | `Eagle3VLProcessor` for preprocessing images/videos and building chat-formatted prompts |
| `image_processing_eagle3vl.py` | `Eagle3VLImageProcessor` for image rescaling, normalization, and patchification |
| `mask_sdpa_utils.py` | SDPA attention-mask utilities for MTP block-diffusion masking |
| `mask_magi_utils.py` | MagiAttention range-based mask builder for efficient MTP attention |
| `attn_mask_utils.py` | Lower-level 2D/4D attention-mask construction helpers |
| `configuration_qwen2.py` | Extended `Qwen2Config` with MTP-specific fields (block_size, mask tokens, etc.) |
## Supported Tasks & Prompt Templates
| Task | Output | Prompt Template |
|---|---|---|
| Object Detection | Box | Locate all the instances that match the following description: [CATEGORIES]. |
| Phrase Grounding (single) | Single Box | Locate a single instance that matches the following description: [PHRASE]. |
| Phrase Grounding (multi) | Multiple Boxes | Locate all the instances that match the following description: [PHRASE]. |
| Text Grounding | Box | Please locate the text referred as [PHRASE]. |
| Scene Text Detection | Box | Detect all the text in box format. |
| Document Layout Analysis | Box | Detect all the objects in the image that belong to the category set: [CATEGORIES]. |
| GUI Grounding (box) | Box | Locate the region that matches the following description: [PHRASE]. |
| GUI Grounding (point) | Point | Point to: [PHRASE]. |
| Pointing | Point | Point to: [PHRASE]. |
Notes:

- `[PHRASE]` is a free-form natural-language description (e.g., "the red car on the left").
- `[CATEGORIES]` is a list of category names separated by the `</c>` token (e.g., `person</c>car</c>dog`), as used in the code examples below.
- Box output format: `<box><x1><y1><x2><y2></box>`, where coordinates are in the range `[0, 1000]`.
- Point output format: `<box><x><y></box>`.
- Empty detection result: `<box>none</box>`.
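
To make the templates concrete, a detection prompt can be assembled like this (`detection_prompt` is an illustrative helper, not part of the released code; the `</c>` separator matches the detection examples later in this card):

```python
def detection_prompt(categories: list[str]) -> str:
    """Join category names with the </c> separator and fill the detection template."""
    joined = "</c>".join(categories)
    return f"Locate all the instances that match the following description: {joined}."

print(detection_prompt(["person", "car", "dog"]))
# -> Locate all the instances that match the following description: person</c>car</c>dog.
```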
## Quick Start

### Installation

```bash
pip install transformers torch torchvision pillow peft flash-attn
```

### Basic Inference
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "nvidia/LocateAnything-3B"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

processor = model.processor if hasattr(model, "processor") else None
# If the processor is not bundled with the model, load it separately:
if processor is None:
    from transformers import AutoProcessor
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")

# --- Object Detection ---
question = "Locate all the instances that match the following description: person</c>car</c>dog."
# --- Phrase Grounding ---
# question = "Locate all the instances that match the following description: the red car."
# --- Pointing ---
# question = "Point to: the stop sign."

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]}
]
text = processor.py_apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = processor.process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, return_tensors="pt").to("cuda")

pixel_values = inputs["pixel_values"].to(torch.bfloat16)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
image_grid_hws = inputs.get("image_grid_hws", None)

response = model.generate(
    pixel_values=pixel_values,
    input_ids=input_ids,
    attention_mask=attention_mask,
    image_grid_hws=image_grid_hws,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    use_cache=True,
    generation_mode="hybrid",  # "fast" (MTP only) | "slow" (AR only) | "hybrid" (MTP + AR fallback)
    temperature=0,
)

if isinstance(response, tuple):
    answer, sampling_history, stats = response
    print(f"Answer: {answer}")
    print(stats)
else:
    print(f"Answer: {response}")
```
## Worker
Below is a self-contained worker script that loads the model once and serves queries via a simple function interface. You can integrate it with FastAPI, gRPC, or any serving framework.
"""
locateanything_worker.py - A reusable worker for LocateAnything inference.
"""
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor
class LocateAnythingWorker:
"""Stateful worker that loads the model once and serves perception queries."""
def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
self.device = device
self.dtype = dtype
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
self.model = AutoModel.from_pretrained(
model_path,
torch_dtype=dtype,
trust_remote_code=True,
).to(device).eval()
@torch.no_grad()
def predict(
self,
image: Image.Image,
question: str,
generation_mode: str = "hybrid",
max_new_tokens: int = 2048,
temperature: float = 0.7,
verbose: bool = True,
) -> dict:
"""
Run a single perception query.
Args:
image: PIL Image (RGB).
question: The task prompt (see supported prompts in README).
generation_mode: "fast" | "slow" | "hybrid".
max_new_tokens: Maximum tokens to generate.
temperature: Sampling temperature (0 = greedy).
verbose: If True, return timing statistics.
Returns:
dict with keys: "answer", "stats" (optional), "history" (optional).
"""
messages = [
{"role": "user", "content": [
{"type": "image", "image": image},
{"type": "text", "text": question},
]}
]
text = self.processor.py_apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
images, videos = self.processor.process_vision_info(messages)
inputs = self.processor(
text=[text], images=images, videos=videos, return_tensors="pt"
).to(self.device)
pixel_values = inputs["pixel_values"].to(self.dtype)
input_ids = inputs["input_ids"]
image_grid_hws = inputs.get("image_grid_hws", None)
response = self.model.generate(
pixel_values=pixel_values,
input_ids=input_ids,
attention_mask=inputs["attention_mask"],
image_grid_hws=image_grid_hws,
tokenizer=self.tokenizer,
max_new_tokens=max_new_tokens,
use_cache=True,
generation_mode=generation_mode,
temperature=temperature,
do_sample=True,
top_p=0.9,
repetition_penalty=1.1,
verbose=verbose,
)
result = {"answer": response[0] if isinstance(response, tuple) else response}
if isinstance(response, tuple) and len(response) >= 3:
result["history"] = response[1]
result["stats"] = response[2]
return result
# ---- Convenience methods for each task ----
def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
"""Object Detection / Document Layout Analysis."""
cats = "</c>".join(categories)
prompt = f"Locate all the instances that matches the following description: {cats}."
return self.predict(image, prompt, **kwargs)
def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
"""Phrase Grounding - single instance."""
prompt = f"Locate a single instance that matches the following description: {phrase}."
return self.predict(image, prompt, **kwargs)
def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
"""Phrase Grounding - multiple instances."""
prompt = f"Locate all the instances that match the following description: {phrase}."
return self.predict(image, prompt, **kwargs)
def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
"""Text Grounding."""
prompt = f"Please locate the text referred as {phrase}."
return self.predict(image, prompt, **kwargs)
def detect_text(self, image: Image.Image, **kwargs) -> dict:
"""Scene Text Detection."""
prompt = "Detect all the text in box format."
return self.predict(image, prompt, **kwargs)
def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
"""GUI Grounding (box or point)."""
if output_type == "point":
prompt = f"Point to: {phrase}."
else:
prompt = f"Locate the region that matches the following description: {phrase}."
return self.predict(image, prompt, **kwargs)
def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
"""Pointing."""
prompt = f"Point to: {phrase}."
return self.predict(image, prompt, **kwargs)
# --------------- Usage Example ---------------
if __name__ == "__main__":
worker = LocateAnythingWorker("woshichaoren123/test001")
img = Image.open("example.jpg").convert("RGB")
# Object Detection
result = worker.detect(img, ["person", "car", "bicycle"])
print("Detection:", result["answer"])
# Phrase Grounding (multiple)
result = worker.ground_multi(img, "people wearing red shirts")
print("Grounding:", result["answer"])
# Scene Text Detection
result = worker.detect_text(img)
print("Text Detection:", result["answer"])
# Pointing
result = worker.point(img, "the traffic light")
print("Pointing:", result["answer"])
# GUI Grounding (point)
result = worker.ground_gui(img, "the search button", output_type="point")
print("GUI Point:", result["answer"])
## Output Format
The model outputs special tokens to represent bounding boxes and points:
- **Bounding box**: `<ref>label</ref><box><x1><y1><x2><y2></box>` where coordinates are integers in `[0, 1000]` representing normalized positions (divide by 1000 to get relative coordinates).
- **Point**: `<box><x><y></box>`
- **Empty / No object**: `<box>none</box>`
- Multiple objects are separated by consecutive `<ref>...<box>...</box>` sequences.
### Coordinate Conversion
```python
import re

def parse_boxes(answer: str, image_width: int, image_height: int):
    """Parse model output into pixel-coordinate bounding boxes."""
    boxes = []
    for match in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
        x1, y1, x2, y2 = (int(g) for g in match.groups())
        boxes.append({
            "x1": x1 / 1000 * image_width,
            "y1": y1 / 1000 * image_height,
            "x2": x2 / 1000 * image_width,
            "y2": y2 / 1000 * image_height,
        })
    return boxes
```
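
Point outputs can be handled analogously; the sketch below mirrors the box parser (`parse_points` is an illustrative helper, not part of the released code). Note that `<box>none</box>` matches neither pattern, so empty results simply yield an empty list:

```python
import re

def parse_points(answer: str, image_width: int, image_height: int):
    """Parse point outputs (<box><x><y></box>) into pixel coordinates."""
    points = []
    for match in re.finditer(r"<box><(\d+)><(\d+)></box>", answer):
        x, y = (int(g) for g in match.groups())
        points.append({"x": x / 1000 * image_width, "y": y / 1000 * image_height})
    return points
```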
## Generation Modes
| Mode | Description | Speed | Accuracy |
|---|---|---|---|
| `fast` | MTP only, never falls back to AR | Fastest | Good for simple scenes |
| `slow` | Pure auto-regressive decoding | Slowest | Most robust |
| `hybrid` (default) | MTP first, falls back to AR on uncertain boxes, switches back after the box boundary | Balanced | Best overall |
## Citation
```bibtex
@article{locateanything,
  title={LocateAnything: A Multi-Token Prediction Vision-Language Model for Perception},
  year={2025},
  institution={NVIDIA}
}
```
## License
This project is licensed under the MIT License.