# LocateAnything-3B

LocateAnything is a vision-language model for visual perception tasks including object detection, phrase grounding, text grounding, scene text detection, document layout analysis, GUI grounding, and pointing. It combines a MoonViT vision encoder with a Qwen2.5-3B language model and supports Multi-Token Prediction (MTP) for accelerated inference.

## Model Architecture

| Component | Details |
|---|---|
| Vision Encoder | MoonViT-SO-400M (27 layers, hidden_size=1152, patch_size=14) |
| Language Model | Qwen2.5-3B-Instruct (36 layers, hidden_size=2048) |
| Connector | 2-layer MLP with pixel-shuffle (4x downsample → project to LLM dim) |
| Precision | bfloat16 |
| Max Image Tokens | 4096 |
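The connector's pixel-shuffle step can be illustrated in plain Python: each 2x2 neighborhood of vision patch features is merged into one token by concatenating the four feature vectors, so the token count drops 4x while the channel width grows 4x (1152 → 4608) before the MLP projects to the LLM hidden size of 2048. This is a minimal sketch; the concatenation order is an illustrative assumption, not the model's exact implementation.

```python
def pixel_shuffle_merge(grid):
    """Merge each 2x2 patch neighborhood into a single token by concatenating
    the four feature vectors: (h, w, c) -> (h/2, w/2, 4c)."""
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "patch grid must have even sides"
    merged = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            # Concatenation order (top-left, top-right, bottom-left,
            # bottom-right) is an assumption for illustration.
            row.append(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1])
        merged.append(row)
    return merged

# A 2x2 grid of 1-dim features collapses to a single 4-dim token.
print(pixel_shuffle_merge([[[1], [2]], [[3], [4]]]))  # -> [[[1, 2, 3, 4]]]
```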

## File Overview

| File | Description |
|---|---|
| `modeling_eagle3vl.py` | Main model class `Eagle3VLForConditionalGeneration` with the forward pass and MTP/AR generation loop |
| `configuration_eagle3vl.py` | `Eagle3VLConfig` combining vision and text configs, plus special token IDs |
| `modeling_vit.py` | MoonViT vision encoder with 2D RoPE, flash attention, and patch merger |
| `modeling_qwen2.py` | Modified Qwen2 language model with MTP-aware attention masking |
| `generate_utils.py` | Token sampling, bounding-box decoding (top-k weighted average), and pattern-matching utilities |
| `processing_eagle3vl.py` | `Eagle3VLProcessor` for preprocessing images/videos and building chat-formatted prompts |
| `image_processing_eagle3vl.py` | `Eagle3VLImageProcessor` for image rescaling, normalization, and patchification |
| `mask_sdpa_utils.py` | SDPA attention-mask utilities for MTP block-diffusion masking |
| `mask_magi_utils.py` | MagiAttention range-based mask builder for efficient MTP attention |
| `attn_mask_utils.py` | Lower-level 2D/4D attention-mask construction helpers |
| `configuration_qwen2.py` | Extended `Qwen2Config` with MTP-specific fields (block_size, mask tokens, etc.) |

## Supported Tasks & Prompt Templates

| Task | Output | Prompt Template |
|---|---|---|
| Object Detection | Box | Locate all the instances that match the following description: [CATEGORIES]. |
| Phrase Grounding (single) | Single Box | Locate a single instance that matches the following description: [PHRASE]. |
| Phrase Grounding (multi) | Multiple Boxes | Locate all the instances that match the following description: [PHRASE]. |
| Text Grounding | Box | Please locate the text referred as [PHRASE]. |
| Scene Text Detection | Box | Detect all the text in box format. |
| Document Layout Analysis | Box | Detect all the objects in the image that belong to the category set: [CATEGORIES]. |
| GUI Grounding (box) | Box | Locate the region that matches the following description: [PHRASE]. |
| GUI Grounding (point) | Point | Point to: [PHRASE]. |
| Pointing | Point | Point to: [PHRASE]. |

Notes:

- [PHRASE] is a free-form natural language description (e.g., "the red car on the left").
- [CATEGORIES] is a list of category names; in prompts the names are joined with the `</c>` separator token (e.g., person</c>car</c>dog), as in the code examples below.
- Box output format: `<box><x1><y1><x2><y2></box>`, with coordinates in the range [0, 1000].
- Point output format: `<box><x><y></box>`.
- Empty detection result: `<box>none</box>`.
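The templates can also be filled programmatically. A minimal illustrative helper (the dict and function names are not part of the model's API; the template strings are copied verbatim from the table above):

```python
# Prompt templates copied verbatim from the task table; helper names are illustrative.
TEMPLATES = {
    "detect": "Locate all the instances that match the following description: {}.",
    "ground_single": "Locate a single instance that matches the following description: {}.",
    "point": "Point to: {}.",
}

def build_prompt(task: str, argument: str) -> str:
    """Fill a task template with a phrase or a category string."""
    return TEMPLATES[task].format(argument)

print(build_prompt("point", "the stop sign"))  # Point to: the stop sign.
```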

## Quick Start

### Installation

```shell
pip install transformers torch torchvision pillow peft flash-attn
```

### Basic Inference

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, AutoTokenizer

model_path = "nvidia/LocateAnything-3B"

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Use the bundled processor if present; otherwise load it separately.
processor = getattr(model, "processor", None)
if processor is None:
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")

# Pick one task prompt:
# Object Detection:
#   question = "Locate all the instances that match the following description: person</c>car</c>dog."
# Pointing:
#   question = "Point to: the stop sign."
question = "Locate all the instances that match the following description: the red car."  # Phrase Grounding

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]}
]

text = processor.py_apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = processor.process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, return_tensors="pt").to("cuda")

pixel_values = inputs["pixel_values"].to(torch.bfloat16)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
image_grid_hws = inputs.get("image_grid_hws", None)

response = model.generate(
    pixel_values=pixel_values,
    input_ids=input_ids,
    attention_mask=attention_mask,
    image_grid_hws=image_grid_hws,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    use_cache=True,
    generation_mode="hybrid",   # "fast" (MTP only) | "slow" (AR only) | "hybrid" (MTP + AR fallback)
    temperature=0,
)

# In verbose settings `generate` may return (answer, sampling_history, stats).
if isinstance(response, tuple):
    answer, sampling_history, stats = response
    print(f"Answer: {answer}")
    print(stats)
else:
    print(f"Answer: {response}")
```

## Worker

Below is a self-contained worker script that loads the model once and serves queries via a simple function interface. You can integrate it with FastAPI, gRPC, or any serving framework.

"""
locateanything_worker.py - A reusable worker for LocateAnything inference.
"""
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor


class LocateAnythingWorker:
    """Stateful worker that loads the model once and serves perception queries."""

    def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
        self.device = device
        self.dtype = dtype

        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=dtype,
            trust_remote_code=True,
        ).to(device).eval()

    @torch.no_grad()
    def predict(
        self,
        image: Image.Image,
        question: str,
        generation_mode: str = "hybrid",
        max_new_tokens: int = 2048,
        temperature: float = 0.7,
        verbose: bool = True,
    ) -> dict:
        """
        Run a single perception query.

        Args:
            image: PIL Image (RGB).
            question: The task prompt (see supported prompts in README).
            generation_mode: "fast" | "slow" | "hybrid".
            max_new_tokens: Maximum tokens to generate.
            temperature: Sampling temperature (0 = greedy).
            verbose: If True, return timing statistics.

        Returns:
            dict with keys: "answer", "stats" (optional), "history" (optional).
        """
        messages = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]}
        ]

        text = self.processor.py_apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        images, videos = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text], images=images, videos=videos, return_tensors="pt"
        ).to(self.device)

        pixel_values = inputs["pixel_values"].to(self.dtype)
        input_ids = inputs["input_ids"]
        image_grid_hws = inputs.get("image_grid_hws", None)

        response = self.model.generate(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"],
            image_grid_hws=image_grid_hws,
            tokenizer=self.tokenizer,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            generation_mode=generation_mode,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            verbose=verbose,
        )

        result = {"answer": response[0] if isinstance(response, tuple) else response}
        if isinstance(response, tuple) and len(response) >= 3:
            result["history"] = response[1]
            result["stats"] = response[2]
        return result

    # ---- Convenience methods for each task ----

    def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
        """Object Detection / Document Layout Analysis."""
        cats = "</c>".join(categories)
        prompt = f"Locate all the instances that matches the following description: {cats}."
        return self.predict(image, prompt, **kwargs)

    def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase Grounding - single instance."""
        prompt = f"Locate a single instance that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase Grounding - multiple instances."""
        prompt = f"Locate all the instances that match the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Text Grounding."""
        prompt = f"Please locate the text referred as {phrase}."
        return self.predict(image, prompt, **kwargs)

    def detect_text(self, image: Image.Image, **kwargs) -> dict:
        """Scene Text Detection."""
        prompt = "Detect all the text in box format."
        return self.predict(image, prompt, **kwargs)

    def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
        """GUI Grounding (box or point)."""
        if output_type == "point":
            prompt = f"Point to: {phrase}."
        else:
            prompt = f"Locate the region that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Pointing."""
        prompt = f"Point to: {phrase}."
        return self.predict(image, prompt, **kwargs)


# --------------- Usage Example ---------------
if __name__ == "__main__":
    worker = LocateAnythingWorker("woshichaoren123/test001")
    img = Image.open("example.jpg").convert("RGB")

    # Object Detection
    result = worker.detect(img, ["person", "car", "bicycle"])
    print("Detection:", result["answer"])

    # Phrase Grounding (multiple)
    result = worker.ground_multi(img, "people wearing red shirts")
    print("Grounding:", result["answer"])

    # Scene Text Detection
    result = worker.detect_text(img)
    print("Text Detection:", result["answer"])

    # Pointing
    result = worker.point(img, "the traffic light")
    print("Pointing:", result["answer"])

    # GUI Grounding (point)
    result = worker.ground_gui(img, "the search button", output_type="point")
    print("GUI Point:", result["answer"])


## Output Format

The model outputs special tokens to represent bounding boxes and points:

- **Bounding box**: `<ref>label</ref><box><x1><y1><x2><y2></box>` where coordinates are integers in `[0, 1000]` representing normalized positions (divide by 1000 to get relative coordinates).
- **Point**: `<box><x><y></box>`
- **Empty / No object**: `<box>none</box>`
- Multiple objects are emitted as consecutive `<ref>...</ref><box>...</box>` sequences.
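The point format (optionally preceded by a `<ref>` label) can be parsed with a small regex. A sketch assuming the token spellings listed above:

```python
import re

# Matches an optional <ref>label</ref> followed by a two-coordinate point box.
# A four-coordinate bounding box does not match, because </box> must follow
# immediately after the second coordinate.
POINT_RE = re.compile(r"(?:<ref>(.*?)</ref>)?<box><(\d+)><(\d+)></box>")

def parse_points(answer: str, image_width: int, image_height: int):
    """Parse point outputs into pixel coordinates; returns [] for <box>none</box>."""
    points = []
    for label, x, y in POINT_RE.findall(answer):
        points.append({
            "label": label or None,
            "x": int(x) / 1000 * image_width,   # coordinates are normalized to [0, 1000]
            "y": int(y) / 1000 * image_height,
        })
    return points
```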

### Coordinate Conversion

```python
def parse_boxes(answer: str, image_width: int, image_height: int):
    """Parse model output into pixel-coordinate bounding boxes."""
    import re
    boxes = []
    for match in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
        x1, y1, x2, y2 = [int(g) for g in match.groups()]
        boxes.append({
            "x1": x1 / 1000 * image_width,
            "y1": y1 / 1000 * image_height,
            "x2": x2 / 1000 * image_width,
            "y2": y2 / 1000 * image_height,
        })
    return boxes
```

## Generation Modes

| Mode | Description | Speed | Accuracy |
|---|---|---|---|
| `fast` | MTP only; never falls back to AR | Fastest | Good for simple scenes |
| `slow` | Pure auto-regressive decoding | Slowest | Most robust |
| `hybrid` (default) | MTP first; falls back to AR on uncertain boxes, switches back after the box boundary | Balanced | Best overall |
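The hybrid fallback can be sketched as a toy control loop. Every callable below is a stub standing in for the model's internals, not part of the real API:

```python
def hybrid_generate(mtp_step, ar_step, is_box_boundary, confident, max_tokens):
    """Toy sketch of the hybrid mode: decode MTP blocks while confident,
    drop to token-by-token AR inside an uncertain box, and resume MTP
    once the box boundary is emitted."""
    tokens, mode = [], "mtp"
    while len(tokens) < max_tokens:
        if mode == "mtp":
            block = mtp_step(tokens)        # propose a multi-token block
            if block and confident(block):
                tokens += block
            else:
                mode = "ar"                  # uncertain box: fall back to AR
        else:
            tok = ar_step(tokens)            # one autoregressive token
            tokens.append(tok)
            if is_box_boundary(tok):
                mode = "mtp"                 # box closed: switch back to MTP
    return tokens
```

Real generation additionally stops on EOS and tracks per-box confidence from the sampling distribution; this loop only shows the mode-switching structure.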

## Citation

```bibtex
@article{locateanything,
  title={LocateAnything: A Multi-Token Prediction Vision-Language Model for Perception},
  year={2025},
  institution={NVIDIA}
}
```

## License

This project is licensed under the MIT License.
