Salesforce/CoDA-v0-Base

Text Generation

feature-extraction

text diffusion model

code generation

hlnchen committed on Sep 26

Commit aca1325 · verified · 1 Parent(s): 8ae6c69

Update README.md

Files changed (1)

README.md +82 -3

@@ -1,3 +1,82 @@
----
-license: cc-by-nc-4.0
----

+---
+license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: text-generation
+tags:
+- diffusion
+- text generation
+- code generation
+---
+# CoDA-v0-Base
+## Overview
+CoDA is an open, lightweight, diffusion-based language model from Salesforce AI Research.
+- Technical Report (coming soon)
+- [Code](https://github.com/SalesforceAIResearch/CoDA/)
+## Requirements
+```
+torch==2.8.0
+transformers>=4.47.1
+flash-attn==2.8.3
+```
+## Quickstart
+The following snippet loads the model and tokenizer, randomly masks part of a finished code sample, and runs diffusion generation to unmask it.
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+model_name = "Salesforce/CoDA-v0-Base"
+device = "cuda"
+model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model.eval()  # switch to inference mode
+prompt = """```python
+from typing import List
+class Solution:
+    def twoSum(self, nums: List[int], target: int) -> List[int]:
+        # Create a dictionary to store the numbers and their indices
+        num_to_index = {}
+        # Iterate over the list of numbers
+        for index, num in enumerate(nums):
+            # Calculate the complement
+            complement = target - num
+            # Check if the complement is already in the dictionary
+            if complement in num_to_index:
+                # If found, return the indices of the complement and the current number
+                return [num_to_index[complement], index]
+            # Otherwise, add the current number and its index to the dictionary
+            num_to_index[num] = index
+```"""
+input_ids = tokenizer.encode(prompt, return_tensors="pt")
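+mask = torch.rand(input_ids.shape) < 0.4  # select each position independently with probability 0.4
+masked_input_ids = input_ids.clone()
+masked_input_ids[mask] = tokenizer.mask_token_id  # replace the selected positions with the mask token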
+generated_ids = model.diffusion_generate(
+    inputs=masked_input_ids.to(model.device),
+    max_new_tokens=1,
+    steps=128,
+    top_p=0.95,
+    temperature=0.2,
+    alg="entropy",
+    alg_temp=0.2,
+)
+generated_ids = [
+    output_ids[:-1] for output_ids in generated_ids  # drop the last token of each sequence
+]
+unmasked_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+```
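
The masking step in the Quickstart can be tried without a GPU or the model weights. The sketch below is a plain-Python illustration of the same idea as `torch.rand(input_ids.shape) < 0.4`: each position is independently replaced by a mask id with a fixed probability. The helper `mask_tokens`, the placeholder `MASK_ID`, and the toy token ids are hypothetical and are not part of the CoDA code; the real mask id comes from `tokenizer.mask_token_id`.

```python
import random


def mask_tokens(token_ids, mask_id, mask_prob=0.4, seed=0):
    """Return a copy of token_ids where each position is independently
    replaced by mask_id with probability mask_prob.

    Hypothetical helper for illustration only; not part of the CoDA API.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [mask_id if rng.random() < mask_prob else tok for tok in token_ids]


# Toy stand-ins for real tokenizer output (assumed values).
MASK_ID = -1
token_ids = list(range(100, 120))

masked = mask_tokens(token_ids, MASK_ID)
num_masked = sum(1 for tok in masked if tok == MASK_ID)
print(f"masked {num_masked} of {len(token_ids)} positions")
```

Because every position is sampled independently, the number of masked tokens varies from run to run unless a seed is fixed; 0.4 is the expected fraction, not an exact one.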