Add better infilling documentation
README.md
@@ -40,7 +40,7 @@ CarperAI will be releasing larger LMs better tuned for code in the near future,

@@ -105,27 +105,59 @@ language model output is generated after \<MID\> token!
| \\(n_{heads}\\) | 16 |
| \\(d_{head}\\) | 128 |
| \\(n_{ctx}\\) | 2048 |
| \\(n_{vocab}\\) | 50280 |
| Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
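As a quick sanity check on the table above, the hidden size these settings imply can be computed, assuming the standard \\(d_{model} = n_{heads} \cdot d_{head}\\) relationship for GPT-NeoX-style models (an assumption here, since \\(d_{model}\\) is not listed in the table):

```
n_heads = 16
d_head = 128
d_model = n_heads * d_head  # standard transformer relationship (assumed)
print(d_model)  # 2048
```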
As a concrete example, here is a code snippet that should allow a model to perform infilling:

There was an issue where the sentinel `<|SUF|>`, `<|PRE|>`, and `<|MID|>` tokens did not have the correct ids in the uploaded tokenizer and model card! Please try clearing the Hugging Face cache and redownloading the model. :)
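The suffix-first ordering these sentinel tokens encode can be sketched as a small helper (`build_fim_input` is an illustrative name, not part of the released code; the ids are those of this tokenizer):

```
SUF, PRE, MID = 50277, 50278, 50279  # sentinel token ids in this tokenizer

def build_fim_input(prefix_ids, suffix_ids):
    # suffix-first ordering: <|SUF|> suffix <|PRE|> prefix <|MID|>
    # the model then generates the middle after the <|MID|> token
    return [SUF, *suffix_ids, PRE, *prefix_ids, MID]

print(build_fim_input([1, 2], [3, 4]))  # [50277, 3, 4, 50278, 1, 2, 50279]
```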
Here is a minimal example of performing open-ended generation with this model, on a simple function `score(x, y)`:

```
def score(x, y) -> int:
    """
```
and also infilling with the function and the end of the docstring already placed:

```
def score(x, y) -> int:
    """
    <|MID|> (infill here)
    """

    score = x + y
    return score
```
```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# infilling demo: <|SUF|> = 50277, <|PRE|> = 50278, <|MID|> = 50279
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n score = x + y\n return score'

model_input = [50277, *tok(suffix)["input_ids"], 50278, *tok(prefix)["input_ids"], 50279]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=40)[0])

print(output)
```

outputs: `'<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'`
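To recover just the generated middle from a decoded output like the one above, one can split on the sentinel strings; a minimal sketch (`extract_middle` is a hypothetical helper, not part of the model card):

```
def extract_middle(decoded):
    # keep only the text between <|MID|> and <|endoftext|>
    middle = decoded.split("<|MID|>", 1)[1]
    return middle.split("<|endoftext|>", 1)[0]

decoded = '<|SUF|>"""\n\n score = x + y\n return score<|PRE|>def score(x, y) -> int:\n"""\n<|MID|> score(x, y) -> int\n<|endoftext|>'
print(extract_middle(decoded))  # -> ' score(x, y) -> int\n'
```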
```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("CarperAI/FIM-NeoX-1.3B")
tok = AutoTokenizer.from_pretrained("CarperAI/FIM-NeoX-1.3B")

# non-infilling demo
prefix = 'def score(x, y) -> int:\n"""\n'
model_input = [*tok(prefix)["input_ids"]]
output = tok.decode(model.generate(torch.IntTensor(model_input).unsqueeze(0), max_length=100)[0])
print(output)
```

outputs: `'def score(x, y) -> int:\n"""\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y_list))\n\ndef get_point_score(x, y) -> int:\n """\n Return the score of the given point.\n """\n return sum(x * y for x, y in zip(x_list, y'`
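Splicing a generated middle back between the original prefix and suffix yields the completed snippet; a minimal sketch, using the middle produced by the infilling example above:

```
prefix = 'def score(x, y) -> int:\n"""\n'
suffix = '"""\n\n score = x + y\n return score'
middle = ' score(x, y) -> int\n'  # middle recovered from the infilling output

completed = prefix + middle + suffix  # prefix + infilled middle + suffix
print(completed)
```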
The sentinel tokens are now accessible via `tokenizer.decode(50277) = "<|SUF|>"`, `tokenizer.decode(50278) = "<|PRE|>"`, and `tokenizer.decode(50279) = "<|MID|>"`.
## Intended Uses and Limitations