<!-- livebook:{"app_settings":{"access_type":"public","auto_shutdown_ms":5000,"multi_session":true,"output_type":"rich","show_source":true,"slug":"tokenizer-generator"}} -->

# Tokenizer generator

```elixir
Mix.install([
  {:kino, "~> 0.10.0"},
  {:req, "~> 0.4.3"}
])
```

## Info

```elixir
Kino.Markdown.new("""
## Background

HuggingFace repositories store tokenizers in two flavours:

1. "slow tokenizer" - corresponds to a tokenizer implemented in Python
   and stored as `tokenizer_config.json`

2. "fast tokenizer" - corresponds to a tokenizer implemented in Rust
   and stored as `tokenizer.json`

Many repositories only include files for 1., but the `transformers` library
automatically converts a "slow tokenizer" to a "fast tokenizer" whenever possible.

Bumblebee relies on the Rust bindings and therefore always requires the
`tokenizer.json` file. This app generates that file for any repository that
only ships a "slow tokenizer".
""")
```

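For context, `tokenizer.json` is exactly what Bumblebee looks for when loading a tokenizer. A minimal sketch of that loading path, assuming a notebook that also installs `bumblebee` (not among this app's dependencies) and using `bert-base-uncased` purely as an example repository:

```elixir
# Assumes {:bumblebee, "~> 0.4"} and its dependencies are installed.
# Loading fetches tokenizer.json from the repository, so a repo that only
# ships a "slow tokenizer" fails here until the file is added.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

Bumblebee.apply_tokenizer(tokenizer, "Hello world")
```
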
## Generator

```elixir
Kino.Markdown.new("## Converter")
```

```elixir
{version, 0} =
  System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])

Kino.Markdown.new("""
`transformers: #{version}`
""")
```

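The slow-to-fast conversion in `transformers` also relies on the `tokenizers` Python package. A similar check could be added for that package; this cell is a sketch assuming `tokenizers` is importable from the same Python environment:

```elixir
# Assumes the `tokenizers` package is installed alongside `transformers`.
{tokenizers_version, 0} =
  System.cmd("python", ["-c", "import tokenizers; print(tokenizers.__version__, end='')"])

Kino.Markdown.new("""
`tokenizers: #{tokenizers_version}`
""")
```
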
```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```

```elixir
repo = Kino.Input.read(repo_input)

if repo == "" do
  Kino.interrupt!(:normal, "Enter repository.")
end
```

```elixir
response =
  Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
    json: %{paths: ["tokenizer.json"]}
  )

case response do
  %{status: 200, body: []} ->
    :ok

  %{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")

  _ ->
    Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```

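The request above is unauthenticated, which is why private repositories fall into the last branch. A minimal sketch of how the same call could carry a Hugging Face access token, assuming it is provided via an `HF_TOKEN` environment variable (both the variable name and this flow are assumptions, and the Python conversion step below would need the token as well):

```elixir
# Hypothetical: attach a bearer token so private repositories can be queried.
token = System.get_env("HF_TOKEN")

opts = [json: %{paths: ["tokenizer.json"]}]
opts = if token, do: [auth: {:bearer, token}] ++ opts, else: opts

Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main", opts)
```
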
```elixir
output_dir = Path.join(System.tmp_dir!(), repo)
```

````elixir
script = """
import sys

from transformers import AutoTokenizer

repo = sys.argv[1]
output_dir = sys.argv[2]

try:
    tokenizer = AutoTokenizer.from_pretrained(repo)
    assert tokenizer.is_fast
    tokenizer.save_pretrained(output_dir)
except Exception as error:
    print(error)
    exit(1)
"""

case System.cmd("python", ["-c", script, repo, output_dir]) do
  {_, 0} ->
    :ok

  {output, _} ->
    Kino.Markdown.new("""
    ```
    #{output}
    ```
    """)
    |> Kino.render()

    Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````

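Before exposing the file for download, a small sanity check could confirm that the conversion actually wrote `tokenizer.json`. This cell is an addition, not part of the original app:

```elixir
# Hypothetical guard: stop with an error if the expected file is missing.
unless File.exists?(Path.join(output_dir, "tokenizer.json")) do
  Kino.interrupt!(:error, "Conversion finished but tokenizer.json was not found.")
end
```
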
```elixir
tokenizer_path = Path.join(output_dir, "tokenizer.json")

Kino.Download.new(
  fn -> File.read!(tokenizer_path) end,
  filename: "tokenizer.json",
  label: "tokenizer.json"
)
```

`````elixir
Kino.Markdown.new("""
### Next steps

1. Go to https://huggingface.co/#{repo}/upload/main.
2. Upload the `tokenizer.json` file.
3. Add the following description:

   ````markdown
   Generated with:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("#{repo}")
   assert tokenizer.is_fast
   tokenizer.save_pretrained("...")
   ```
   ````

4. Submit the PR.
""")
`````