jaco-bro/tokenizer
BPE tokenizer for LLMs in Zig
A Zig library for tokenizing text using PCRE2 regular expressions, now also available as a Python package via pip.
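As a rough illustration of the idea (not this library's implementation), the sketch below splits text into pieces with a regular expression and then applies BPE merges inside each piece. The pattern, the merge table, and the function names are all hypothetical; real models ship their own pattern and merges in tokenizer.json, typically operate on bytes rather than characters, and map the resulting pieces to integer ids.

```python
import re

# Simplified pre-tokenization pattern (an assumption for this sketch; real
# models define their own, usually Unicode-aware, pattern).
PRETOKENIZE = re.compile(r"\s?\w+|\s?[^\w\s]+|\s+")

# Toy merge table: pair -> rank, lower rank merges first (hypothetical values).
MERGES = {("l", "l"): 0, ("h", "e"): 1, ("ll", "o"): 2, ("he", "llo"): 3}

NO_MERGE = float("inf")


def bpe(piece: str) -> list[str]:
    """Repeatedly merge the lowest-ranked adjacent pair until none applies."""
    symbols = list(piece)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs missing from the table rank as infinity.
        ranked = [
            (MERGES.get(pair, NO_MERGE), i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(ranked)
        if rank == NO_MERGE:
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def tokenize(text: str) -> list[str]:
    """Split text with the regex, then apply BPE merges inside each piece."""
    return [sym for piece in PRETOKENIZE.findall(text) for sym in bpe(piece)]


print(tokenize("hello world"))
```

Because merges never cross the boundaries produced by the regular expression, the pattern controls which character sequences can ever become a single token.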
Requires Zig v0.13.0.
git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast
zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
zig build run -- [--model MODEL_NAME] COMMAND INPUT
zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"
The tokenizer can also be installed with pip for use from Python:
pip install tokenizerz
python
Usage:
>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
File 'Qwen2.5-Coder-1.5B-4bit/tokenizer.json' already exists. Skipping download.
All files already exist. No download needed.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'
>>> exit()
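The session above encodes a string and decodes it back. A small helper can turn that into a reusable round-trip check. This is a minimal sketch: `check_roundtrip` and `ByteTokenizer` are hypothetical names, and the only assumption about the tokenizer is that it exposes `encode(text)` and `decode(ids)` as in the session above. `ByteTokenizer` is a stand-in so the example runs without downloading a model.

```python
class ByteTokenizer:
    """Hypothetical stand-in with the same encode/decode shape: one id per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def check_roundtrip(tokenizer, text: str) -> list[int]:
    """Encode text, decode the ids, and raise if the original text does not come back."""
    ids = list(tokenizer.encode(text))
    decoded = tokenizer.decode(ids)
    if decoded != text:
        raise ValueError(f"round trip mismatch: {text!r} -> {decoded!r}")
    return ids


ids = check_roundtrip(ByteTokenizer(), "Hello, world!")
print(len(ids), "ids, round trip OK")
```

With the package installed, passing `tokenizerz.Tokenizer()` in place of `ByteTokenizer()` should exercise the real tokenizer in the same way.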
Shell:
bpe --encode "hello world"