wordchipper 0.9 is out!

wordchipper is a high-performance Rust byte-pair encoding (BPE) tokenizer for the OpenAI GPT-2 tokenizer family. In Rust, on a 64-core machine, we see throughput speedups over tiktoken-rs of ~4.3x-5.7x (4 to 64 cores) for general regex BPE vocabularies, and ~6.9x-9.2x when using custom DFA lexers for specific OpenAI vocabularies. Under the Python wrappers, we see speedups of ~2x-4x (4 to 64 cores) over tiktoken.

We’re publishing a paper on this work: