Announcing: wordchipper 0.7.3
- Much work on docs / interface unification.
- 2x tiktoken-rs speed on non-harmony models (it’s the nasty regex).
wordchipper is a high-performance Rust BPE tokenizer trainer/encoder/decoder. It is ready for alpha users, and runs at 2x the speed of tiktoken-rs for many current models.
Productionization toward a stable long-term release can be tracked in the Alpha Release Tracking Issue.
Throughput comparison (higher is better):

| Model | wordchipper | tiktoken-rs | tokenizers |
|---|---|---|---|
| r50k_base | 239.19 MiB/s | 169.30 MiB/s | 22.03 MiB/s |
| p50k_base | 250.55 MiB/s | 163.07 MiB/s | 22.23 MiB/s |
| p50k_edit | 241.69 MiB/s | 169.76 MiB/s | 21.27 MiB/s |
| cl100k_base | 214.26 MiB/s | 125.43 MiB/s | 21.62 MiB/s |
| o200k_base | 119.49 MiB/s | 123.75 MiB/s | 22.03 MiB/s |
| o200k_harmony | 121.80 MiB/s | 121.54 MiB/s | 22.08 MiB/s |
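If you want a rough sanity check of these numbers on your own corpus, a minimal timer could look like the sketch below, given an encoder handle (see the loading example that follows). The `encode(&str)` call is an assumption inferred from the `TokenEncoder` trait name, not confirmed wordchipper API.

```rust
use std::time::Instant;

use wordchipper::TokenEncoder;

// A rough throughput check. This is a sketch, not the project's
// benchmark harness; `encode(&str)` is an assumed method name.
fn measure_mib_per_s(encoder: &dyn TokenEncoder<u32>, corpus: &str) -> f64 {
    let start = Instant::now();
    let _ = encoder.encode(corpus);
    let secs = start.elapsed().as_secs_f64();
    (corpus.len() as f64 / (1024.0 * 1024.0)) / secs
}
```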
Loading pre-trained models:
```rust
use std::sync::Arc;

use wordchipper::{
    disk_cache::WordchipperDiskCache,
    get_model, TokenDecoder, TokenEncoder, UnifiedTokenVocab,
};

fn example() -> anyhow::Result<(
    Arc<dyn TokenEncoder<u32>>,
    Arc<dyn TokenDecoder<u32>>,
)> {
    // Disk cache so vocab files are only fetched once.
    let mut disk_cache = WordchipperDiskCache::default();

    // Load (or download and cache) the o200k_harmony vocabulary.
    let vocab: UnifiedTokenVocab<u32> =
        get_model("openai/o200k_harmony", &mut disk_cache)?;

    // Build the default encoder/decoder pair from the vocabulary.
    let encoder = vocab.to_default_encoder();
    let decoder = vocab.to_default_decoder();

    Ok((encoder, decoder))
}
```
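And a minimal round trip with the handles above. This is a sketch: the `encode`/`decode` method names and signatures are assumptions based on the `TokenEncoder`/`TokenDecoder` trait names, not confirmed API.

```rust
fn round_trip() -> anyhow::Result<()> {
    let (encoder, decoder) = example()?;

    // Encode text to token ids, then decode back (method names assumed).
    let tokens: Vec<u32> = encoder.encode("Hello, wordchipper!")?;
    let text = decoder.decode(&tokens)?;
    assert_eq!(text, "Hello, wordchipper!");

    Ok(())
}
```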