estimate_tokens()

Estimate the token count for a string using a character-based heuristic.

Usage

estimate_tokens(text)

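Uses the approximation 1 token ≈ 4 characters, a common rule of thumb for English text under BPE-style tokenizers (such as those used by GPT, Claude, and Llama models). The result is an estimate, not an exact count. For non-English or code-heavy text it may undercount, because such text often produces more tokens per character.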
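The heuristic described above can be sketched in a few lines. This is a hypothetical stand-in, not the talk_box source: the name estimate_tokens_sketch, the chars_per_token parameter, the floor-division rounding, and the return value of 0 for empty text are all assumptions made for illustration.

```python
def estimate_tokens_sketch(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate from character count (illustrative only)."""
    if not text:
        # Assumption: empty input yields 0; the documented minimum of 1
        # is stated only for non-empty text.
        return 0
    # Assumption: round down, then clamp so non-empty text is at least 1.
    return max(1, len(text) // chars_per_token)


if __name__ == "__main__":
    for sample in ["Hi", "Hello, world!", "a" * 400]:
        print(len(sample), estimate_tokens_sketch(sample))
```

A different rounding choice (for example, rounding up) would shift short-string results by one token, which is why the estimate is best treated as a budget guide and not as an exact count.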

Parameters

text (str): The text to estimate tokens for.

Returns

int: Estimated token count (always at least 1 for non-empty text).

Examples

import talk_box as tb

tokens = tb.estimate_tokens("Hello, world!")
print(tokens)  # roughly 3 to 4 (13 characters / 4 ≈ 3.25)