Tokenizers
Tokenizers split text into individual tokens for indexing and searching.
Available Tokenizers
Tokenizer
Abstract base class for implementing custom tokenizers.
StandardTokenizer
Word-based tokenization that splits on hyphens, spaces, commas, and periods.
Best for: English and Western languages, general text tokenization, word-based search
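The splitting rule described above can be sketched as a standalone function. This is only an illustration of the stated behavior, not the library's StandardTokenizer; the function name `standardTokenize` is hypothetical, and treating any whitespace (not just spaces) as a delimiter is an assumption.

```javascript
// Illustrative sketch only: not the library's StandardTokenizer implementation.
function standardTokenize(text) {
  // Split on runs of hyphens, whitespace, commas, and periods,
  // then drop the empty strings left by leading or trailing delimiters.
  return text.split(/[-\s,.]+/).filter((token) => token.length > 0);
}

console.log(standardTokenize("Full-text search, made simple."));
// → ["Full", "text", "search", "made", "simple"]
```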
IntlSegmenterTokenizer
Locale-aware word segmentation using the JavaScript Intl.Segmenter API.
Best for: Multilingual text, languages without spaces (Chinese, Japanese, Thai), locale-specific word boundaries
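For reference, this is roughly how word segmentation with the standard `Intl.Segmenter` API looks when used directly. The wrapper function `segmentWords` is a hypothetical name, and filtering on `isWordLike` to drop spaces and punctuation is an assumption about what a tokenizer would want, not a statement about how IntlSegmenterTokenizer is implemented.

```javascript
// Illustrative sketch only: direct use of Intl.Segmenter, not the library's class.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const tokens = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) tokens.push(segment);
  }
  return tokens;
}

console.log(segmentWords("Hello, world!", "en"));
// → ["Hello", "world"]

// For languages without spaces, boundaries come from the runtime's ICU data,
// so the exact segments can vary between environments.
console.log(segmentWords("今日は良い天気です", "ja"));
```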
KeywordTokenizer
Returns the entire input as a single token.
Best for: Exact string matching, IDs and identifiers, categories and tags
LetterTokenizer
Splits on non-letter characters using the Unicode letter property.
Best for: Extracting letter sequences, international text, when numbers/punctuation should be removed
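Splitting on non-letters is equivalent to collecting maximal runs of letters, which a Unicode property escape expresses directly. The sketch below assumes the "Unicode letter property" means the general category `\p{L}`; the function name `letterTokenize` is hypothetical and this is not the library's code.

```javascript
// Illustrative sketch only: assumes "letter" means Unicode general category L.
function letterTokenize(text) {
  // The `u` flag enables \p{...}; match() returns null when nothing matches.
  return text.match(/\p{L}+/gu) ?? [];
}

console.log(letterTokenize("R2-D2 says привет!"));
// → ["R", "D", "says", "привет"]
```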
LowerCaseTokenizer
Splits on non-letter characters and lowercases each token.
Best for: Case-insensitive search, letter-only tokenization, simple text analysis
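A minimal sketch of the same idea with lowercasing applied to each token. It assumes locale-independent lowercasing via `String.prototype.toLowerCase`; whether the library uses locale-aware rules is not stated here, and `lowerCaseTokenize` is a hypothetical name.

```javascript
// Illustrative sketch only: letter runs, each lowercased.
function lowerCaseTokenize(text) {
  const tokens = text.match(/\p{L}+/gu) ?? [];
  return tokens.map((token) => token.toLowerCase());
}

console.log(lowerCaseTokenize("Hello World-2024 FOO"));
// → ["hello", "world", "foo"]
```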
WhitespaceTokenizer
Splits text on whitespace characters.
Best for: Preserving punctuation, pre-tokenized input, space-delimited data
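To show why whitespace splitting preserves punctuation, here is a small sketch. The function name `whitespaceTokenize` is hypothetical, and the choice to collapse consecutive whitespace and discard empty tokens is an assumption.

```javascript
// Illustrative sketch only: split on runs of whitespace, keep everything else.
function whitespaceTokenize(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

console.log(whitespaceTokenize("  Hello, world!  (v2.0)"));
// → ["Hello,", "world!", "(v2.0)"]
```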
NGramTokenizer
Generates character n-grams for partial matching.
Best for: Partial/substring matching, autocomplete suggestions, fuzzy matching, search-as-you-type
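Character n-gram generation can be sketched as a sliding window over the input. The parameter names `minGram` and `maxGram` are assumptions borrowed from common search-engine terminology, not confirmed option names of NGramTokenizer, and the emission order shown is just one reasonable choice.

```javascript
// Illustrative sketch only: emits every substring whose length lies in
// [minGram, maxGram], ordered by start position and then by length.
// Note: indexes UTF-16 code units, so astral characters would be split.
function ngrams(text, minGram, maxGram) {
  const tokens = [];
  for (let start = 0; start < text.length; start++) {
    for (
      let size = minGram;
      size <= maxGram && start + size <= text.length;
      size++
    ) {
      tokens.push(text.slice(start, start + size));
    }
  }
  return tokens;
}

console.log(ngrams("fox", 1, 2));
// → ["f", "fo", "o", "ox", "x"]
```

Because every substring in the size range is indexed, a query fragment such as "ox" matches without needing wildcards, at the cost of a noticeably larger index.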
PathHierarchyTokenizer
Splits paths into hierarchical components.
Best for: File system paths, URL paths, package names, hierarchical identifiers
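One common interpretation of hierarchical path tokenization is to emit each ancestor prefix of the path, which lets a search for a parent directory match everything beneath it. The sketch below follows that interpretation as an assumption; the text above does not say whether PathHierarchyTokenizer emits prefixes or individual segments. The function name `pathHierarchy` and the `delimiter` parameter are hypothetical.

```javascript
// Illustrative sketch only: emits cumulative prefixes of a delimited path.
function pathHierarchy(path, delimiter = "/") {
  const parts = path.split(delimiter).filter((part) => part.length > 0);
  // Keep a leading delimiter so absolute paths stay absolute.
  const prefix = path.startsWith(delimiter) ? delimiter : "";
  const tokens = [];
  let current = "";
  parts.forEach((part, index) => {
    current = index === 0 ? prefix + part : current + delimiter + part;
    tokens.push(current);
  });
  return tokens;
}

console.log(pathHierarchy("/usr/local/bin"));
// → ["/usr", "/usr/local", "/usr/local/bin"]

console.log(pathHierarchy("org.example.search", "."));
// → ["org", "org.example", "org.example.search"]
```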
PatternTokenizer
Flexible regex-based tokenization with split or capture modes.
Best for: Custom tokenization patterns, complex text parsing, domain-specific formats
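The difference between split and capture modes can be illustrated with plain regular expressions. The function `patternTokenize` and its `mode` argument are hypothetical and do not reflect PatternTokenizer's actual options; the sketch also assumes the pattern has no capturing groups, since `String.prototype.split` would include them in its output.

```javascript
// Illustrative sketch only: one regex, two ways of turning it into tokens.
function patternTokenize(text, pattern, mode = "split") {
  if (mode === "capture") {
    // matchAll requires the global flag, so add it if it is missing.
    const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
    const globalPattern = new RegExp(pattern.source, flags);
    // Each full match becomes a token.
    return Array.from(text.matchAll(globalPattern), (match) => match[0]);
  }
  // Split mode: the matches are delimiters and the text between them is kept.
  return text.split(pattern).filter((token) => token.length > 0);
}

console.log(patternTokenize("a1b22c333", /\d+/, "split"));
// → ["a", "b", "c"]

console.log(patternTokenize("a1b22c333", /\d+/, "capture"));
// → ["1", "22", "333"]
```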
SimplePatternTokenizer
Captures text matching a pattern as tokens.
Best for: Extracting specific patterns, number extraction, simple pattern matching
SimplePatternSplitTokenizer
Splits input at pattern matches.
Best for: Custom delimiters, CSV-like data, simple splitting logic
URLEmailTokenizer
Preserves URLs and email addresses as complete tokens.
Best for: Content with URLs, email address extraction, web content indexing
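A rough sketch of how URLs and email addresses can be kept intact while the rest of the text is split into words. The regular expression is deliberately simplified and is an assumption for illustration: it only recognizes `http`/`https` URLs, uses a loose email shape, and would absorb punctuation that directly follows a URL. It is not the matching logic of URLEmailTokenizer, and `urlEmailTokenize` is a hypothetical name.

```javascript
// Illustrative sketch only: alternatives are tried left to right at each
// position, so URLs and emails win over plain word matching.
const TOKEN_PATTERN =
  /https?:\/\/\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+|[\p{L}\p{N}]+/gu;

function urlEmailTokenize(text) {
  return text.match(TOKEN_PATTERN) ?? [];
}

console.log(
  urlEmailTokenize("Mail ada@example.com or visit https://example.com/docs today")
);
// → ["Mail", "ada@example.com", "or", "visit", "https://example.com/docs", "today"]
```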