Token Filters

Token filters transform or remove tokens after tokenization.

Base Class

TokenFilter

Abstract base class for implementing custom token filters.
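
As a rough illustration of what implementing a custom filter can look like, here is a minimal Python sketch. It is not this library's actual API: the `filter` method name, the use of plain strings as tokens, and the `ExclaimFilter` and `run_chain` helpers are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, Sequence


class TokenFilter(ABC):
    """Hypothetical base class: consumes a token stream and yields a new one."""

    @abstractmethod
    def filter(self, tokens: Iterable[str]) -> Iterator[str]:
        """Transform or drop tokens from the incoming stream."""


class ExclaimFilter(TokenFilter):
    """Toy custom filter (illustrative only) that appends '!' to every token."""

    def filter(self, tokens: Iterable[str]) -> Iterator[str]:
        for token in tokens:
            yield token + "!"


def run_chain(tokens: Iterable[str], filters: Sequence[TokenFilter]) -> list:
    """Apply each filter in order, feeding one filter's output into the next."""
    stream: Iterable[str] = tokens
    for f in filters:
        stream = f.filter(stream)
    return list(stream)
```

Modeling a filter as a function from one token stream to another is what lets filters be chained in any order after tokenization.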

Case Transformation

LowerCaseFilter

Converts all tokens to lowercase for case-insensitive search.

UpperCaseFilter

Converts all tokens to uppercase.

Character Normalization

CJKWidthFilter

Normalizes the width of CJK (Chinese, Japanese, Korean) characters, such as full-width Latin letters and half-width Katakana.

ASCIIFoldingFilter

Converts accented characters to their ASCII equivalents.
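
The two normalizations above can be approximated with Python's standard `unicodedata` module. This is only a sketch of the idea, not the filters' real implementation: `fold_width` and `fold_ascii` are hypothetical names, NFKC applies more compatibility mappings than width folding alone, and stripping combining marks does not cover letters such as "ø" or "ß" that have no decomposition.

```python
import unicodedata


def fold_width(token: str) -> str:
    """Approximate width normalization with NFKC compatibility mapping."""
    return unicodedata.normalize("NFKC", token)


def fold_ascii(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_width("ＡＢＣ１２３"))  # full-width Latin letters and digits
print(fold_ascii("café"))
```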

Stop Words and Filtering

StopFilter

Removes stop words from the token stream.

KeepWordFilter

Keeps only tokens that match a specified word list.
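
StopFilter and KeepWordFilter are mirror images of each other: one drops tokens found in a word list, the other drops everything else. The sketch below shows that relationship with hypothetical function names and plain string tokens; exact-match, case-sensitive comparison is an assumption.

```python
def stop_filter(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]


def keep_word_filter(tokens, keep_words):
    """Keep only the tokens that appear in the keep-word list."""
    keep = set(keep_words)
    return [t for t in tokens if t in keep]


tokens = ["the", "quick", "brown", "fox"]
print(stop_filter(tokens, ["the", "a", "an"]))
print(keep_word_filter(tokens, ["fox", "dog"]))
```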

Length Filtering

LengthFilter

Filters out tokens outside a specified length range.

TruncateFilter

Truncates tokens exceeding a specified length.

LimitTokenCountFilter

Limits the total number of tokens output.
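
The three filters in this group differ in what they act on: LengthFilter removes whole tokens, TruncateFilter shortens them, and LimitTokenCountFilter caps the stream. A minimal sketch, assuming inclusive length bounds, lengths measured in characters, and hypothetical function and parameter names:

```python
from itertools import islice


def length_filter(tokens, min_len, max_len):
    """Drop tokens shorter than min_len or longer than max_len (inclusive)."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def truncate_filter(tokens, max_len):
    """Cut each token down to at most max_len characters."""
    return [t[:max_len] for t in tokens]


def limit_token_count(tokens, max_tokens):
    """Emit at most max_tokens tokens, then stop reading the stream."""
    return list(islice(tokens, max_tokens))


tokens = ["a", "search", "internationalization", "engine"]
print(length_filter(tokens, 2, 10))
print(truncate_filter(tokens, 5))
print(limit_token_count(tokens, 2))
```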

String Manipulation

TrimFilter

Removes leading and trailing whitespace from tokens.

ReverseStringFilter

Reverses each token character-by-character.

Stemming

PorterStemFilter

Applies the Porter stemming algorithm for English.

SnowballFilter

Applies Snowball stemming algorithms, which are available for multiple languages.

Language-Specific

ElisionFilter

Removes elisions (e.g., l', d' in French).

ApostropheFilter

Removes the apostrophe and any text that follows it from each token.

EnglishPossessiveFilter

Removes English possessive suffixes ('s).
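
To make the difference between these three filters concrete, here is a small sketch of each on a single token. The function names and the sample article list are assumptions, and only the ASCII apostrophe (`'`) is handled; a real filter would likely also cover typographic apostrophes.

```python
# Assumed sample of French elided articles; a real list may differ.
ELISION_ARTICLES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}


def elision_filter(token):
    """Strip a leading elided article such as l' or d'."""
    head, sep, tail = token.partition("'")
    if sep and tail and head.lower() in ELISION_ARTICLES:
        return tail
    return token


def apostrophe_filter(token):
    """Drop the first apostrophe and everything after it."""
    return token.split("'", 1)[0]


def english_possessive_filter(token):
    """Drop a trailing 's possessive suffix."""
    if token.endswith(("'s", "'S")):
        return token[:-2]
    return token


print(elision_filter("l'avion"))
print(apostrophe_filter("Ankara'da"))
print(english_possessive_filter("dog's"))
```

Note that the elision filter keeps what follows the apostrophe, while the apostrophe filter keeps what precedes it.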

N-Grams and Shingles

NGramTokenFilter

Generates character n-grams from each token, operating on tokens rather than on the raw text.

EdgeNGramTokenFilter

Generates edge n-grams from the beginning of each token.

ShingleFilter

Creates word shingles (multi-word tokens) from consecutive tokens.

CJKBigramFilter

Forms bigrams of CJK (Chinese, Japanese, Korean) characters.

CommonGramsFilter

Generates bigrams for frequently occurring terms.
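
The sketch below contrasts character-level n-grams with word-level shingles. The function names, the `min_gram`/`max_gram`/`size` parameters, and the output ordering are assumptions; the library's filters may order their output differently or also emit the original tokens.

```python
def ngrams(token, min_gram, max_gram):
    """All character n-grams of a token, grouped by start offset."""
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
    return out


def edge_ngrams(token, min_gram, max_gram):
    """Only the n-grams anchored at the start of the token (prefixes)."""
    upper = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, upper + 1)]


def shingles(tokens, size=2):
    """Word-level n-grams: join each run of `size` consecutive tokens."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(ngrams("abc", 1, 2))
print(edge_ngrams("search", 1, 3))
print(shingles(["please", "divide", "this"], 2))
```

Edge n-grams are a common choice for prefix or autocomplete matching, since every emitted gram is a prefix of the original token.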

Deduplication

UniqueFilter

Removes duplicate tokens from the token stream.

RemoveDuplicatesTokenFilter

Removes duplicate tokens at the same position.
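
The distinction between the two filters is easiest to see with positions attached to tokens. In this sketch tokens are `(term, position)` tuples, which is an assumed representation, and both function names are hypothetical. One reading of the descriptions above is that UniqueFilter ignores position while RemoveDuplicatesTokenFilter only collapses repeats of a term at one position.

```python
def unique_filter(tokens):
    """Keep only the first occurrence of each term, regardless of position."""
    seen = set()
    out = []
    for term, pos in tokens:
        if term not in seen:
            seen.add(term)
            out.append((term, pos))
    return out


def remove_duplicates_same_position(tokens):
    """Drop a token only if the same term was already seen at that position."""
    seen = set()
    out = []
    for term, pos in tokens:
        if (term, pos) not in seen:
            seen.add((term, pos))
            out.append((term, pos))
    return out


tokens = [("fast", 0), ("quick", 0), ("fast", 0), ("car", 1), ("fast", 2)]
print(unique_filter(tokens))
print(remove_duplicates_same_position(tokens))
```

Same-position duplicates typically arise when earlier filters (for example synonym expansion or stemming) map different inputs to the same term.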

Keyword Handling

KeywordRepeatFilter

Emits each token twice, once marked as a keyword and once unmarked, so that both the original and the processed (for example, stemmed) form are kept.

KeywordMarkerFilter

Marks specified tokens as keywords to protect them from stemming.
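
The following sketch shows how a keyword flag can interact with a downstream stemmer. Everything here is illustrative: the `Token` dataclass, the `is_keyword` field, the function names, and the one-rule `toy_stem` stand-in (which is not Porter or Snowball) are assumptions, as is the reading that the repeat filter emits a keyword copy followed by a plain copy.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    is_keyword: bool = False  # assumed flag that stemmers check


def keyword_marker(tokens, protected):
    """Flag tokens whose term is in the protected set."""
    return [Token(t.term, t.is_keyword or t.term in protected) for t in tokens]


def keyword_repeat(tokens):
    """Emit each token twice: a keyword copy, then a plain copy."""
    out = []
    for t in tokens:
        out.append(Token(t.term, True))
        out.append(Token(t.term, False))
    return out


def toy_stem(tokens):
    """Stand-in stemmer: strips one trailing 's' unless the token is a keyword."""
    out = []
    for t in tokens:
        if t.is_keyword or not t.term.endswith("s"):
            out.append(t)
        else:
            out.append(Token(t.term[:-1], False))
    return out


marked = keyword_marker([Token("cats"), Token("news")], {"news"})
print([t.term for t in toy_stem(marked)])
print([t.term for t in toy_stem(keyword_repeat([Token("cats")]))])
```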

Conditional

ConditionalTokenFilter

Applies a token filter conditionally based on a predicate function.
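
A conditional filter can be thought of as a wrapper that consults a predicate per token and either applies the wrapped filter or passes the token through unchanged. The sketch below assumes hypothetical names and a wrapped filter that maps one token to one token.

```python
def conditional_filter(tokens, predicate, wrapped):
    """Apply `wrapped` to tokens where `predicate` is true; pass others through."""
    return [wrapped(t) if predicate(t) else t for t in tokens]


# Example: lowercase everything except all-caps acronyms.
tokens = ["NASA", "Hello", "ok"]
print(conditional_filter(tokens, lambda t: not t.isupper(), str.lower))
```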

Released under the MIT License.