Token Filters
Token filters transform or remove tokens after tokenization.
Base Class
TokenFilter
Abstract base class for implementing custom token filters.
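The shape of a custom filter can be sketched in a few lines. The class name `TokenFilterSketch`, the method name `filter`, and the use of plain strings as tokens are assumptions made for illustration; they are not taken from this library's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class TokenFilterSketch(ABC):
    """Hypothetical base class: consumes a token stream and yields a new one."""

    @abstractmethod
    def filter(self, tokens: Iterable[str]) -> Iterator[str]:
        """Yield transformed tokens; not yielding a token removes it."""


class MinLengthFilter(TokenFilterSketch):
    """Hypothetical custom filter: drops tokens shorter than min_len."""

    def __init__(self, min_len: int) -> None:
        self.min_len = min_len

    def filter(self, tokens: Iterable[str]) -> Iterator[str]:
        for token in tokens:
            if len(token) >= self.min_len:
                yield token


result = list(MinLengthFilter(3).filter(["a", "quick", "ox", "fox"]))
```

Working on iterators rather than lists lets filters be chained without materializing the intermediate streams.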
Case Transformation
LowerCaseFilter
Converts all tokens to lowercase for case-insensitive search.
UpperCaseFilter
Converts all tokens to uppercase.
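As a rough illustration of case transformation, the sketch below treats tokens as plain strings; the function names are hypothetical and do not reflect the library's API.

```python
from typing import Iterable, Iterator


def lower_case(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each token in lowercase."""
    for token in tokens:
        yield token.lower()


def upper_case(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each token in uppercase."""
    for token in tokens:
        yield token.upper()


lowered = list(lower_case(["Quick", "BROWN", "fox"]))
uppered = list(upper_case(["Quick", "fox"]))
```

For case-insensitive search, the same case filter would normally be applied to both indexed text and queries.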
Character Normalization
CJKWidthFilter
Normalizes CJK (Chinese, Japanese, Korean) character widths.
ASCIIFoldingFilter
Converts accented characters to their ASCII equivalents.
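Both normalizations can be approximated with Python's standard `unicodedata` module. This is a sketch under stated assumptions, not the filters' actual implementation: NFKC folds more than character widths (ligatures, for example), and stripping combining marks does not cover letters such as "ø" or "ß" that have no decomposition. The function names are hypothetical.

```python
import unicodedata


def fold_width(token: str) -> str:
    """Approximate width folding with NFKC: full-width Latin becomes ASCII,
    half-width katakana becomes full-width katakana."""
    return unicodedata.normalize("NFKC", token)


def fold_ascii(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


width_folded = fold_width("ＡＢＣ１２３")  # full-width letters and digits
ascii_folded = fold_ascii("café")
```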
Stop Words and Filtering
StopFilter
Removes stop words from the token stream.
KeepWordFilter
Keeps only tokens that match a specified word list.
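The two filters are mirror images of each other: one drops tokens found in a word list, the other drops tokens not found in it. A minimal sketch with hypothetical function names, assuming string tokens and exact (case-sensitive) matching:

```python
from typing import Iterable, Iterator, Set


def stop_filter(tokens: Iterable[str], stop_words: Set[str]) -> Iterator[str]:
    """Drop every token that appears in stop_words."""
    for token in tokens:
        if token not in stop_words:
            yield token


def keep_word_filter(tokens: Iterable[str], keep_words: Set[str]) -> Iterator[str]:
    """Keep only tokens that appear in keep_words."""
    for token in tokens:
        if token in keep_words:
            yield token


tokens = ["the", "quick", "brown", "fox"]
without_stops = list(stop_filter(tokens, {"the", "a", "an"}))
only_kept = list(keep_word_filter(tokens, {"fox", "dog"}))
```

Because matching is exact in this sketch, a case filter would usually run before either of these.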
Length Filtering
LengthFilter
Filters out tokens outside a specified length range.
TruncateFilter
Truncates tokens exceeding a specified length.
LimitTokenCountFilter
Limits the total number of tokens output.
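The difference between the three filters is easiest to see side by side. The sketch below uses hypothetical function names and assumes inclusive length bounds measured in characters; the library may define its bounds differently.

```python
from itertools import islice
from typing import Iterable, Iterator


def length_filter(tokens: Iterable[str], min_len: int, max_len: int) -> Iterator[str]:
    """Drop tokens whose length is outside [min_len, max_len] (inclusive)."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token


def truncate_filter(tokens: Iterable[str], max_len: int) -> Iterator[str]:
    """Keep every token, cutting long ones down to max_len characters."""
    for token in tokens:
        yield token[:max_len]


def limit_token_count(tokens: Iterable[str], max_count: int) -> Iterator[str]:
    """Pass through at most max_count tokens, then stop."""
    return islice(tokens, max_count)


tokens = ["a", "search", "internationalization", "ok"]
in_range = list(length_filter(tokens, 2, 10))
truncated = list(truncate_filter(tokens, 5))
limited = list(limit_token_count(tokens, 2))
```

Note that length filtering removes whole tokens, truncation shortens them, and the count limit cuts off the stream regardless of token length.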
String Manipulation
TrimFilter
Removes leading and trailing whitespace from tokens.
ReverseStringFilter
Reverses each token character-by-character.
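Both operations map directly onto built-in string operations. The sketch below uses hypothetical function names; reversal here works on code points, so tokens containing combining marks would need extra care.

```python
from typing import Iterable, Iterator


def trim_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Strip leading and trailing whitespace from each token."""
    for token in tokens:
        yield token.strip()


def reverse_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Reverse each token code point by code point."""
    for token in tokens:
        yield token[::-1]


trimmed = list(trim_filter(["  fox ", "dog\t"]))
reversed_tokens = list(reverse_filter(["matching", "fox"]))

# A suffix query ("...ing") becomes a prefix match on the reversed token.
matches_suffix = reversed_tokens[0].startswith("ing"[::-1])
```

A common reason to reverse tokens is exactly this: indexes that handle prefix matching efficiently can then serve suffix queries as well.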
Stemming
PorterStemFilter
Applies the Porter stemming algorithm for English.
SnowballFilter
Applies the Snowball stemming algorithms, which cover multiple languages.
Language-Specific
ElisionFilter
Removes elided articles from the start of tokens (e.g., l', d' in French).
ApostropheFilter
Removes the apostrophe and all text after it from each token.
EnglishPossessiveFilter
Removes English possessive suffixes ('s).
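The three filters differ in which side of the apostrophe they keep. The sketch below is illustrative only: the function names and the article list are assumptions, and only the ASCII apostrophe (') is handled, not the typographic one.

```python
from typing import Tuple

# Assumed list of French elided forms; the library's default set may differ.
FRENCH_ARTICLES: Tuple[str, ...] = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")


def strip_elision(token: str, articles: Tuple[str, ...] = FRENCH_ARTICLES) -> str:
    """Drop a leading elided article: keeps the text AFTER the apostrophe."""
    head, sep, tail = token.partition("'")
    if sep and tail and head.lower() in articles:
        return tail
    return token


def strip_after_apostrophe(token: str) -> str:
    """Keep only the text BEFORE the first apostrophe."""
    return token.partition("'")[0]


def strip_possessive(token: str) -> str:
    """Drop a trailing 's, leaving other tokens unchanged."""
    if token.lower().endswith("'s"):
        return token[:-2]
    return token


elided = strip_elision("l'avion")
untouched = strip_elision("aujourd'hui")  # "aujourd" is not an article
stem_part = strip_after_apostrophe("Ankara'da")
owner = strip_possessive("dog's")
```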
N-Grams and Shingles
NGramTokenFilter
Generates character n-grams from each token.
EdgeNGramTokenFilter
Generates edge n-grams from the beginning of each token.
ShingleFilter
Creates word shingles (multi-word tokens) from consecutive tokens.
CJKBigramFilter
Forms bigrams of CJK (Chinese, Japanese, Korean) characters.
CommonGramsFilter
Generates bigrams that combine frequently occurring terms with their neighboring tokens.
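N-grams, edge n-grams, and shingles are all sliding windows, applied either to the characters of one token or to a sequence of tokens. The sketch below uses hypothetical function names; the output ordering (grouped by gram size) and the space separator for shingles are assumptions, and real implementations may order or join differently.

```python
from typing import List


def char_ngrams(token: str, min_n: int, max_n: int) -> List[str]:
    """All character windows of size min_n..max_n, grouped by size."""
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


def edge_ngrams(token: str, min_n: int, max_n: int) -> List[str]:
    """Prefixes of the token with length min_n..max_n."""
    return [token[:n] for n in range(min_n, min(max_n, len(token)) + 1)]


def shingles(tokens: List[str], size: int, sep: str = " ") -> List[str]:
    """Join each run of `size` consecutive tokens into one multi-word token."""
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


grams = char_ngrams("fox", 2, 3)
prefixes = edge_ngrams("quick", 1, 3)
pairs = shingles(["the", "quick", "fox"], 2)
cjk_bigrams = char_ngrams("東京都", 2, 2)  # bigrams over CJK characters
```

Edge n-grams are the usual building block for search-as-you-type, since every prefix a user might have typed so far is indexed as its own token.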
Deduplication
UniqueFilter
Removes duplicate tokens from the token stream.
RemoveDuplicatesTokenFilter
Removes duplicate tokens at the same position.
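The distinction between the two filters only shows up once tokens carry positions. The sketch below models a token as a hypothetical `(term, position)` tuple; the function names are likewise assumptions.

```python
from typing import Iterable, Iterator, Set, Tuple

PositionedToken = Tuple[str, int]  # (term, position) - an assumed token model


def unique_filter(tokens: Iterable[PositionedToken]) -> Iterator[PositionedToken]:
    """Drop any term already seen anywhere earlier in the stream."""
    seen: Set[str] = set()
    for term, position in tokens:
        if term not in seen:
            seen.add(term)
            yield (term, position)


def remove_duplicates_at_position(
    tokens: Iterable[PositionedToken],
) -> Iterator[PositionedToken]:
    """Drop a term only if the same term was already seen at the same position."""
    seen: Set[PositionedToken] = set()
    for token in tokens:
        if token not in seen:
            seen.add(token)
            yield token


stream = [("fast", 0), ("fast", 0), ("car", 1), ("fast", 2)]
globally_unique = list(unique_filter(stream))
unique_per_position = list(remove_duplicates_at_position(stream))
```

Same-position duplicates typically arise when an earlier step, such as synonym expansion followed by stemming, produces identical terms at one position.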
Keyword Handling
KeywordRepeatFilter
Emits each token twice, once marked as a keyword and once unmarked, so that later filters such as stemmers process only the unmarked copy.
KeywordMarkerFilter
Marks specified tokens as keywords to protect them from stemming.
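How a keyword flag protects tokens can be sketched with a toy stemmer. Everything here is hypothetical: the `Token` dataclass, the function names, and `toy_stem`, which is a deliberately naive suffix stripper and not the Porter or Snowball algorithm. The repeat behavior shown (each token emitted once as a keyword and once unmarked) is one plausible reading of "dual processing".

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass
class Token:
    text: str
    is_keyword: bool = False


def mark_keywords(tokens: Iterable[Token], keywords: Set[str]) -> Iterator[Token]:
    """Flag tokens whose text is in the keyword set."""
    for token in tokens:
        yield Token(token.text, token.is_keyword or token.text in keywords)


def keyword_repeat(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token twice: a keyword copy, then an unmarked copy."""
    for token in tokens:
        yield Token(token.text, True)
        yield Token(token.text, False)


def toy_stem(tokens: Iterable[Token]) -> Iterator[Token]:
    """Naive suffix stripping (NOT Porter); skips tokens flagged as keywords."""
    for token in tokens:
        if token.is_keyword:
            yield token
        elif token.text.endswith("ing") and len(token.text) > 5:
            yield Token(token.text[:-3])
        elif token.text.endswith("s") and len(token.text) > 3:
            yield Token(token.text[:-1])
        else:
            yield token


stream = [Token("boxing"), Token("rings")]
protected = [t.text for t in toy_stem(mark_keywords(stream, {"boxing"}))]
both_forms = [t.text for t in toy_stem(keyword_repeat([Token("boxing")]))]
```

With the repeat approach, the index holds both the original and the stemmed form, so exact and stemmed matches can both succeed.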
Conditional
ConditionalTokenFilter
Applies a token filter conditionally based on a predicate function.
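Conditional application can be sketched as a wrapper that consults the predicate for each token. As a simplification, the wrapped filter is modeled here as a per-token function rather than a full stream filter; the function name and signature are assumptions.

```python
from typing import Callable, Iterable, Iterator


def conditional_filter(
    tokens: Iterable[str],
    predicate: Callable[[str], bool],
    transform: Callable[[str], str],
) -> Iterator[str]:
    """Apply transform only to tokens for which predicate returns True."""
    for token in tokens:
        yield transform(token) if predicate(token) else token


# Lowercase everything except all-uppercase tokens such as acronyms.
result = list(
    conditional_filter(
        ["NASA", "Launches", "Rocket"],
        predicate=lambda t: not t.isupper(),
        transform=str.lower,
    )
)
```

Tokens that fail the predicate pass through unchanged rather than being removed.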