ICUNormalizer

Unicode text normalization using ICU normalization forms.

Import

typescript
import ICUNormalizer from 'dynamosearch/char_filters/ICUNormalizer';

Constructor

typescript
new ICUNormalizer(options?: { name?: 'nfc' | 'nfkc'; mode?: 'compose' | 'decompose' })

Parameters

  • name ('nfc' | 'nfkc', optional) - Normalization form (default: 'nfkc')
  • mode ('compose' | 'decompose', optional) - Composition mode (default: 'compose')

Normalization Forms

The filter supports four Unicode normalization forms:

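| name   | mode        | Result | Description                 |
| ------ | ----------- | ------ | --------------------------- |
| 'nfc'  | 'compose'   | NFC    | Canonical Composition       |
| 'nfc'  | 'decompose' | NFD    | Canonical Decomposition     |
| 'nfkc' | 'compose'   | NFKC   | Compatibility Composition   |
| 'nfkc' | 'decompose' | NFKD   | Compatibility Decomposition |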
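
The mapping from the two options to a normalization form can be sketched as below. This is only an illustration: `resolveForm` and `normalizeWith` are hypothetical names that are not part of dynamosearch, and the standard `String.prototype.normalize` is used as a stand-in for the ICU-backed filter, which may differ in edge cases.

```typescript
// Hypothetical helpers for illustration; not part of dynamosearch.
type FormName = 'nfc' | 'nfkc';
type Mode = 'compose' | 'decompose';
type UnicodeForm = 'NFC' | 'NFD' | 'NFKC' | 'NFKD';

// Mirrors the table: name picks canonical vs. compatibility,
// mode picks composed vs. decomposed output.
function resolveForm(name: FormName = 'nfkc', mode: Mode = 'compose'): UnicodeForm {
  if (name === 'nfc') {
    return mode === 'compose' ? 'NFC' : 'NFD';
  }
  return mode === 'compose' ? 'NFKC' : 'NFKD';
}

// Stand-in for the filter, built on the standard String.prototype.normalize.
function normalizeWith(text: string, name?: FormName, mode?: Mode): string {
  return text.normalize(resolveForm(name, mode));
}
```

With no options, `resolveForm()` yields 'NFKC', matching the documented defaults.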

Examples

typescript
// Default options (NFKC)
const filter = new ICUNormalizer();
// Equivalent, with the defaults spelled out
const explicit = new ICUNormalizer({ name: 'nfkc', mode: 'compose' });

// Compatibility characters are replaced with their standard equivalents
filter.apply('ﬁle'); // 'file' (ﬁ ligature, U+FB01, expanded to "fi")
filter.apply('½'); // '1⁄2' (fraction expanded; the slash is U+2044 FRACTION SLASH)
filter.apply('²'); // '2' (superscript)
filter.apply('ＡＢＣ'); // 'ABC' (full-width to half-width)

NFC (Canonical Composition)

typescript
const filter = new ICUNormalizer({ name: 'nfc', mode: 'compose' });

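// Combines decomposed characters (inputs written with escapes so the combining marks are visible)
filter.apply('cafe\u0301'); // 'café' (e + U+0301 composed into é, U+00E9)
filter.apply('nai\u0308ve'); // 'naïve' (i + U+0308 composed into ï, U+00EF)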

NFD (Canonical Decomposition)

typescript
const filter = new ICUNormalizer({ name: 'nfc', mode: 'decompose' });

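// Decomposes precomposed characters (outputs written with escapes so the combining marks are visible)
filter.apply('caf\u00E9'); // 'cafe\u0301' (é as e + combining acute, U+0301)
filter.apply('na\u00EFve'); // 'nai\u0308ve' (ï as i + combining diaeresis, U+0308)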
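
Because NFC and NFD output usually render identically, it can help to inspect code points when checking what a filter produced. The sketch below uses the standard `String.prototype.normalize` as a stand-in for the filter (an assumption, not the library's implementation), and `codePoints` is a hypothetical helper.

```typescript
// Hypothetical helper: lists the code points of a string as U+XXXX labels.
function codePoints(text: string): string[] {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0'),
  );
}

// Stand-in for the filter: the built-in normalizer.
const composed = 'caf\u00E9'.normalize('NFC');
const decomposed = 'caf\u00E9'.normalize('NFD');

console.log(codePoints(composed).join(' '));   // U+0063 U+0061 U+0066 U+00E9
console.log(codePoints(decomposed).join(' ')); // U+0063 U+0061 U+0066 U+0065 U+0301
console.log(composed === decomposed);          // false
```

The two strings look the same on screen but compare as unequal, which is why index-time and query-time text should go through the same normalization form.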

NFKD (Compatibility Decomposition)

typescript
const filter = new ICUNormalizer({ name: 'nfkc', mode: 'decompose' });

filter.apply('²'); // '2' (superscript replaced with the plain digit)
filter.apply('ﬁ'); // 'fi' (ﬁ ligature, U+FB01, decomposed into "f" + "i")

Best For

  • NFKC: General search applications (normalizes ligatures, superscripts, full-width chars)
  • NFC: Text storage and display (canonical representation)
  • NFD: Accent-insensitive search (combine with accent stripping)
  • NFKD: Maximum normalization (compatibility + decomposition)
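
The "accent stripping" mentioned for NFD can be sketched as follows: decompose first, then drop the combining marks. This is a minimal illustration using the standard `String.prototype.normalize` rather than the filter itself. `stripAccents` is a hypothetical helper, and the character range covers only the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption that suits Latin-script text.

```typescript
// Hypothetical helper for illustration; not part of dynamosearch.
function stripAccents(text: string): string {
  return text
    .normalize('NFD')                 // é becomes e + U+0301
    .replace(/[\u0300-\u036f]/g, ''); // remove the combining marks
}

console.log(stripAccents('caf\u00E9 na\u00EFve')); // cafe naive
```

In an analyzer, the stripping step would more naturally live in a token filter that runs after an NFD char filter.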

Use with Analyzers

typescript
import Analyzer from 'dynamosearch/analyzers/Analyzer';
import ICUNormalizer from 'dynamosearch/char_filters/ICUNormalizer';
import StandardTokenizer from 'dynamosearch/tokenizers/StandardTokenizer';
import LowerCaseFilter from 'dynamosearch/filters/LowerCaseFilter';

class NormalizedAnalyzer extends Analyzer {
  constructor() {
    super({
      charFilters: [new ICUNormalizer()],
      tokenizer: new StandardTokenizer(),
      filters: [new LowerCaseFilter()],
    });
  }
}

const analyzer = new NormalizedAnalyzer();
const tokens = await analyzer.analyze('ﬁle café');
// The char filter normalizes the text (for example, the ﬁ ligature becomes "fi") before the tokenizer runs

Released under the MIT License.