UnicodeNormalize / String Layer

Normalize Unicode strings to a standardized form. Similar to Python's unicodedata.normalize() or Java's Normalizer class. Ensures consistent representation of characters that can be written in multiple ways.

Common applications:

  • Text search optimization
  • Database indexing
  • Multilingual text processing
  • Character matching
  • Data cleaning
  • Cross-platform compatibility
  • International data standardization

Example transformations:

  • Combining marks: e + ◌́ (two code points) → é (single character, with NFC/NFKC)
  • Ligatures: ﬃ → ffi (with NFKC/NFKD)
  • Special characters: ⁵ → 5 (with NFKC/NFKD)
  • Width variations: ２０２４ → 2024 (with NFKC/NFKD)
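
The transformations listed above can be reproduced with Python's standard unicodedata module. This is only an illustrative sketch: it assumes unicodedata.normalize() behaves like this operation for these inputs, and the samples list is invented for the example.

```python
import unicodedata

# Each entry: (input text, form, expected result), mirroring the list above.
samples = [
    ("e\u0301", "NFC", "\u00e9"),                  # e + combining acute -> precomposed é
    ("\ufb03", "NFKC", "ffi"),                     # ffi ligature -> three letters
    ("\u2075", "NFKC", "5"),                       # superscript five -> digit 5
    ("\uff12\uff10\uff12\uff14", "NFKC", "2024"),  # fullwidth digits -> ASCII digits
]

for text, form, expected in samples:
    result = unicodedata.normalize(form, text)
    print(f"{form}: {text!r} -> {result!r} (as expected: {result == expected})")
```

Note that the last three inputs are left unchanged by NFC and NFD, because they differ from their replacements only by compatibility, not canonical, equivalence.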

Select

column

The string column to normalize. Handles various Unicode text including:

  • Accented characters (é, ñ, ü)
  • Combined characters (한글, འབྲུག་ཡུལ)
  • Special formats (superscripts, fullwidth)
  • Compatibility characters (ﬃ, ℯ, ２)

The normalization form to apply (UnicodeForm). Choose based on your needs:

  • NFC: General purpose, maintains appearance
  • NFKC: Search and comparison, standardizes variants
  • NFD: Analysis and processing, separates components
  • NFKD: Maximum compatibility, thorough decomposition
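
To see how the four forms differ on a single input, the following sketch applies each one to the same string and prints the resulting code points. It uses Python's unicodedata module as a stand-in for this operation; the helper all_forms and the sample text are assumptions made for illustration.

```python
import unicodedata

# "fiancé" written with an fi ligature (U+FB01) and a precomposed é (U+00E9).
text = "\ufb01anc\u00e9"

def all_forms(s):
    """Return the input normalized under each of the four forms."""
    return {form: unicodedata.normalize(form, s)
            for form in ("NFC", "NFKC", "NFD", "NFKD")}

for form, value in all_forms(text).items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in value)
    print(f"{form:5} len={len(value)}  {codepoints}")
```

For this input the lengths are 5 (NFC), 6 (NFKC), 6 (NFD), and 7 (NFKD): the K forms expand the ligature, and the D forms split é into e plus a combining accent.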
NFC ~

Canonical Decomposition followed by Canonical Composition (NFC). Usually the most compact form and recommended for general use. Characteristics:

  • Combines characters: e + ◌́ → é
  • Maintains visual equivalence
  • Preserves semantic meaning
  • Most compatible with legacy systems
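
A short sketch of why NFC matters for equality checks, assuming Python's unicodedata module as a stand-in for this operation. Two strings that render identically compare unequal until both are normalized.

```python
import unicodedata

composed = "caf\u00e9"      # é stored as one code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(composed == decomposed)  # False: different code point sequences

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)          # True: both are now the composed form
print(len(decomposed), len(nfc_b))
```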
NFKC ~

Compatibility Decomposition followed by Canonical Composition (NFKC). More aggressive normalization, best for search/comparison. Examples:

  • Converts special characters: ℯ (script small e) → e
  • Standardizes presentation forms
  • Resolves compatibility characters
  • May lose styling information
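
One common way to use NFKC for search and comparison is to build a matching key. The sketch below is an assumption about how such a key might be built in Python, not a description of this operation's internals; search_key is a hypothetical helper, and the added casefold() step goes beyond what NFKC itself does.

```python
import unicodedata

def search_key(s):
    """Hypothetical helper: fold compatibility variants, then case."""
    return unicodedata.normalize("NFKC", s).casefold()

# Fullwidth "File", "file" with an fi ligature, and plain uppercase "FILE".
variants = ["\uff26\uff49\uff4c\uff45", "\ufb01le", "FILE"]
keys = {search_key(v) for v in variants}
print(keys)

# Styling information is lost: superscript two becomes a plain digit.
print(unicodedata.normalize("NFKC", "x\u00b2"))
```

All three variants collapse to the same key, which is the desired effect for matching, while the second print shows the kind of styling loss noted above (x² becomes x2).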
NFD ~

Canonical Decomposition (NFD). Decomposes characters into their constituent parts. Useful for certain algorithms. Examples:

  • Splits combined characters: é → e + ◌́
  • Maintains reversibility: applying NFC afterwards recomposes the text
  • Preserves character relationships
  • Good for detailed text analysis
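
A typical algorithm that relies on NFD is removing diacritics: decompose first, then drop the combining marks. The sketch below assumes Python's unicodedata module as a stand-in for this operation; strip_marks is a hypothetical helper and not part of this operation.

```python
import unicodedata

def strip_marks(s):
    """Hypothetical helper: decompose with NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    # combining() returns a non-zero class for combining marks such as accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks("\u00e9 \u00f1 \u00fc"))  # é ñ ü -> "e n u"

# Hangul syllables decompose into their jamo and recompose under NFC.
syllable = "\ud55c"  # 한
jamo = unicodedata.normalize("NFD", syllable)
print(len(jamo), unicodedata.normalize("NFC", jamo) == syllable)
```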
NFKD ~

Compatibility Decomposition (NFKD). Most thorough decomposition, best for maximum compatibility. Examples:

  • Full character decomposition
  • Converts all compatibility characters
  • Maximum interoperability
  • May significantly alter appearance
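
The following sketch shows a few characters whose appearance changes noticeably under NFKD. It uses Python's unicodedata module as a stand-in for this operation, and the samples mapping is made up for the example.

```python
import unicodedata

# Compatibility characters and the text NFKD replaces them with.
samples = {
    "\u2460": "1",          # circled digit one
    "\u338f": "kg",         # square kg sign
    "\u2167": "VIII",       # roman numeral eight
    "\u00bd": "1\u20442",   # one half -> 1, fraction slash, 2
}

for original, expected in samples.items():
    result = unicodedata.normalize("NFKD", original)
    print(f"{original!r} -> {result!r} (as expected: {result == expected})")
```

Because these replacements cannot be undone, it may be safer to write the result to a new column rather than overwrite the source text.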

Name for the new column. If not provided, the system generates a unique name. If AsColumn matches an existing column, the existing column is replaced. The name should follow valid column naming conventions.
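
The output-column behavior described above can be sketched over a list of row dictionaries. Everything in this example is an assumption made for illustration: the function unicode_normalize, the row representation, the generated-name scheme (column name plus form suffix), and the pass-through of None values are hypothetical and may not match what the system actually does.

```python
import unicodedata

def unicode_normalize(rows, column, form, as_column=None):
    """Return new rows with `column` normalized into `as_column`."""
    existing = set().union(*(row.keys() for row in rows))
    if as_column is None:
        # Hypothetical naming scheme; the real generated names may differ.
        base = f"{column}_{form.lower()}"
        as_column, n = base, 1
        while as_column in existing:
            as_column = f"{base}_{n}"
            n += 1
    out = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        value = row[column]
        # Assumption: missing values pass through unchanged.
        new_row[as_column] = (None if value is None
                              else unicodedata.normalize(form, value))
        out.append(new_row)
    return out

rows = [{"name": "cafe\u0301"}, {"name": None}]
added = unicode_normalize(rows, "name", "NFC")                       # new column
replaced = unicode_normalize(rows, "name", "NFC", as_column="name")  # overwrite
print(added)
print(replaced)
```

In the first call a new column is added next to the original; in the second, the name matches an existing column, so that column's values are replaced.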