UnicodeNormalize / String Layer
Normalizes Unicode strings to one of the standard Unicode normalization forms (NFC, NFD, NFKC, NFKD). Similar to Python's unicodedata.normalize() or Java's java.text.Normalizer class. Ensures a consistent representation of characters that can be encoded in more than one way.
Common applications:
- Text search optimization
- Database indexing
- Multilingual text processing
- Character matching
- Data cleaning
- Cross-platform compatibility
- International data standardization
Example transformations:
- Combining marks: e + ◌́ (U+0065 U+0301) → é (U+00E9, single character; with NFC/NFKC)
- Ligatures: ﬃ (U+FB03) → ffi (with NFKC/NFKD)
- Special characters: ⁵ (U+2075) → 5 (with NFKC/NFKD)
- Width variations: ２０２４ (fullwidth digits) → 2024 (with NFKC/NFKD)
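The transformations listed above can be reproduced with Python's standard unicodedata module, which the description names as a comparable API. This is a minimal sketch, not the layer's own implementation; the inputs are written as escapes so the exact code points are unambiguous.

```python
import unicodedata

# (input, form) pairs mirroring the example transformations.
samples = [
    ("e\u0301", "NFC"),                     # e + combining acute accent
    ("\ufb03", "NFKC"),                     # "ffi" ligature (U+FB03)
    ("\u2075", "NFKC"),                     # superscript five
    ("\uff12\uff10\uff12\uff14", "NFKC"),   # fullwidth digits 2024
]

results = [unicodedata.normalize(form, text) for text, form in samples]

for (text, form), out in zip(samples, results):
    print(f"{form}: {text!r} -> {out!r}")
```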
Select
column: The string column to normalize. Handles a range of Unicode text, including:
- Accented characters (é, ñ, ü)
- Combined characters (한글, འབྲུག་ཡུལ)
- Special formats (superscripts, fullwidth)
- Compatibility characters (ﬃ, ℯ, ２)
UnicodeForm
enum: The normalization form to apply (UnicodeForm). Choose based on your needs:
- NFC: General purpose, maintains appearance
- NFKC: Search and comparison, standardizes variants
- NFD: Analysis and processing, separates components
- NFKD: Maximum compatibility, thorough decomposition
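To see how the four forms differ on the same input, the sketch below applies each one to a short string containing both a ligature and a precomposed accented letter. It uses Python's unicodedata module as a stand-in for the layer; the sample string is an illustrative assumption.

```python
import unicodedata

# "fi" ligature (U+FB01) + "anc" + precomposed e-acute (U+00E9)
text = "\ufb01anc\u00e9"

forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in value)
    print(f"{form:4} len={len(value)} {codepoints}")
```

The canonical forms (NFC, NFD) keep the ligature and only change how the accent is stored; the compatibility forms (NFKC, NFKD) also expand the ligature into separate letters.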
Canonical Decomposition followed by Canonical Composition (NFC). Most compact form, recommended for general use. Examples:
- Combines characters: e + ◌́ → é
- Maintains visual equivalence
- Preserves semantic meaning
- Generally the form most compatible with legacy systems, which tend to expect precomposed characters
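A common reason to apply NFC is that two strings can look identical yet compare unequal. The following sketch, using Python's unicodedata module rather than the layer itself, shows the comparison before and after normalization. unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

composed = "caf\u00e9"      # precomposed e-acute
decomposed = "cafe\u0301"   # e followed by combining acute accent

# The raw strings differ at the code point level.
raw_equal = composed == decomposed

# After NFC both strings use the precomposed character.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# is_normalized reports whether a string is already in the given form.
already_nfc = unicodedata.is_normalized("NFC", composed)
needs_nfc = not unicodedata.is_normalized("NFC", decomposed)

print(raw_equal, nfc_equal, already_nfc, needs_nfc)
```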
Compatibility Decomposition followed by Canonical Composition (NFKC). More aggressive normalization, best for search/comparison. Examples:
- Converts special characters: ℯ (U+212F, script small e) → e
- Standardizes presentation forms
- Resolves compatibility characters
- May lose styling information
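For search and comparison, NFKC is often combined with case folding to build a matching key. The helper name search_key below is hypothetical, and the case-folding step is an added assumption rather than something this layer is documented to do. The second example shows the styling loss mentioned above.

```python
import unicodedata

def search_key(text):
    """Hypothetical helper: NFKC-normalize, then case-fold for matching."""
    return unicodedata.normalize("NFKC", text).casefold()

fullwidth = "\uff28\uff45\uff4c\uff4c\uff4f"  # "Hello" in fullwidth letters
plain = "hello"

keys_match = search_key(fullwidth) == search_key(plain)

# Superscript two becomes an ordinary digit, so "x squared" turns into "x2".
squared = unicodedata.normalize("NFKC", "x\u00b2")

print(keys_match, squared)
```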
Canonical Decomposition (NFD). Decomposes characters into their constituent parts. Useful for certain algorithms. Examples:
- Splits combined characters: é → e + ◌́
- Reversible with respect to NFC: recomposing NFD output yields the NFC form of the original text
- Preserves character relationships
- Good for detailed text analysis
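Because NFD separates base characters from their combining marks, the marks can be counted or removed individually. The sketch below uses Python's unicodedata module; strip_marks is a hypothetical helper, not part of this layer. It also shows a Hangul syllable decomposing into jamo and recomposing under NFC.

```python
import unicodedata

def strip_marks(text):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "r\u00e9sum\u00e9"  # two precomposed e-acute characters
nfd = unicodedata.normalize("NFD", word)

# unicodedata.combining returns a non-zero class for combining marks.
mark_count = sum(1 for ch in nfd if unicodedata.combining(ch))
stripped = strip_marks(word)

# A precomposed Hangul syllable (U+D55C) splits into three jamo under NFD.
hangul = "\ud55c"
jamo = unicodedata.normalize("NFD", hangul)
roundtrip = unicodedata.normalize("NFC", jamo)

print(mark_count, stripped, len(jamo), roundtrip == hangul)
```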
Compatibility Decomposition (NFKD). Most thorough decomposition, best for maximum compatibility. Examples:
- Full character decomposition
- Converts all compatibility characters
- Maximum interoperability
- May significantly alter appearance
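The change in appearance under NFKD is easiest to see on symbols whose compatibility decomposition is plain text. This is a small sketch using Python's unicodedata module; the chosen sample characters are illustrative assumptions.

```python
import unicodedata

samples = {
    "\u2460": "circled digit one",
    "\u00bd": "vulgar fraction one half",
    "\u2122": "trade mark sign",
}

decomposed = {ch: unicodedata.normalize("NFKD", ch) for ch in samples}

for ch, label in samples.items():
    out = decomposed[ch]
    codepoints = " ".join(f"U+{ord(c):04X}" for c in out)
    print(f"{label}: {ch!r} -> {out!r} ({codepoints})")
```

Note that the fraction decomposes to digits around U+2044 (fraction slash), not the ASCII slash, so the result is still not pure ASCII.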
AsColumn
name: Name for the new column. If not provided, the system generates a unique name. If AsColumn matches an existing column, the existing column is replaced. The name should follow valid column naming conventions.
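The add-or-replace behavior of AsColumn can be sketched on a plain dictionary of columns. Everything here is a hypothetical stand-in: the function name unicode_normalize, the dict-of-lists table, and especially the generated-name scheme are assumptions, since the actual naming rule is not specified above.

```python
import unicodedata

def unicode_normalize(table, column, form="NFC", as_column=None):
    """Hypothetical stand-in for the layer; table maps column names to lists of strings."""
    if as_column is None:
        # Assumed naming scheme; the real generated name is not documented here.
        base = f"{column}_{form.lower()}"
        as_column, suffix = base, 1
        while as_column in table:
            as_column = f"{base}_{suffix}"
            suffix += 1
    result = dict(table)  # shallow copy so the input table is left unchanged
    # Assigning to an existing key replaces that column; a new key adds one.
    result[as_column] = [unicodedata.normalize(form, v) for v in table[column]]
    return result

table = {"name": ["cafe\u0301", "\uff21\uff22"]}

replaced = unicode_normalize(table, "name", "NFC", as_column="name")
added = unicode_normalize(table, "name", "NFKC")

print(sorted(replaced), sorted(added))
```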