perl - Unicode::Normalize - query about the 'Normalization Form'
#!/usr/local/bin/perl
use warnings;
use 5.014;
use Unicode::Normalize qw(NFD NFC compose);

my $string1 = "\x{f5}";

my $nfd_string1 = NFD( $string1 );
# PV = 0x831150 "o\314\203"\0 [UTF8 "o\x{303}"] *

my $composed_nfd_string1 = compose( $nfd_string1 );
# PV = 0x77bc40 "\303\265"\0 [UTF8 "\x{f5}"] *

my $nfc_string1 = NFC( $string1 );
# PV = 0x836e30 "\303\265"\0 [UTF8 "\x{f5}"] *

my $string2 = "o\x{303}";

my $nfd_string2 = NFD( $string2 );
# PV = 0x780da0 "o\314\203"\0 [UTF8 "o\x{303}"] *

my $composed_nfd_string2 = compose( $nfd_string2 );
# PV = 0x782dc0 "\303\265"\0 [UTF8 "\x{f5}"] *

my $nfc_string2 = NFC( $string2 );
# PV = 0x7acba0 "\303\265"\0 [UTF8 "\x{f5}"] *

# * Devel::Peek::Dump output

say 'ok' if $nfd_string1 eq $nfd_string2;
say 'ok' if $nfc_string1 eq $nfc_string2;
output:
ok
ok
After trying this, I asked myself: is there a reason to use Normalization Form D instead of Normalization Form C?
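Not everything has a composed form, and NFC does NFD first. Part of NFD is putting the continuation characters in order after the starter character, so that you can compare two grapheme clusters (the fancy name for a starter along with its continuation characters) to see if they are the same. If you are only doing what you do in your example, you should get the same answers either way, but NFC is more work.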
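To make the "not everything has a composed form" point concrete, here is a minimal sketch. The q-with-tilde string is my own illustration, not something from the question; it assumes only the NFD and NFC functions that Unicode::Normalize exports.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# "o" + COMBINING TILDE has a precomposed form (U+00F5).
# "q" + COMBINING TILDE has none, so NFC has nothing to compose it into.
my $o_tilde = "o\x{303}";
my $q_tilde = "q\x{303}";

# Count code points after each normalization.
my $o_nfc_len = length NFC($o_tilde);
my $q_nfc_len = length NFC($q_tilde);
my $q_nfd_len = length NFD($q_tilde);

# \X matches one grapheme cluster, however many code points it holds.
my $q_clusters = () = NFC($q_tilde) =~ /\X/g;

printf "o + tilde after NFC: %d code point(s)\n", $o_nfc_len;
printf "q + tilde after NFC: %d code point(s)\n", $q_nfc_len;
printf "q + tilde after NFD: %d code point(s)\n", $q_nfd_len;
printf "q + tilde grapheme clusters: %d\n",      $q_clusters;
```

For the second string, NFC and NFD give the same result, which is why comparing in NFD loses nothing here.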
There are a couple of reasons some things don't have a special NFC version. Many of the ones that do came from historical character sets. The composed version of é is there to make the Latin-1 people happy. There are also the e and ´ versions, which are designed to let you build the grapheme on your own. There are many ways to do that, and it's not just accents and diacriticals. Grapheme clusters can have several continuation characters, and if you build them yourself, you can put them in any order (for whatever reason). However, they have assigned weights. NFD will reorder them by those weights so you can compare two grapheme clusters despite the order you used.
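The reordering can be seen with a small sketch. The test strings are my own choice (a q with a dot above and a dot below, which has no precomposed form), and it assumes nothing beyond NFD from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# The same grapheme cluster typed with its marks in two different orders.
# U+0307 is COMBINING DOT ABOVE, U+0323 is COMBINING DOT BELOW.
my $above_first = "q\x{307}\x{323}";
my $below_first = "q\x{323}\x{307}";

# The raw strings differ, but their NFD forms do not.
my $raw_equal = ( $above_first eq $below_first )           ? 1 : 0;
my $nfd_equal = ( NFD($above_first) eq NFD($below_first) ) ? 1 : 0;

# List the code points NFD settles on, whichever order was typed.
my $nfd_code_points = join ' ',
    map { sprintf( 'U+%04X', ord $_ ) } split //, NFD($above_first);

print "raw strings equal: $raw_equal\n";
print "NFD strings equal: $nfd_equal\n";
print "NFD order: $nfd_code_points\n";
```

The dot below sorts first because its weight is lower than the dot above's.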
It's all in Unicode Technical Report 15 (now Unicode Standard Annex #15), as daxim said in his comment. You'll want to see the diagrams and read around the part that says:
"Once a string has been decomposed, any sequences of combining marks that it contains are put into a well-defined order. This rearrangement of combining marks is done according to a subpart of the Unicode Normalization Algorithm known as the Canonical Ordering Algorithm. That algorithm sorts sequences of combining marks based on the value of their Canonical_Combining_Class (ccc) property, whose values are also defined in UnicodeData.txt. Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters."
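If you want to look at those values yourself, Unicode::Normalize can export getCombinClass and reorder on request. A rough sketch, where is_starter is a hypothetical helper of mine and the sample characters are only for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass reorder);

# Canonical_Combining_Class for a base letter and two combining marks.
my $ccc_o         = getCombinClass(0x006F);   # LATIN SMALL LETTER O
my $ccc_dot_below = getCombinClass(0x0323);   # COMBINING DOT BELOW
my $ccc_tilde     = getCombinClass(0x0303);   # COMBINING TILDE

# Hypothetical helper: a starter is any character whose ccc is zero.
sub is_starter { return getCombinClass( ord $_[0] ) == 0 ? 1 : 0 }

# reorder() applies only the canonical ordering step: the marks after
# the starter are sorted by ccc, so the dot below moves before the tilde.
my $reordered       = reorder("o\x{303}\x{323}");
my $reordered_order = join ' ',
    map { sprintf( 'U+%04X', ord $_ ) } split //, $reordered;

print "ccc(o) = $ccc_o, ccc(U+0323) = $ccc_dot_below, ccc(U+0303) = $ccc_tilde\n";
print "o is a starter: ",      is_starter('o'),        "\n";
print "U+0303 is a starter: ", is_starter("\x{303}"), "\n";
print "reordered: $reordered_order\n";
```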
Some things explicitly use NFD for their data, such as the HFS+ file system. It doesn't matter in many cases, because the programming language binds to library functions that transform the filename strings to the right form.
Sometime later today I'll be uploading Unicode::Support, which demonstrates many of these things.
And, later today, Tom will come along and school us all. :)