
Node display problem: I really wanted to post this earlier, but had a hard time getting past how PerlMonks obliterates UTF-8 literals within code tags. Specifically, the line starting with my $string = "... should contain a string of the following characters: GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER ALPHA WITH OXIA, GREEK SMALL LETTER ALPHA WITH VARIA, GREEK SMALL LETTER OMICRON WITH VARIA, GREEK SMALL LETTER FINAL SIGMA. You'll have to paste them into the code yourself (sorry).

In other words, line 21 should look like this: my $string = "λαάὰὸς";, but that can't be displayed within code tags.

I believe the example code below answers both of your questions:

# If using a Perl version prior to v5.16, comment out the "use feature" line,
# and uncomment the BEGIN{...} block.

use feature ':5.16';

#BEGIN {
#  die "Must install Unicode::CaseFold." if ! eval "use Unicode::CaseFold; 1;";
#}

use strict;
use warnings FATAL => 'utf8';
use utf8;
use charnames ':full';

use Unicode::Normalize qw(NFD NFC);


binmode STDOUT, ':encoding(UTF-8)';


my $string = "&#955;&#945;&#8049;&#8048;&#8056;&#962;";

while ( $string =~ m/(?<grapheme>\X)/g ) {
  my $grapheme  = $+{grapheme};
  print explain( $grapheme ), "\n";
}

sub explain {
  my $grapheme = shift;
  my %pri = decompose( $grapheme );
  my %base = decompose( $pri{base} );
  my $output = <<"END_OUTPUT";
Grapheme:($grapheme)
    Dec, Hex, Name:           [$pri{cp}], [$pri{hex_str}], '$pri{name}'
    Case: (Fold,Lower,Upper): ($pri{fc}), ($pri{lc}), ($pri{uc})
    Grapheme Base:            ($pri{base}), [$base{hex_str}], '$base{name}'
END_OUTPUT
  foreach my $extend ( @{$pri{comb}} ) {
    my %ext = decompose( $extend );
    my $grapheme = fc $ext{grapheme};
    $output .= <<"END_OUTPUT";
    Combining Mark: ($grapheme )
        Dec, Hex, Name: [$ext{cp}], [$ext{hex_str}], '$ext{name}'
END_OUTPUT
  }
  return $output;
}

sub decompose {
  my $grapheme = shift;
  my $decomp   = NFD( $grapheme );
  my $cp       = ord $grapheme;
  my ( $base ) = substr($decomp, 0, 1 );
  my ( @comb ) = map { substr $decomp, $_, 1 } 1 .. length($decomp)-1;
  return (
    grapheme => $grapheme,
    cp       => $cp,
    hex_str  => sprintf( "%#0.4x", $cp ),
    name     => charnames::viacode( $cp ),
    lc       => lc $grapheme,
    uc       => uc $grapheme,
    fc       => fc $grapheme,
    base     => $base,
    comb     => [ @comb ],
  );
}

I won't post the output, as the Monastery seems to trash the target graphemes within code tags. For those without the ambition to run it, it will display the grapheme, its code point and name, and then the decomposed base and combining characters' graphemes, code points, and names.

The first question you're asking can be answered by matching each grapheme cluster with \X, obtaining its code point with ord, and then calling charnames::viacode on it.
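Here is a stripped-down sketch of just that step. It writes the sample word with \x{...} escapes so that no UTF-8 literal has to survive the code tags. The helper name grapheme_names is only for illustration, and it names just the first code point of each cluster, which is enough for precomposed characters like these.

```perl
use strict;
use warnings;
use charnames ':full';

# Return the Unicode name of the first code point of every grapheme
# cluster in $string (a precomposed character has exactly one).
sub grapheme_names {
    my ($string) = @_;
    my @names;
    while ( $string =~ m/(\X)/g ) {
        push @names, charnames::viacode( ord($1) );
    }
    return @names;
}

# The sample word, spelled with escapes instead of literal characters.
my $word = "\x{3BB}\x{3B1}\x{1F71}\x{1F70}\x{1F78}\x{3C2}";
print "$_\n" for grapheme_names($word);
```

The names are plain ASCII, so this one doesn't need binmode on STDOUT.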

The second question you're asking deals with decomposing the grapheme. Unicode::Normalize provides NFD, which stands for Normalization Form D (formed by canonical decomposition). This function decomposes each grapheme into its base character followed by its combining marks, and it places the marks in a reliable (canonical) order too. substr and length then treat the decomposed string as having a length equal to the number of base characters plus the number of combining marks.
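To see the length change in isolation, here is a minimal sketch using a single precomposed character. The helper name split_grapheme is made up for this example, and it assumes the first character of the NFD result is the base.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Split one grapheme into (base, combining marks...) using NFD.
sub split_grapheme {
    my ($grapheme) = @_;
    my ( $base, @marks ) = split //, NFD($grapheme);
    return ( $base, @marks );
}

my $alpha_varia = "\x{1F70}";    # GREEK SMALL LETTER ALPHA WITH VARIA
my ( $base, @marks ) = split_grapheme($alpha_varia);

# One character before decomposition, two after (base plus one mark).
printf "length before NFD: %d, after: %d\n",
    length($alpha_varia), length( NFD($alpha_varia) );
printf "base U+%04X, marks: %s\n",
    ord($base), join( ' ', map { sprintf 'U+%04X', ord($_) } @marks );
```

Here the base comes back as U+03B1 (alpha) and the single mark as U+0300 (combining grave accent).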

If the goal is just to compare the base code points, you should probably be using Unicode::Collate at level 1: "alphabetic ordering". The next level adds "diacritic ordering", followed by "case ordering" and finally "tie-breaking"; each level includes the ones before it.
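A rough sketch of that comparison, assuming the default collation data that ships with Unicode::Collate. The helper name same_base_letters is only for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Level 1 compares base letters only; level 2 also looks at diacritics.
my $primary   = Unicode::Collate->new( level => 1 );
my $secondary = Unicode::Collate->new( level => 2 );

# True (1) when both strings match once diacritics and case are ignored.
sub same_base_letters {
    my ( $left, $right ) = @_;
    return $primary->eq( $left, $right ) ? 1 : 0;
}

my $alpha       = "\x{3B1}";     # GREEK SMALL LETTER ALPHA
my $alpha_varia = "\x{1F70}";    # GREEK SMALL LETTER ALPHA WITH VARIA

print "level 1 equal: ", same_base_letters( $alpha, $alpha_varia ), "\n";
print "level 2 equal: ", ( $secondary->eq( $alpha, $alpha_varia ) ? 1 : 0 ), "\n";
```

At level 1 the two strings should compare equal, while at level 2 the varia makes them differ. Note that level 1 ignores case as well, so it is looser than a strict comparison of base code points.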

Dave

In reply to Re: getting Unicode character names from string by davido
in thread getting Unicode character names from string by csthflk
