frasco has asked for the wisdom of the Perl Monks concerning the following question:

While retrieving data from my database and formatting the output as an HTML page, I ran into the following problem. My fields contain both uppercase and lowercase characters, but I need the lowercase characters to appear in italics and the uppercase characters to appear as lowercase.
That is: lowercase => lowercase italics, and uppercase => normal lowercase.
I tried a regex, but problems arise because my charset is UTF-8.

Replies are listed 'Best First'.
Re: formatting my html output
by Fletch (Bishop) on May 02, 2008 at 13:31 UTC

    Well it's obvious the problem is on line 17. See How (Not) To Ask A Question.

    Also look at [:upper:] and [:lower:] in perlre which should be locale-aware given the proper setup.

    The cake is a lie.

Re: formatting my html output
by almut (Canon) on May 02, 2008 at 13:44 UTC

    Could you post some sample data together with the code that you've tried? What you need to do should in principle also work with Unicode/UTF-8.

    For example, you can use \p{Lu} to match a character with the Unicode property "Letter, uppercase" (for a detailed list see perlunicode, in particular the section "Effects of Character Semantics"). Uppercasing and lowercasing should work as well...

    Update: to avoid unnecessary confusion, it's maybe worth mentioning that for a number of \p{...} expressions, there's the alternative [[:...:]] form. E.g. [[:upper:]] is the same as \p{IsUpper}. The \p{...} style is the more generic form, i.e. not all \p{...} expressions have a [[:...:]] form. BTW, the "Is" prefix is optional, and you can use short or long forms. For example, \p{IsLu} is equivalent to \p{Lu} or \p{UppercaseLetter} or \p{IsUppercaseLetter}.
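
A minimal sketch of the property matching described above. The helper name letter_case is hypothetical, and the characters are written as \x{...} escapes so the example does not depend on the encoding of the source file:

```perl
use strict;
use warnings;

# Hypothetical helper: classify one character by its Unicode general category.
sub letter_case {
    my ($ch) = @_;
    return 'upper' if $ch =~ /\A\p{Lu}\z/;   # "Letter, uppercase"
    return 'lower' if $ch =~ /\A\p{Ll}\z/;   # "Letter, lowercase"
    return 'other';
}

# \x{da} is U+00DA and \x{e0} is U+00E0; plain A-Z / a-z ranges would miss both.
for my $ch ("T", "\x{da}", "a", "\x{e0}", "-", "2") {
    printf "U+%04X => %s\n", ord($ch), letter_case($ch);
}
```

Digits and punctuation fall into 'other', so a substitution built on \p{Ll} and \p{Lu} leaves them untouched.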

Re: formatting my html output
by mr_mischief (Monsignor) on May 02, 2008 at 13:55 UTC
    Do [[:lower:]] and [[:upper:]] not work on UTF8 text? I got the impression from perlre that they do. What I read in perlretut just now seems to reinforce this. Is there some problem in the docs?

    Maybe some code that's not working would help us find your problem so we can help you turn it into working code. Miss Cleo seems to be on vacation.
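
One possible cause, offered only as an assumption since no failing code has been posted: if the database hands back raw UTF-8 bytes rather than decoded character strings, the regex sees each byte as a separate character. A small sketch using the core Encode module (count_lower is a hypothetical helper):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count characters with the property "Letter, lowercase".
sub count_lower {
    my ($s) = @_;
    my $n = () = $s =~ /\p{Ll}/g;
    return $n;
}

my $bytes = "\xC3\xA0";               # raw UTF-8 bytes encoding U+00E0
my $chars = decode('UTF-8', $bytes);  # decoded: a single character

printf "bytes: length %d, lowercase letters %d\n",
    length($bytes), count_lower($bytes);
printf "chars: length %d, lowercase letters %d\n",
    length($chars), count_lower($chars);
```

The undecoded string has length 2 and contains no lowercase letter, while the decoded one has length 1 and matches \p{Ll}. Whether this applies here depends on how the database driver is configured.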

Re: formatting my html output
by frasco (Beadle) on May 02, 2008 at 14:04 UTC
    Sorry for my poor style... but I'm new and confused! Anyway, fetching data from the db (using Unicode modules) I get something like this:
     ʾà-da-um-=TÚG-:2 1 AKTUM-=TÚG 
    (Don't worry, it is a dead language, but intelligible.) If I use a regex with a-z, it of course doesn't match the small ʾ (or the accented vowels).

      Here's a rough sketch of how you might go about doing it:

      use strict;
      use warnings;

      # your sample string
      my $orig = "\x{2be}\x{e0}-da-um-=T\x{da}G-:2 1 AKTUM-=T\x{da}G";

      my $s = $orig;
      $s =~ s/(\p{Ll}+)/<i>$1<\/i>/g;   # lower --> italic
      $s =~ s/(\p{Lu}+)/lc($1)/ge;      # upper --> lower

      # write the result as UTF-8 so the browser displays it correctly
      open my $fh, ">:utf8", "sample.html" or die $!;
      print $fh qq|<html>
      <head>
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      </head>
      <body>
      $orig<br />
      $s
      </body>
      </html>
      |;
      close $fh;

      Then load sample.html in your browser; the second line should be the modified string. Except for the ʾ, it appears to work. I'm not sure what the ʾ (\x{2be}) is. It doesn't seem to be treated as a lowercase character (the Unicode database lists it among the "Spacing Modifier Letters")... I'm afraid you'll have to figure that one out yourself :)
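
One way to pick up the ʾ as well, assuming it should be italicised along with the lowercase letters: U+02BE has the general category "Letter, modifier" (\p{Lm}), so that property can be added to the character class. The function name format_translit is hypothetical, and whether every \p{Lm} character in the data should be italicised is an assumption to check against real records:

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # make lc() use Unicode rules on every string

# Hypothetical helper: italicise lowercase runs (treating modifier letters
# such as U+02BE as lowercase), then lowercase the uppercase runs.
sub format_translit {
    my ($s) = @_;
    $s =~ s/([\p{Ll}\p{Lm}]+)/<i>$1<\/i>/g;   # lower + modifier --> italic
    $s =~ s/(\p{Lu}+)/lc($1)/ge;              # upper --> lower
    return $s;
}

my $orig = "\x{2be}\x{e0}-da-um-=T\x{da}G-:2 1 AKTUM-=T\x{da}G";

binmode STDOUT, ':encoding(UTF-8)';
print format_translit($orig), "\n";
```

The order of the two substitutions matters: italics are applied first, so the letters lowercased in the second step are not wrapped in <i> tags.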