ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Got a bit of a weird issue here. Code:

    $_    => été
    $desc => "test 123 un été à Tanger. élargir elargir ete"
    $desc =~ s|\b\Q$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;

($hit is a hash from earlier on in my code.)

If I change $_ to "un" (i.e. just normal letters), then the regex works absolutely perfectly. However, with the accented characters, it screws up and doesn't do the search + replace.

NB: I've also been recommended to use:

    use locale;

...from another forum, but that doesn't seem to fix it for me :(

Also, another important note: it seems that this:

    $_    => téé
    $desc => "test 123 un été à Tanger. élargir elargir téé"

i.e. a word that starts with a normal char, is absolutely fine and works. It only fails for words that start with accented characters :(

I've been recommended to you guys, as apparently you're the best of the best <G>

Anyone got any suggestions?

TIA

Andy

Replies are listed 'Best First'.
Re: Annoying Regex issue with foreign chars
by moritz (Cardinal) on Nov 05, 2008 at 08:05 UTC
    You need to decode the string first; only then are non-ASCII characters correctly identified as word characters or not. String literals are decoded automatically when the utf8 pragma is in effect:
    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    $_ = 'été';
    my $desc = "test 123 un été à Tanger. élargir elargir ete";
    $desc =~ s|\b\Q$_|<a href="url">$_</a>|sg;
    print $desc, $/;
    __END__
    test 123 un <a href="url">été</a> à Tanger. élargir elargir ete

    I've described that in much more detail in this article, and you could also read perluniintro, the Encode documentation, perlunifaq and perlunicode.
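For strings that do not come from literals (for example words read from a database), the decoding has to be done explicitly. The following is only a minimal sketch, assuming the data arrives as UTF-8 encoded bytes; the byte strings, the "url" placeholder and the trailing \b are illustrative assumptions, not taken from the original code:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: both strings arrive as raw UTF-8 bytes (e.g. from a database).
my $word_bytes = "\xC3\xA9t\xC3\xA9";    # bytes for "été"
my $desc_bytes = "test 123 un \xC3\xA9t\xC3\xA9 \xC3\xA0 Tanger. \xC3\xA9largir elargir ete";

# Decode bytes into character strings so that \b and \w see "é" as a letter.
my $word = decode('UTF-8', $word_bytes);
my $desc = decode('UTF-8', $desc_bytes);

# The trailing \b is an addition here, so that longer words are left alone.
(my $linked = $desc) =~ s|\b\Q$word\E\b|<a href="url">$word</a>|g;

# Encode again on the way out.
print encode('UTF-8', $linked), "\n";
```

Only the standalone "été" should be wrapped in a link here; "élargir" and the unaccented "ete" stay untouched.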

Re: Annoying Regex issue with foreign chars
by GrandFather (Saint) on Nov 05, 2008 at 08:15 UTC

    The problem is that \b (match a word boundary) doesn't recognise é as a word (\w) character. You can check that quickly by trying $desc =~ s|\w|X|sg; which Xs out all the letters except the accented ones.

    Because of the limited information about the larger context it's a little hard to advise a best solution. It may be that your match doesn't require the word boundary check, or it may be that you can use a lookbehind check for white space ((?<=\s)) to achieve the same effect.
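The \w check can be reproduced on undecoded data. This is a small sketch, assuming the strings are raw UTF-8 bytes and that no locale or Unicode feature is switched on; the sample strings are made up for illustration. It also shows why a word starting with an ASCII letter ("téé") still matches after \b while "été" does not:

```perl
use strict;
use warnings;

# Assumption: undecoded UTF-8 bytes, as they might come from a database.
my $desc = "un \xC3\xA9t\xC3\xA9 t\xC3\xA9\xC3\xA9";    # bytes for "un été téé"

# \w only matches the ASCII letters; the bytes of "é" are left alone.
(my $masked = $desc) =~ s/\w/X/g;

my $accent_first = "\xC3\xA9t\xC3\xA9";    # bytes for "été"
my $ascii_first  = "t\xC3\xA9\xC3\xA9";    # bytes for "téé"

# \b needs a word character on exactly one side of the position.
my $found_accent = $desc =~ /\b\Q$accent_first\E/ ? 1 : 0;
my $found_ascii  = $desc =~ /\b\Q$ascii_first\E/  ? 1 : 0;

print "masked: $masked\n";
print "accent-first matched: $found_accent, ASCII-first matched: $found_ascii\n";
```

Between a space and the first byte of "é" there is no word character on either side, so \b cannot match there; between a space and "t" it can.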


    Perl reduces RSI - it saves typing
      Hi, Thanks for the replies :) I was expecting an email response - so good thing I checked back to see if there were any replies :) I've tried:
      sub GetAdvertsForLink {
          use strict;
          use warnings;
          use utf8;
          binmode STDOUT, ':encoding(UTF-8)';

          $_ = 'été';
          my $desc = "test 123 un été à Tanger. élargir elargir ete";
          $desc =~ s|\b\Q$_|<a href="url">$_</a>|sg;
          print "FOO" . $desc;
      }
      You can see this in operation here: http://www.sudimedia.com/cgi-bin/link/page.cgi?g=Detailed%2F3.html;d=1

      The problem now seems to be:
      1) It doesn't link at all
      2) The characters come up as weird characters in FF, and as blank spaces in IE 7

      If this is any help, the whole function is:
      #<%Plugins::SponsorText::GetAdvertsForLink($Description,$category_id)%>
      sub GetAdvertsForLink2 {
          use locale;

          my $desc    = $_[0];
          my $link_id = $_[1];

          my $cat_tbl = $DB->table('CatLinks');
          $cat_tbl->select_options('LIMIT 1');
          my $cat_id = $cat_tbl->select( ['CategoryID'] , { LinkID => $link_id } )->fetchrow;

          my $cond = new GT::SQL::Condition;
          $cond->add('CatIDs','LIKE',"%,$cat_id,%");
          $cond->add('CatIDs','LIKE',"%,$cat_id");
          $cond->add('CatIDs','LIKE',"$cat_id,%");
          $cond->add('CatIDs','LIKE',"$cat_id");
          $cond->bool('OR');

          my @words;
          print $IN->header;

          # print qq|GOT CAT ID: $cat_id|;
          # use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);

          my $sth = $DB->table('SponsorLinkText')->select( $cond ) || die $GT::SQL::error;

          while (my $hit = $sth->fetchrow_hashref) {
              my (@words) = split /,/, $hit->{Words};
              $hit->{Target} ||= '_blank';
              foreach (@words) {
                  print qq|Looking for "$_" in "$desc" <br />\n|;
                  if ($hit->{TheText}) {
                      $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
                  } else {
                      $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
                  }
              }
          }

          return $desc;
      }
      It's for a program called Gossamer Links. Basically, this is how it works:
      1) Grabs entries from the SponsorLinkText table, and gets the words to link, the link itself, and a few other bits of info
      2) Splits the words up, and then does:
      foreach (@words) {
          print qq|Looking for "$_" in "$desc" <br />\n|;
          if ($hit->{TheText}) {
              $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
          } else {
              $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
          }
      }
      (which basically links the words to their appropriate letters - at least in theory =)) TIA for any more suggestions. Andy
      Hi, Also, regarding your suggestion of: $desc =~ s|(<=\s)$_(<=\s)|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg; ...is that what you mean? TIA Andy
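For reference, Perl spells a lookbehind as (?<=...) and a lookahead as (?=...); a plain (<=\s) is just a capture group that matches the literal characters "<=" followed by whitespace. A minimal sketch of the lookaround form, using a made-up word, description and "url" placeholder rather than the real $hit data:

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my $word = 'un';
my $desc = 'test 123 un lundi';

# (?<=\s) requires whitespace before the word, (?=\s) requires it after.
# Neither consumes the whitespace, so the replacement need not put it back.
(my $linked = $desc) =~ s|(?<=\s)\Q$word\E(?=\s)|<a href="url">$word</a>|g;

print "$linked\n";
```

Note that this form does not match a word at the very start or end of the string, or one followed by a full stop; (?<!\S) and (?!\S) are a common variation that at least covers the string edges.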
        Hi, Ok, this seems to be working ok (and should do what I need);

        $desc =~ s|([\s\.]?)$_([\s\.]?)|$1<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>$2|sg;
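One thing that may be worth checking with this pattern: both groups are optional, so it can also match in the middle of a longer word. The sketch below uses made-up sample data and a "url" placeholder to compare it with a variant based on negative lookarounds, which is one possible way to require a boundary and not necessarily the right one for this application:

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my $word = 'un';
my $desc = 'un lundi.';

# Optional groups: the word is linked even inside "lundi".
(my $loose = $desc) =~ s|([\s.]?)\Q$word\E([\s.]?)|$1<a href="url">$word</a>$2|g;

# Negative lookarounds: the neighbouring character, if there is one,
# must be whitespace or a full stop.
(my $strict = $desc) =~ s|(?<![^\s.])\Q$word\E(?![^\s.])|<a href="url">$word</a>|g;

print "loose:  $loose\n";
print "strict: $strict\n";
```

Because the lookarounds only inspect single characters, this variant also behaves the same on undecoded bytes, as long as the separators are plain ASCII.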


        Thanks for the help guys :)

        Cheers Andy