in reply to Annoying Regex issue with forign chars

The problem is that \b (match a word boundary) doesn't recognise é as a word (\w) character. You can check that quickly by trying $desc =~ s|\w|X|sg; which Xs out all the letters except the accented ones.
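
A minimal sketch of that check, assuming the description reaches the regex as undecoded UTF-8 bytes (the thread doesn't confirm how the data is stored, so treat that as a guess). The variable names are illustrative only. Decoding to characters first is one way to get \w, and therefore \b, to treat é as a letter:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: raw UTF-8 bytes for "un été", e.g. as read from a database.
my $bytes = "un \xC3\xA9t\xC3\xA9";

# On undecoded bytes, neither byte of "é" counts as a \w character,
# so only the plain ASCII letters are replaced.
(my $masked_bytes = $bytes) =~ s/\w/X/g;

# After decoding to a character string, \w matches "é" as well.
my $chars = decode('UTF-8', $bytes);
(my $masked_chars = $chars) =~ s/\w/X/g;

print "$masked_bytes\n";                        # accented bytes left untouched
print encode('UTF-8', $masked_chars), "\n";     # XX XXX
```

If the second line comes out fully masked but the first doesn't, the string was being matched as bytes.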

Because of the limited information about the larger context it's a little hard to advise a best solution. It may be that your match doesn't require the word boundary check, or it may be that you can use a look-behind check for white space ((?<=\s)) to achieve the same effect.
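
As a rough sketch of the look-behind idea: the helper name link_word and the punctuation list are made up for illustration, and (?<!\S) is swapped in for (?<=\s) so that a word at the very start of the string still matches. A matching look-ahead stops the word from matching inside a longer one:

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text

# Hypothetical helper: wrap each standalone occurrence of $word in a link.
sub link_word {
    my ($desc, $word, $url) = @_;
    # (?<!\S)          preceded by white space or the start of the string
    # (?![^\s.,;:!?])  followed by white space, common punctuation, or the end
    $desc =~ s{(?<!\S)\Q$word\E(?![^\s.,;:!?])}{<a href="$url">$word</a>}g;
    return $desc;
}

my $linked = link_word("test 123 un été à Tanger. élargir elargir ete", "été", "url");

binmode STDOUT, ':encoding(UTF-8)';
print "$linked\n";
```

Because neither check relies on \w, this behaves the same for accented and unaccented words.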


Perl reduces RSI - it saves typing

Re^2: Annoying Regex issue with forign chars
by ultranerds (Hermit) on Nov 05, 2008 at 12:06 UTC
    Hi, Thanks for the replies :) I was expecting an email response - so good thing I checked back to see if there were any replies :) I've tried:
    sub GetAdvertsForLink {
        use strict;
        use warnings;
        use utf8;
        binmode STDOUT, ':encoding(UTF-8)';
        $_ = 'été';
        my $desc = "test 123 un été à Tanger. élargir elargir ete";
        $desc =~ s|\b\Q$_|<a href="url">$_</a>|sg;
        print "FOO" . $desc;
    }
    You can see this in operation here: http://www.sudimedia.com/cgi-bin/link/page.cgi?g=Detailed%2F3.html;d=1
    The problem now seems to be:
    1) It doesn't link at all
    2) The characters come up as weird characters (in FF), and as blank spaces in IE 7
    If this is any help, the whole function is:
    #<%Plugins::SponsorText::GetAdvertsForLink($Description,$category_id)%>
    sub GetAdvertsForLink2 {
        use locale;
        my $desc = $_[0];
        my $link_id = $_[1];
        my $cat_tbl = $DB->table('CatLinks');
        $cat_tbl->select_options('LIMIT 1');
        my $cat_id = $cat_tbl->select( ['CategoryID'] , { LinkID => $link_id } )->fetchrow;
        my $cond = new GT::SQL::Condition;
        $cond->add('CatIDs','LIKE',"%,$cat_id,%");
        $cond->add('CatIDs','LIKE',"%,$cat_id");
        $cond->add('CatIDs','LIKE',"$cat_id,%");
        $cond->add('CatIDs','LIKE',"$cat_id");
        $cond->bool('OR');
        my @words;
        print $IN->header;
        # print qq|GOT CAT ID: $cat_id|;
        # use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);
        my $sth = $DB->table('SponsorLinkText')->select( $cond ) || die $GT::SQL::error;
        while (my $hit = $sth->fetchrow_hashref) {
            my (@words) = split /,/, $hit->{Words};
            $hit->{Target} ||= '_blank';
            foreach (@words) {
                print qq|Looking for "$_" in "$desc" <br />\n|;
                if ($hit->{TheText}) {
                    $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
                } else {
                    $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
                }
            }
        }
        return $desc;
    }
    It's for a program called Gossamer Links. Basically, this is how it works:
    1) Grabs entries from the SponsorLinkText table, and gets the words to link/the link itself, and a few other bits of info
    2) Splits the words up, and then does:
    foreach (@words) {
        print qq|Looking for "$_" in "$desc" <br />\n|;
        if ($hit->{TheText}) {
            $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
        } else {
            $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
        }
    }
    (which basically links the words to their appropriate letters - at least in theory =)) TIA for any more suggestions. Andy
Re^2: Annoying Regex issue with forign chars
by ultranerds (Hermit) on Nov 05, 2008 at 12:12 UTC
    Hi, Also, regarding your suggestion of:

    $desc =~ s|(<=\s)$_(<=\s)|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;

    ...is that what you mean? TIA Andy
      Hi, Ok, this seems to be working ok (and should do what I need);

      $desc =~ s|([\s\.]?)$_([\s\.]?)|$1<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>$2|sg;


      Thanks for the help guys :)

      Cheers Andy
        Bugger, that had an undesired effect :(

        It was converting words inside the <a href, and giving really weird output.

        Only way I eventually got it to work was with this bit of messy code:

        if ($hit->{TheText}) {
            $desc =~ s|\Q $_ | <a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a> |sg;
            $desc =~ s|\Q $_\E$| <a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
            $desc =~ s|\Q $_,| <a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>, |sg;
            $desc =~ s|\Q. $_ |. <a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a> |sg;
        } else {
            $desc =~ s|\Q $_ | <a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a> |sg;
            $desc =~ s|\Q $_\E$| <a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
            $desc =~ s|\Q $_, | <a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>, |sg;
            $desc =~ s|\Q. $_ |. <a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a> |sg;
        }


        Cheers

        Andy
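
One possible way to avoid words being converted inside an already-inserted <a href is to link every word in a single substitution, since s///g never rescans the text it has just inserted. This is only a sketch: link_words and the word-to-URL hash are hypothetical stand-ins for the SponsorLinkText rows, and the title/target attributes are left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: link every listed word in one pass over the text.
# $links is a hash reference mapping each word to its URL.
sub link_words {
    my ($desc, $links) = @_;

    # Longest words first, so a longer word wins over its own prefix.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) }
        keys %$links;
    return $desc if $alternation eq '';

    # White space (or start) before, white space/punctuation (or end) after.
    $desc =~ s{(?<!\S)($alternation)(?![^\s.,;:!?])}{<a href="$links->{$1}">$1</a>}g;
    return $desc;
}

my $out = link_words(
    "buy a link. href here",
    { link => "http://example.com/link", href => "http://example.com/href" },
);
print "$out\n";
```

Here the word "href" is linked where it appears in the text, but the href= inside the tag generated for "link" is left alone.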