Annoying Regex issue with foreign chars

ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Got a bit of a weird issue here; Code:

$_ => été 
$desc => "test 123 un été à Tanger. élargir elargir ete"

$desc =~ s|\b\Q$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;

($hit is a hash from earlier on in my code.) If I change $_ to "un" (i.e. just normal letters), then the regex works absolutely perfectly. However, with the accented characters, it screws up and doesn't do the search and replace. NB: I've also been recommended, on another forum, to use `use locale;`, but that doesn't seem to fix it for me :( Also, another important note: it seems that this:

$_ => téé 
$desc => "test 123 un été à Tanger. élargir elargir téé"

i.e. a word that starts with a normal char, is absolutely fine and works. It only fails with words that start with accented characters :( I've been recommended to you guys, as apparently you're the best of the best <G> Anyone got any suggestions?

TIA

Andy


Re: Annoying Regex issue with foreign chars by moritz (Cardinal) on Nov 05, 2008 at 08:05 UTC
You need to decode the string first; then non-ASCII characters are correctly identified as being word characters or not. String literals are automatically decoded by the utf8 pragma:

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';
    $_ = 'été';
    my $desc = "test 123 un été à Tanger. élargir elargir ete";
    $desc =~ s|\b\Q$_|<a href="url">$_</a>|sg;
    print $desc, $/;
    __END__
    test 123 un <a href="url">été</a> à Tanger. élargir elargir ete

I've described that in much more detail in this article, and you could also read perluniintro, the Encode documentation, perlunifaq and perlunicode.
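When the strings come from outside the program (a database, a form, a file) rather than from literals, `use utf8` does not help, and the decoding has to be done explicitly. The following is a minimal sketch of that idea, assuming the data arrives as UTF-8 bytes; the variable names and the sample strings are made up for illustration, and `url` is the same placeholder used above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-ins for values that arrive as raw UTF-8 bytes
# (assumption: the data source does not decode them for you).
my $word_bytes = "\xC3\xA9t\xC3\xA9";                     # "été" as UTF-8 bytes
my $desc_bytes = "un \xC3\xA9t\xC3\xA9 \xC3\xA0 Tanger";  # "un été à Tanger"

# Decode once at the input boundary so Perl sees characters, not bytes.
my $word = decode('UTF-8', $word_bytes);
my $desc = decode('UTF-8', $desc_bytes);

# On decoded strings, \b treats the accented letters as word characters.
(my $linked = $desc) =~ s|\b\Q$word\E|<a href="url">$word</a>|g;

# Encode again at the output boundary.
print encode('UTF-8', $linked), "\n";
```

Decoding at the input and encoding at the output keeps every regex in between working on characters, which is the point of the advice above.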
Re: Annoying Regex issue with forign chars by GrandFather (Saint) on Nov 05, 2008 at 08:15 UTC
The problem is that \b (match word break) doesn't recognise é as a word (\w) character. You can check that quickly by trying `$desc =~ s|\w|X|sg;` which Xs out all the letters except the accented ones. Because of the limited information about the larger context it's a little hard to advise a best solution. It may be that your match doesn't require the word boundary check, or it may be that you can use a lookbehind check for white space (`(?<=\s)`) to achieve the same effect.

Perl reduces RSI - it saves typing
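The `\w` check described above can be made explicit by counting matches on the same word in two forms. This is only a sketch, assuming the text is UTF-8 and that no `use locale` or `unicode_strings` feature is in effect; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9t\xC3\xA9";      # "été" as undecoded UTF-8 bytes
my $chars = decode('UTF-8', $bytes);  # the same word as three characters

# Count how many positions \w matches in each form.
my $w_in_bytes = () = $bytes =~ /\w/g;  # only the plain "t" is a word char
my $w_in_chars = () = $chars =~ /\w/g;  # all three letters are word chars

printf "bytes: %d of %d match \\w\n", $w_in_bytes, length $bytes;
printf "chars: %d of %d match \\w\n", $w_in_chars, length $chars;
```

In the undecoded form the two bytes that make up each é are not word characters, so there is no word boundary in front of the first one and `\b` cannot match there.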
Re^2: Annoying Regex issue with forign chars by ultranerds (Hermit) on Nov 05, 2008 at 12:06 UTC
Hi, thanks for the replies :) I was expecting an email response, so it's a good thing I checked back to see if there were any replies :) I've tried:

    sub GetAdvertsForLink {
        use strict;
        use warnings;
        use utf8;
        binmode STDOUT, ':encoding(UTF-8)';
        $_ = 'été';
        my $desc = "test 123 un été à Tanger. élargir elargir ete";
        $desc =~ s|\b\Q$_|<a href="url">$_</a>|sg;
        print "FOO" . $desc;
    }

You can see this in operation here: http://www.sudimedia.com/cgi-bin/link/page.cgi?g=Detailed%2F3.html;d=1

The problem now seems to be:

1) It doesn't link at all
2) The characters come up as weird characters in FF, and as blank spaces in IE 7

If this is any help, the whole function is:

    #<%Plugins::SponsorText::GetAdvertsForLink($Description,$category_id)%>
    sub GetAdvertsForLink2 {
        use locale;
        my $desc = $_[0];
        my $link_id = $_[1];
        my $cat_tbl = $DB->table('CatLinks');
        $cat_tbl->select_options('LIMIT 1');
        my $cat_id = $cat_tbl->select( ['CategoryID'] , { LinkID => $link_id } )->fetchrow;
        my $cond = new GT::SQL::Condition;
        $cond->add('CatIDs','LIKE',"%,$cat_id,%");
        $cond->add('CatIDs','LIKE',"%,$cat_id");
        $cond->add('CatIDs','LIKE',"$cat_id,%");
        $cond->add('CatIDs','LIKE',"$cat_id");
        $cond->bool('OR');
        my @words;
        print $IN->header;
        # print qq|GOT CAT ID: $cat_id|;
        # use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);
        my $sth = $DB->table('SponsorLinkText')->select( $cond ) || die $GT::SQL::error;
        while (my $hit = $sth->fetchrow_hashref) {
            my (@words) = split /,/, $hit->{Words};
            $hit->{Target} ||= '_blank';
            foreach (@words) {
                print qq|Looking for "$_" in "$desc" <br />\n|;
                if ($hit->{TheText}) {
                    $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
                } else {
                    $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
                }
            }
        }
        return $desc;
    }

It's for a program called Gossamer Links.
Basically, this is how it works:

1) Grabs entries from the SponsorLinkText table, and gets the words to link/the link itself, and a few other bits of info
2) Splits the words up, and then does:

    foreach (@words) {
        print qq|Looking for "$_" in "$desc" <br />\n|;
        if ($hit->{TheText}) {
            $desc =~ s|\b$_|<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>|sg;
        } else {
            $desc =~ s|\b$_|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;
        }
    }

(which basically links the words to their appropriate letters - at least in theory =))

TIA for any more suggestions.

Andy
Re^2: Annoying Regex issue with forign chars by ultranerds (Hermit) on Nov 05, 2008 at 12:12 UTC
Hi, Also, regarding your suggestion of: `$desc =~ s|(<=\s)$_(<=\s)|<a href="$hit->{LinkURL}" target="$hit->{Target}">$_</a>|sg;` ...is that what you mean? TIA Andy
Re^3: Annoying Regex issue with forign chars by ultranerds (Hermit) on Nov 05, 2008 at 12:21 UTC
Hi, Ok, this seems to be working ok (and should do what I need): `$desc =~ s|([\s\.]?)$_([\s\.]?)|$1<a href="$hit->{LinkURL}" title="$hit->{TheText}" target="$hit->{Target}">$_</a>$2|sg;` Thanks for the help guys :) Cheers Andy
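One thing to watch with the pattern above: because both groups are optional (`?`), the word can also match in the middle of a longer word. A possible alternative, sketched here under the assumption that words are delimited by whitespace or a full stop, is to use lookarounds in the spirit of the `(?<=\s)` suggestion earlier in the thread. The sample text, the word list and the `url` placeholder are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data, written as UTF-8 bytes ("lundi, un été.").
my $desc  = "lundi, un \xC3\xA9t\xC3\xA9.";
my @words = ("un", "\xC3\xA9t\xC3\xA9");

for my $word (@words) {
    # (?<!\S)    : start of string or whitespace before the word
    # (?![^\s.]) : end of string, whitespace, or a full stop after it
    # \Q...\E    : treat the word literally, even if it contains regex syntax
    $desc =~ s|(?<!\S)\Q$word\E(?![^\s.])|<a href="url">$word</a>|g;
}

print "$desc\n";
```

Here "un" is linked as a standalone word but not inside "lundi", and since the lookarounds only test for whitespace and a full stop, the sketch does not depend on `\w` recognising accented letters.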
Re^4: Annoying Regex issue with forign chars by ultranerds (Hermit) on Nov 05, 2008 at 14:41 UTC