variables in regex character classes

by amir_e_a (Hermit)
on Jul 22, 2006 at 19:32 UTC

amir_e_a has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

Consider this code:

    use strict;
    use warnings;

    my $BEL_LETTERS = qr/ABVHD/;
    my @all_words = qw(tak het hen toj tah tam tym);
    ...
    while (my $next_line = <$FH>) {
        foreach my $next_word (@all_words) {
            my $count_words = ($next_line =~ s/\b($next_word[$BEL_LETTERS]*)\b/>$1</gi);
        }
    }

The substitution looks for words that begin with any of @all_words and end with zero or more of $BEL_LETTERS.

This code doesn't compile. Problem: the compiler thinks that $next_word[$BEL_LETTERS] is member $BEL_LETTERS of array @next_word.

This works - notice the parentheses:

s/\b($next_word([$BEL_LETTERS])*)\b/>$1</gi);

...But it feels like a hack to me. Is there a better way to overcome this interpolation?

I tried searching for the answer here in Q&A and in the Camel book, and strangely couldn't find it. I'm quite sure that the answer must be in the book, and if someone can point me to the page I'll be very thankful too.

(If anyone is curious, this code is looking for pronouns in a Belarusian text. This example here is very simplified - i didn't want to put the whole Belarusian-Cyrillic alphabet here).

Replies are listed 'Best First'.
Re: variables in regex character classes
by Hue-Bond (Priest) on Jul 22, 2006 at 19:48 UTC
    the compiler thinks that $next_word[$BEL_LETTERS] is member $BEL_LETTERS of array @next_word.

    Use ${next_word}[$BEL_LETTERS]. It's documented in perldata, search for "there is an unfortunate ambiguity".

    --
    David Serrano
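A minimal sketch of the brace form in context. Two things here are assumptions rather than part of the original post: the letters are held in a plain string instead of a qr// object, so that they interpolate as bare class members, and `mark_words` is a hypothetical helper that replaces the file-reading loop.

```perl
use strict;
use warnings;

# Assumption: a plain string, not qr//, so only the letters themselves
# end up inside the character class.
my $BEL_LETTERS = 'ABVHD';
my @all_words   = qw(tak het hen toj tah tam tym);

# Hypothetical helper: wrap each matching word in >...< and count the hits.
sub mark_words {
    my ($line) = @_;
    my $count = 0;
    foreach my $next_word (@all_words) {
        # ${next_word} ends the variable name, so the following [
        # starts a character class instead of an array subscript.
        $count += ($line =~ s/\b(${next_word}[$BEL_LETTERS]*)\b/>$1</gi);
    }
    return ($line, $count);
}

my ($marked, $count) = mark_words("tak i taka, hen");
print "$marked ($count)\n";
```

With the braces in place the pattern compiles under `use strict`, because Perl no longer looks for an array named @next_word.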

Re: variables in regex character classes
by Joost (Canon) on Jul 22, 2006 at 19:50 UTC
Re: variables in regex character classes
by Ieronim (Friar) on Jul 22, 2006 at 20:20 UTC
    Your code contains an error. Even if it compiled, it would not search for what you want. A qr// object is interpolated as (?-xism:$string), so you would actually be searching for e.g. /\btak[(?-xism:ABVHD)]*\b/, which I'm sure is not what you want.
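A short sketch that makes the wrapper visible. The variable names are illustrative only. Note that the exact wrapper text depends on the perl version: older perls stringify the object as (?-xism:ABVHD), while 5.14 and later use (?^:ABVHD).

```perl
use strict;
use warnings;

my $as_qr     = qr/ABVHD/;
my $as_string = 'ABVHD';

# Stringifying a qr// object shows exactly what gets interpolated.
print "qr// interpolates as: $as_qr\n";

# Inside [...] the wrapper's punctuation becomes part of the class,
# so a literal '?' is accepted by the qr-based class ...
my $qr_class_matches_qmark  = ('?' =~ /^[$as_qr]$/)     ? 1 : 0;
# ... but not by the class built from a plain string.
my $str_class_matches_qmark = ('?' =~ /^[$as_string]$/) ? 1 : 0;

print "qr-based class matches '?':     $qr_class_matches_qmark\n";
print "string-based class matches '?': $str_class_matches_qmark\n";
```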

    Consider the following code (I can give an example in Cyrillic Windows-1251, but I don't know if it's compatible with the Belarusian variant):

    my $letters = '[a-zA-Z]'; # try - maybe the character range will work for you.
                              # It works in Cyr-1251, but does not in KOI8-R

    # /i goes on the qr// itself: a compiled pattern keeps its own flags,
    # so an /i on the enclosing s/// would not reach it.
    my @wordpatterns = map { qr/(?<!$letters)(\Q$_\E$letters*)/i } qw(tak het hen toj);

    while (my $next_line = <$FH>) {
        foreach my $pattern (@wordpatterns) {
            my $count_words = ($next_line =~ s/$pattern/>$1</g);
        }
    }

    The code is untested, but it should do the Right Thing much better than yours. It re-uses the generated regexes and may make processing of large amounts of data noticeably faster.

    One more note: the /i switch won't work for Cyrillic encodings without a carefully set locale. The behaviour of word boundaries (\b) is also wrong if the locale is wrong, so I removed them from my regex too.


         s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print

      Thanks for the suggestions. I'll try it.

      I had a hunch that I was being too clever with qr//.

      I prefer using UTF-8 as the encoding. If i use Unicode, do i still need to set a locale? I didn't set a locale, but i saved all the relevant files as UTF-8 and said

      use encoding 'utf8';
      ...
      open my $FILE_HANDLE, "<:utf8", $FILE_NAME;

      ... And it seems that \b works as intended, even if the rest of the pattern is not so good :)

      Character range is problematic for Belarusian, because in Unicode the order of the letters is the Russian standard, and Belarusian is slightly different. So i think that it is safest to simply write all the possible letters.

      Any thoughts?...
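To illustrate the point about ranges: in Unicode the block U+0430..U+044F (Russian а to я) is contiguous, but the Belarusian letters ё (U+0451), і (U+0456) and ў (U+045E) sit outside it. The sketch below builds an explicit lowercase class; the letters are given by code point only to keep the example independent of the file's encoding, and the variable names are assumptions.

```perl
use strict;
use warnings;

my $short_u = "\x{045E}";    # CYRILLIC SMALL LETTER SHORT U

# A plain range covers U+0430..U+044F only.
my $range_class = qr/^[\x{0430}-\x{044F}]$/;

# Explicit lowercase Belarusian alphabet: the range without
# U+0438, U+0449 and U+044A, plus the three letters outside it.
my $BEL_LETTERS = join '', map { chr($_) }
    (0x0430..0x0437, 0x0439..0x0448, 0x044B..0x044F,
     0x0451, 0x0456, 0x045E);
my $explicit_class = qr/^[$BEL_LETTERS]$/;

my $in_range    = ($short_u =~ $range_class)    ? 1 : 0;
my $in_explicit = ($short_u =~ $explicit_class) ? 1 : 0;

printf "letters listed: %d, range matches: %d, explicit matches: %d\n",
    length($BEL_LETTERS), $in_range, $in_explicit;
```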

        If you use Unicode, you don't need to set a locale. And using Unicode is much better than setting a locale.
        But it's better to specify
        use utf8;
        instead of use encoding 'utf8';.

        On Unicode data \b, as well as the /i switch, will work as expected. And if you are not sure about the character ranges, it's of course better to type the alphabet.

        Good luck!

             s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
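One detail worth keeping in mind: `use utf8` only tells perl that the source file itself is UTF-8; input data still has to be decoded, for example with the "<:utf8" layer on open shown earlier in the thread. The sketch below, which uses the core Encode module and illustrative variable names, compares matching against raw bytes and against a decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of "Так" (capital Т), as they would arrive from a raw read.
my $bytes   = "\xD0\xA2\xD0\xB0\xD0\xBA";
# Lowercase "так", written with code points.
my $pattern = "\x{0442}\x{0430}\x{043A}";

# Against undecoded bytes neither /i nor \b sees Cyrillic letters.
my $raw_match = ($bytes =~ /\b$pattern\b/i) ? 1 : 0;

# After decoding, the string holds characters and Unicode rules apply.
my $text = decode('UTF-8', $bytes);
my $decoded_match = ($text =~ /\b$pattern\b/i) ? 1 : 0;

print "raw bytes: $raw_match, decoded: $decoded_match\n";
```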
