in reply to variables in regex character classes

Your code contains an error. Even if it compiled, it would not search for what you want. Patterns built with qr// are interpolated as (?-xism:$string), so you would actually be searching for e.g. /\btak[(?-xism:ABVH)]/, which i'm sure is not what you want.
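A minimal sketch of the problem, using the ABVH set from the example above as a stand-in (the variable names are only for illustration). On newer perls the stringified form is (?^:ABVH) rather than (?-xism:ABVH), but the effect inside a character class is the same: the wrapper's punctuation becomes part of the class.

```perl
use strict;
use warnings;

my $set         = qr/ABVH/;   # a compiled regex, not a plain string
my $stringified = "$set";     # "(?-xism:ABVH)" on older perls, "(?^:ABVH)" on newer ones

# Interpolated inside [...], the wrapper characters become class members,
# so the class now also accepts '(' , '?' , ':' and ')'.
my $broken        = qr/\btak[$set]/;
my $matches_paren = ('tak(' =~ $broken) ? 1 : 0;

# Keeping the letters in a plain string avoids the wrapper entirely.
my $chars               = 'ABVH';
my $fixed               = qr/\btak[$chars]/;
my $fixed_rejects_paren = ('tak(' =~ $fixed) ? 0 : 1;
```

So the letters that go into a character class are better kept as an ordinary string, and only the finished pattern compiled with qr//.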

Consider the following code (i can give an example in Cyrillic-Windows-1251, but i don't know if it's compatible with the Belarusian variant):

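my $letters = '[a-zA-Z]'; # try it: maybe a character range will work for you.
                          # It works in Cyrillic-Windows-1251, but not in KOI8-R.

# Compile each word pattern once. The /i flag goes on qr// itself, because a
# compiled regex keeps its own flags when it is used inside another pattern.
my @wordpatterns = map { qr/(?<!$letters)(\Q$_\E$letters*)/i } qw(tak het hen toj);

while (my $next_line = <$FH>) {
    foreach my $pattern (@wordpatterns) {
        # s///g returns the number of substitutions made
        my $count_words = ($next_line =~ s/$pattern/>$1</g);
    }
}
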
The code is untested, but it should do the Right Thing much better than yours. It re-uses the generated regexes, which may make processing large amounts of data noticeably faster.
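To try the idea without a file, here is a small self-contained sketch that runs the same kind of precompiled patterns over one in-memory line. The sample sentence, the shortened word list and the mark_words helper are my own assumptions for illustration; they are not part of the original question.

```perl
use strict;
use warnings;

my $letters = '[a-zA-Z]';
my @words   = qw(tak het);   # shortened sample word list

# One compiled pattern per word, built once and reused for every line.
my %pattern_for = map { $_ => qr/(?<!$letters)(\Q$_\E$letters*)/i } @words;

# Hypothetical helper: marks matches as >word< and counts them per word.
sub mark_words {
    my ($line) = @_;
    my %count;
    for my $word (@words) {
        my $re = $pattern_for{$word};
        # s///g returns the number of substitutions, or an empty string if none
        $count{$word} = ($line =~ s/$re/>$1</g) || 0;
    }
    return ($line, \%count);
}

my ($marked, $count) = mark_words('Tak, taking stock of the attack; het!');
print "$marked\n";
```

The negative lookbehind plays the role of a word boundary here: "taking" is marked because it starts a word, while the "t" inside "attack" is never considered.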

One more note: the /i switch won't work for Cyrillic encodings without a carefully set locale. The behaviour of word boundaries (\b) is also wrong if the locale is wrong, so i removed them from my regex too.


     s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print

Re^2: variables in regex character classes
by amir_e_a (Hermit) on Jul 23, 2006 at 14:37 UTC

    Thanks for the suggestions. I'll try it.

    I had a hunch that i was being too clever with qr//.

    I prefer using UTF-8 as the encoding. If i use Unicode, do i still need to set a locale? I didn't set a locale, but i saved all the relevant files as UTF-8 and said

    use encoding 'utf8';
    ...
    open my $FILE_HANDLE, "<:utf8", $FILE_NAME;

    ... And it seems that \b works as intended, even if the rest of the pattern is not so good :)

    A character range is problematic for Belarusian, because in Unicode the order of the letters follows the Russian standard, and the Belarusian alphabet is slightly different. So i think it is safest to simply write out all the possible letters.

    Any thoughts?...

      If you use Unicode, you don't need to set a locale. And using Unicode is much better than setting a locale.
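      But it's better to specify
      use utf8;
      instead of use encoding 'utf8';. Note that use utf8; only tells perl that the source code itself is UTF-8, so you still need the "<:utf8" layer on the filehandle, as you already have.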

      On Unicode data \b, as well as the /i switch, will work as expected. And if you are not sure about the character ranges, it's of course better to type the alphabet.
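A rough sketch of "typing the alphabet" under use utf8;. The letter list is my own transcription of the 32-letter Belarusian Cyrillic alphabet (lowercase only, no apostrophe), and the sample words are assumptions, so please check both against your data before relying on them.

```perl
use strict;
use warnings;
use utf8;   # the source below contains literal UTF-8 Cyrillic

# Explicit letter list instead of a range such as [а-я];
# transcribed by hand, so verify it before use.
my $letters = '[абвгдеёжзійклмнопрстуўфхцчшыьэюя]';

# /i on the qr// so that capitalised words are found as well.
my $word_re = qr/(?<!$letters)(так$letters*)/i;

my $line  = 'Так, таксама';
my @found = $line =~ /$word_re/g;   # list context: one captured word per match
```

With an explicit list, і and ў are covered no matter where they sit in the Unicode code charts.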

      Good luck!

           s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print