Re: variables in regex character classes

Your code contains an error. Even if it compiled, it would not search for what you want. A qr// object is interpolated as (?-xism:$string), so you would actually be searching for something like /\btak[(?-xism:ABVH)]/, which I'm sure is not what you want.
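To make the point concrete, here is a minimal sketch. The letter set 'ABVH' and the variable names are only placeholders for illustration, and note that newer Perls (5.14 and later) stringify qr// as (?^:...) rather than (?-xism:...); the effect inside a character class is the same either way.

```perl
use strict;
use warnings;

my $chars = 'ABVH';        # placeholder letter set, kept as a plain string
my $class = qr/$chars/;    # compiled too early: this is now a regex object

# Stringifies as (?^:ABVH) on Perl 5.14+, (?-xism:ABVH) on older versions.
my $as_text = "$class";

# Inside [...] every character of that string becomes a class member,
# including the parentheses, '?' and ':'.
my $wrong = qr/\btak[$class]/;

# Interpolating the plain string gives the class that was intended.
my $right = qr/\btak[$chars]/;

my $paren_hits_wrong  = ('tak(' =~ $wrong) ? 1 : 0;   # 1: '(' is in the class
my $paren_hits_right  = ('tak(' =~ $right) ? 1 : 0;   # 0
my $letter_hits_right = ('takA' =~ $right) ? 1 : 0;   # 1
```

So the rule of thumb would be: keep the contents of a character class as an ordinary string, and call qr// only on the finished pattern.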

Consider the following code (I can give an example in Cyrillic Windows-1251, but I don't know whether it's compatible with the Belarusian variant):

# Try a character range first; it may work for you. It works in Windows-1251, but not in KOI8-R.
my $letters = '[a-zA-Z]';
# /i is set on qr// itself: flags on the enclosing s/// do not reach into an interpolated qr// object.
my @wordpatterns = map { qr/(?<!$letters)(\Q$_\E$letters*)/i } qw(tak het hen toj);

while (my $next_line = <$FH>) {
    foreach my $pattern (@wordpatterns) {
        my $count_words = ($next_line =~ s/$pattern/>$1</gi);
    }
}

The code is untested, but it should do the Right Thing much better than yours. It reuses the compiled regexes, which may make processing large amounts of data noticeably faster.

One more note: the /i switch won't work for Cyrillic encodings without a carefully set locale. The behaviour of word boundaries (\b) is also wrong if the locale is wrong, so I removed them from my regex too.

s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print

Re^2: variables in regex character classes by amir_e_a (Hermit) on Jul 23, 2006 at 14:37 UTC
Thanks for the suggestions. I'll try it. I had a hunch that I was being too clever with qr//.

I prefer using UTF-8 as the encoding. If I use Unicode, do I still need to set a locale? I didn't set a locale, but I saved all the relevant files as UTF-8 and said `use encoding 'utf8'; ... open my $FILE_HANDLE, "<:utf8", $FILE_NAME;` ... And it seems that \b works as intended, even if the rest of the pattern is not so good :)

A character range is problematic for Belarusian, because in Unicode the order of the letters follows the Russian standard, and Belarusian is slightly different. So I think it is safest to simply write out all the possible letters. Any thoughts?
Re^3: variables in regex character classes by Ieronim (Friar) on Jul 23, 2006 at 17:32 UTC
If you use Unicode, you don't need to set a locale, and using Unicode is much better than setting one. But it's better to specify `use utf8;` instead of `use encoding 'utf8';`. On Unicode data, `\b` and the `/i` switch will work as expected. And if you are not sure about the character ranges, it is of course better to type out the alphabet. Good luck!

s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
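A rough sketch of the "type out the alphabet" approach on Unicode data, for what it's worth. It assumes the script is saved as UTF-8; the helper name mark_words and the sample sentence are invented for illustration, and the letter list is my reading of the Belarusian Cyrillic alphabet, so please check it against your own needs (the apostrophe, for instance, is not included).

```perl
use strict;
use warnings;
use utf8;    # the source file itself is UTF-8, so literals are character strings

# Assumed letter list, typed explicitly instead of relying on a range.
my $alphabet = 'абвгдеёжзійклмнопрстуўфхцчшыьэюя';
my $letter   = qr/[$alphabet]/i;    # /i set here so it travels with the object

# Hypothetical helper: wraps every word that starts with one of the stems
# in >...<, skipping stems that occur in the middle of a longer word.
sub mark_words {
    my ($line, @stems) = @_;
    for my $stem (@stems) {
        $line =~ s/(?<!$letter)(\Q$stem\E$letter*)/>$1</gi;
    }
    return $line;
}

my $marked = mark_words('Так і так, атак', 'так');
# $marked is now '>Так< і >так<, атак'
```

When reading real files, the data still has to be decoded on input (for example with an encoding layer on open) for this to behave the same way.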