Regex checking text is made up of the keys of a hash.

heezy has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash containing language codes:

my %validLanguages = (
        "de" => "german",
        "en" => "english",
        "es" => "spanish",
        "fr" => "french",
        "it" => "italian",
        "ja" => "japanese",
        "ko" => "korean",
        "ru" => "russian",
        "sv" => "WHATS THIS",
        "zh" => "WHATS THIS",
        "zh_TW" => "WHATS THIS"
        );

...and I want to build a procedure that checks if a piece of text passed to it as a parameter is made up of only...

one or more keys from the language hash
white space
commas

How easy is this to do? I thought I could just use a regex, but then I wanted to use the keys of the hash in it and it all got far too complicated.

This is how far I got :(


# returns a true value if it is a valid list, e.g.
# "de, en, fr, ja"
# returns undef if it is not a valid list, e.g.
# "monkish fr, en"
# a valid list will always be in the form...
# language,\slanguage,\slanguage....


sub isItJustAListOfLanguages{

    my %validLanguages = (
              "de" => "german",
              "en" => "english",
              "es" => "spanish",
              "fr" => "french",
              "it" => "italian",
              "ja" => "japanese",
              "ko" => "korean",
              "ru" => "russian",
              "sv" => "WHATS THIS",
              "zh" => "WHATS THIS",
              "zh_TW" => "WHATS THIS"
              );
    my $textToTest = $_[0];
}

Help!


Replies are listed 'Best First'.
Re: Regex checking text is made up of the keys of a hash. by Zaxo (Archbishop) on Mar 01, 2003 at 03:02 UTC
Given the fixed hash %validLanguages (note that there is a level of hell for devising that name), start the sub definition and shift in one argument:

sub is_valid_list {
    my $text = shift;

Now split the string on runs of a character class of comma and whitespace:

    my @langs = grep {length} split /[\s,]+/, $text;

Look for nonexistent keys and return the negation of the scalar count of that list, which is false if some were found and true if none were. End of sub.

    ! grep { ! exists $validLanguages{$_} } @langs;
}

That's untested but I figure it will work.

Update: The original failed for strings with `[\s,]+` at the start, where split returns an empty leading field. Inserted `grep {length}` to take care of that.

After Compline,
Zaxo
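A runnable sketch of the split-and-check approach described in this reply. The hash is an abbreviated copy of the one in the question, and the test strings are only illustrative. It filters empty fields with `grep { length }`, because `split` returns an empty (but defined) leading field when the string starts with a separator.

```perl
use strict;
use warnings;

# Abbreviated copy of the language table from the question.
my %validLanguages = (
    de    => 'german',
    en    => 'english',
    fr    => 'french',
    ja    => 'japanese',
    zh    => 'WHATS THIS',
    zh_TW => 'WHATS THIS',
);

sub is_valid_list {
    my ($text) = @_;

    # Split on runs of commas and whitespace, then drop the empty leading
    # field that appears when the string starts with a separator.
    my @langs = grep { length } split /[\s,]+/, $text;

    # True when no token is missing from the hash.
    return !grep { !exists $validLanguages{$_} } @langs;
}

for my $text ('de, en, fr, ja', 'monkish fr, en', ', zh_TW ,') {
    printf "%-18s %s\n", "'$text'", is_valid_list($text) ? 'valid' : 'invalid';
}
```

Note that a string with no tokens at all (empty, or only separators) also counts as valid here, since there is nothing to reject. Add a check on `@langs` if at least one language is required.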
Re: Regex checking text is made up of the keys of a hash. by blakem (Monsignor) on Mar 01, 2003 at 03:04 UTC
It might not really be what you want, but I believe it fits the specs:

#!/usr/bin/perl -wT
use strict;

my %validLanguages = (
        "de" => "german",
        "en" => "english",
        "es" => "spanish",
        "fr" => "french",
        "it" => "italian",
        "ja" => "japanese",
        "ko" => "korean",
        "ru" => "russian",
        "sv" => "WHATS THIS",
        "zh" => "WHATS THIS",
        "zh_TW" => "WHATS THIS"
        );

# test it
for ('dog','cat','de,en',' ',',,,','deensvzh') {
    printf "%-10s %s a list of languages\n",
           "'$_'",
           (isItJustAListOfLanguages($_) ? "is" : "is NOT");
}

sub isItJustAListOfLanguages{
    my $text = shift;
    my @tokens = (keys %validLanguages, '\s',',');
    my $tokenpatt = join('|',@tokens);
    return $text =~ /^($tokenpatt)+$/;
}

__END__
'dog'      is NOT a list of languages
'cat'      is NOT a list of languages
'de,en'    is a list of languages
' '        is a list of languages
',,,'      is a list of languages
'deensvzh' is a list of languages

-Blake
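A hedged variation on the same idea, in case the keys ever contain regex metacharacters: each key is passed through `quotemeta`, longer keys are placed first so `zh_TW` is tried before `zh`, and the pattern is compiled once with `qr//`. The `@codes` list stands in for `keys %validLanguages`, and the sub name is only illustrative.

```perl
use strict;
use warnings;

# Stand-in for `keys %validLanguages` from the question.
my @codes = qw(de en es fr it ja ko ru sv zh zh_TW);

# Escape each key and order longer keys first.
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    @codes;

# One or more language codes, whitespace characters, or commas, and nothing else.
my $only_tokens = qr/\A(?:$alternatives|\s|,)+\z/;

sub is_just_a_list_of_languages {
    my ($text) = @_;
    return $text =~ $only_tokens ? 1 : 0;
}

for my $text ('dog', 'cat', 'de,en', ' ', ',,,', 'deensvzh') {
    printf "%-10s %s a list of languages\n", "'$text'",
        is_just_a_list_of_languages($text) ? 'is' : 'is NOT';
}
```

Like the code in the reply, this accepts run-together codes such as 'deensvzh' and strings made only of separators. It checks the character-level spec, not that the codes are properly delimited.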
Re: Re: Regex checking text is made up of the keys of a hash. by heezy (Monk) on Mar 01, 2003 at 23:04 UTC
This is so cool, it works so well and it's only 4 lines! Thanks a lot!
Re: Regex checking text is made up of the keys of a hash. by BrowserUk (Patriarch) on Mar 01, 2003 at 04:15 UTC
You might not want the case insensitivity or to allow spaces between the language token and the comma, but I added them for completeness.

#! perl -slw
use strict;

my %validLanguages = (
        "de" => "german",
        "en" => "english",
        "es" => "spanish",
        "fr" => "french",
        "it" => "italian",
        "ja" => "japanese",
        "ko" => "korean",
        "ru" => "russian",
        "sv" => "WHATS THIS",
        "zh" => "WHATS THIS",
        "zh_TW" => "WHATS THIS"
        );

my $re_langs = join'|', keys %validLanguages;
$re_langs = qr[\s*(?:$re_langs)\s*(?:,|$)]io;

sub isOnlyLangs{
    my ($string) = @_;
    $string =~ s[$re_langs][]g;
    $string =~ m[^\s*$];
}

sub isOnlyLangs_{
    (my $s = $_[0]) =~ s[$re_langs][]g;
    !$s;
}

print isOnlyLangs($_) ? 'Passed : ' : 'Failed : ', "'$_'" for
    'de, en, fr, ja',
    ' de , en , fr , ja , ',
    'monkish fr, en',
    'monkish, fr, en',
    'FR',
    'Fr',
    'fr en',
    'fr , en,',
    'zh_tw';

Examine what is said, not who speaks.

1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
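A small sketch of the strip-and-check idea from this reply, written so the leftover text can be inspected. `leftover_after_stripping` is a hypothetical helper name, `@codes` stands in for `keys %validLanguages`, and `\z` is used in place of `$` for the end-of-string test.

```perl
use strict;
use warnings;

# Stand-in for `keys %validLanguages` from the question.
my @codes = qw(de en es fr it ja ko ru sv zh zh_TW);

# One language code with optional surrounding whitespace, followed by a
# comma or the end of the string. Case-insensitive, as in the reply.
my $alternatives = join '|', map { quotemeta($_) } @codes;
my $re_item      = qr/\s*(?:$alternatives)\s*(?:,|\z)/i;

# Hypothetical helper: returns whatever remains after every valid item has
# been removed. Only whitespace (or nothing) should remain for a valid list.
sub leftover_after_stripping {
    my ($text) = @_;
    (my $rest = $text) =~ s/$re_item//g;
    return $rest;
}

for my $text ('de, en, fr, ja', 'monkish fr, en', 'fr en', 'zh_tw') {
    my $rest    = leftover_after_stripping($text);
    my $verdict = $rest =~ /\A\s*\z/ ? 'Passed' : 'Failed';
    print "$verdict : '$text'\n";
}
```

Returning the leftover rather than a boolean makes it easy to report which part of the input was not recognised, for example 'monkish' in the second test string.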
Re: Regex checking text is made up of the keys of a hash. by hv (Prior) on Mar 01, 2003 at 14:19 UTC
It isn't entirely clear how strict you want the match to be: whether any of "defr", "de fr", ",," or "de,,fr" should be accepted.

Let's start by joining the valid languages:

my $re_langs = sprintf '(?:%s)', join '|', keys %validLanguages;

Note that wrapping the alternatives in non-capturing parens allows me to treat `$re_langs` as if it were an atom in the examples below.

Now, if any combination of languages, whitespace and commas is ok:

$text =~ /^(?:$re_langs|\s|,)*\z/;

That allows the empty string and all the examples above. To fail on the empty string you can replace the '*' in the pattern (zero or more) with '+' (one or more).

To allow any combination, but require whitespace or commas separating languages (so that "defr" is not allowed) we require each language to be followed either by a separator or end of string:

$text =~ /^(?:$re_langs(?=\s|,|\z)|\s|,)*\z/;

That pattern can also be made simpler and faster if the language strings always start and end with a word character:

$text =~ /^(?:$re_langs\b|\s|,)*\z/;

If additionally the comma is optional but cannot appear multiple times, so that "de fr" is ok but "de,,fr" is not, one way would be to extend the pattern to say precisely that:

$text =~ m{
    ^
    (?:
        $re_langs \b
    |
        \s
    |
        , (?! \s* , )   # comma not followed by another comma
                        # (not even with intervening whitespace)
    )*
    \z
}x;

However it is probably more efficient to encode the patterns that must follow each language:

$text =~ m{
    ^
    \s* ( ,\s* )?       # allow stuff to precede first language
    (?:
        $re_langs
        (?: \s+ | \s* , \s* | \z )
    )*
    \z
}x;

Finally, if each language must be followed by a comma but the final comma is optional, and all whitespace is optional:

$text =~ /^\s*(?:$re_langs\s*(?:,\s*|\z))*\z/;

I hope that gives you some useful options to consider, but please keep in mind that all the examples above are untested.

Hugo
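To compare the strictness levels side by side, here is a sketch that runs four of the patterns from this reply over the example strings it mentions. The labels (`any_mix`, `separated`, `single_comma`, `comma_list`) and the `accepts` helper are hypothetical names chosen for illustration, and `@codes` stands in for `keys %validLanguages`.

```perl
use strict;
use warnings;

# Stand-in for `keys %validLanguages` from the question.
my @codes    = qw(de en es fr it ja ko ru sv zh zh_TW);
my $re_langs = sprintf '(?:%s)', join '|', @codes;

# Hypothetical labels for four of the strictness levels discussed above.
my %pattern = (
    any_mix      => qr/^(?:$re_langs|\s|,)*\z/,             # anything goes
    separated    => qr/^(?:$re_langs\b|\s|,)*\z/,           # codes must be delimited
    single_comma => qr/^(?:$re_langs\b|\s|,(?!\s*,))*\z/,   # no repeated commas
    comma_list   => qr/^\s*(?:$re_langs\s*(?:,\s*|\z))*\z/, # commas required
);
my @order = qw(any_mix separated single_comma comma_list);

# Returns 1 if $text matches the pattern registered under $level, else 0.
sub accepts {
    my ($level, $text) = @_;
    return $text =~ $pattern{$level} ? 1 : 0;
}

# Print one row per input, one column per strictness level.
my $row = "%-10s" . (" %-12s" x @order) . "\n";
printf $row, 'input', @order;
for my $text ('defr', 'de fr', ',,', 'de,,fr', 'de, fr') {
    printf $row, "'$text'", map { accepts($_, $text) ? 'yes' : 'no' } @order;
}
```

Keeping the patterns in a table like this makes it cheap to add a test string whenever the intended rules change, and to see at once which level matches the behaviour you actually want.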
Re: Regex checking text is made up of the keys of a hash. by heezy (Monk) on Mar 01, 2003 at 23:08 UTC
Thanks to everyone who replied to this posting. I eventually adopted the solution from blakem, but my thanks go to all of you!