echo5 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have two arrays, one is the slurped contents of multiple files and the other is a list of strings. The goal is to check the file array to see if any of the elements of the other array exist.

Right now I am doing this in a very clumsy way:

@files = <FILE>; chomp(@files); @strings = qw(ABC DEF GHI JKL); foreach $file (@files) { my $data = slurp $file; foreach $string(@strings) { if (index(lc($data),lc($string)) ge 0) { print " Match\n"; } }

I am using nested foreach loops to run the search on $data for each element in @strings. I am sure this is very clumsy.

What I want is to put compare @strings against @data once and if there is a match on any element of @strings inside of @data then perform the print of "Match".

Thanks

Replies are listed 'Best First'.
Re: Better way to search array for elements in other array?
by Limbic~Region (Chancellor) on Jan 24, 2011 at 19:53 UTC
    echo5,
    It may be advantageous to look at Regexp::Assemble or Regexp::Trie. Note that, thanks to demerphq, in perl 5.10 and beyond the regexp alternation optimization of a trie is built in.

    In other words, take the smaller of the two lists and turn it into a single regular expression. This turns your algorithm from O(N * M) to O(N). If this still doesn't make sense, please let me know and I will provide a working example that you can tweak to see if it fits your needs.

    Update: The code snippet if (index(lc($data),lc($string)) ge 0) { should likely just be if (index($lc_data, $lc_string) > -1) {. First, you should use > instead of ge when dealing with numerical values (see perlop). Second, repeatedly lowercasing the values is wasted time. Do it once and then compare the copy.

    Cheers - L~R

      This turns your algorithm from O(N * M) to O(N).

      Not exactly. Suppose that L is the length of the strings you are trying to match. With the naive approach the RE engine can take advantage of Boyer-Moore, so its running time should be about O(N * M / L). With the trie approach the number of times that the re engine will start trying to match is O(N), however each attempted match that fails can cost up to O(L), so your running time is potentially as high as O(N * L). (Usually you won't hit this worst case though.)

      If you have a lot of strings to match, this is potentially a big win. If you have a small number of long strings to match, this is probably not a good optimization.

        tilly,
        I was very careful through out my post to say things like "may" and "to see if it fits". I should have said of the algorithm analysis....if you don't look too closely, this turns your algorithm.... I appreciate the clarification. I have other ideas about how this could be optimized for run time but the OP didn't provide enough real sample data.

        Cheers - L~R

      First, you should use > instead of ge when dealing with numerical values

      Well, ">=" would be the numerical equivalent to the string comparison operator "ge", so index(...) >= 0 would be perfectly fine.

      The reason the string comparison operators are not suitable for numerical comparisons is that they don't use different rules than the numerical comparison operators.

      $ perl -E'say 2 >= 10 ? 1 :0' 0 $ perl -E'say 2 ge 10 ? 1 :0' 1 # Because 2 ge 1
        ikegami,
        Well, ">=" would be the numerical equivalent

        You took that out of context. I also changed the 0 to a -1 which made > appropriate. Regarding the explanation for string operators versus numerical operators, I thought a pointer to perlop sufficient.

        Cheers - L~R

Re: Better way to search array for elements in other array?
by kennethk (Abbot) on Jan 24, 2011 at 19:59 UTC
    Rather than using index and lc for the comparison, I would use regular expressions. In your case I'd want the i modifier for a case-insensitive search and the alternator |. Perhaps your code might look something like:

    @files = <FILE>; chomp(@files); my @strings = qw(ABC DEF GHI JKL); my $regex = join '|', map "\Q$_\E", @strings; foreach my $file (@files) { my $data = slurp $file; if ($data =~ /$regex/i) { print " Match\n"; } }

    Note this is untested.

      $regex needs to be string compared to the $regex from the previous pass of the loop, so I don't see the gain unless you use a regex instead of a regex pattern.

      my @strings = qw(ABC DEF GHI JKL); my $regex = join '|', map "\Q$_\E", @strings; $regex = qr/$regex/i; foreach my $file (@files) { my $data = slurp $file; if ($data =~ $regex) { print " Match\n"; } }

      There may be a gain from not using /i.

      my @strings = qw(ABC DEF GHI JKL); my $regex = lc join '|', map "\Q$_\E", @strings; $regex = qr/$regex/; foreach my $file (@files) { my $data = slurp $file; if (lc($data) =~ $regex) { print " Match\n"; } }

      It seems to disable 5.10's trie optimisation, for starters.

      $ perl -Mre=debug -e'qr/foo|bar|baz|boo/' Compiling REx "foo|bar|baz|boo" Final program: 1: TRIEC-EXACT[bf] (13) <foo> <bar> <baz> <boo> 13: END (0) stclass AHOCORASICKC-EXACT[bf] minlen 3 Freeing REx: "foo|bar|baz|boo" $ perl -Mre=debug -e'qr/foo|bar|baz|boo/i' Compiling REx "foo|bar|baz|boo" Final program: 1: BRANCH (4) 2: EXACTF <foo> (13) 4: BRANCH (7) 5: EXACTF <bar> (13) 7: BRANCH (10) 8: EXACTF <baz> (13) 10: BRANCH (FAIL) 11: EXACTF <boo> (13) 13: END (0) minlen 3 Freeing REx: "foo|bar|baz|boo"
      I will try both, just to broaden my skills. As a quick shot the $regex approach worked and eliminated making a loop pass for every element in the array in my example. Many thanks!
Re: Better way to search array for elements in other array?
by SuicideJunkie (Vicar) on Jan 24, 2011 at 20:47 UTC

    What if you take one list of strings and convert it to a HoHoHo...?

    The top level key would be the first letter of each string. At each level, you have either the string to eq against, or a hash of choices for the next letter.

    EG: my $tree = {f => {o => 'foo', i => {r => 'fire', n => 'fine', }, }, b => 'bar', }

    You would know instantly that 'monk' isn't in the second list because !exists($tree->{m}).
    You would know 'balloon' doesn't exist pretty quick because exists($tree->{b}) and ref($tree->{b}) ne 'HASH' and $tree->{b} ne 'balloon'.

    If the second list is something constant without too much duplication early in the string, such as a dictionary, then it should work really well.

      SuicideJunkie,
      Explain to me how that is going to efficiently tell me that:
      "Have a Merry Christmas and a Happy New Year" contains "and"

      The trouble is that it can appear anywhere in the string so you either need to build the data structure for every starting point in the string or repeat the test for as many characters as are in the string (unless I am missing something).

      Cheers - L~R

        I was under the impression that the array elements were entire strings... that the search was for stuff in the other array, rather than for stuff inside the strings in the other array.

        Given that, you're right that it is solving a completely different problem.

      I saw code for this recently in Find Words a la Boggle. You can find the spot by searching for "dictionary tree" in the code.