Better way to search array for elements in other array?

echo5 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Better way to search array for elements in other array? by Limbic~Region (Chancellor) on Jan 24, 2011 at 19:53 UTC
echo5, It may be advantageous to look at Regexp::Assemble or Regexp::Trie. Note that, thanks to demerphq, in perl 5.10 and beyond the regexp alternation optimization of a trie is built in. In other words, take the smaller of the two lists and turn it into a single regular expression. This turns your algorithm from O(N * M) to O(N). If this still doesn't make sense, please let me know and I will provide a working example that you can tweak to see if it fits your needs. Update: The code snippet `if (index(lc($data),lc($string)) ge 0) {` should likely just be `if (index($lc_data, $lc_string) > -1) {`. First, you should use > instead of ge when dealing with numerical values (see perlop). Second, repeatedly lowercasing the values is wasted time. Do it once and then compare the copy. Cheers - L~R	[reply] [d/l] [select]
Re^2: Better way to search array for elements in other array? by tilly (Archbishop) on Jan 24, 2011 at 20:10 UTC
This turns your algorithm from O(N M) to O(N).* Not exactly. Suppose that L is the length of the strings you are trying to match. With the naive approach the RE engine can take advantage of Boyer-Moore, so its running time should be about O(N * M / L). With the trie approach the number of times that the re engine will start trying to match is O(N), however each attempted match that fails can cost up to O(L), so your running time is potentially as high as O(N * L). (Usually you won't hit this worst case though.) If you have a lot of strings to match, this is potentially a big win. If you have a small number of long strings to match, this is probably not a good optimization.	[reply]
Re^3: Better way to search array for elements in other array? by Limbic~Region (Chancellor) on Jan 24, 2011 at 20:20 UTC
tilly, I was very careful through out my post to say things like "may" and "to see if it fits". I should have said of the algorithm analysis....if you don't look too closely, this turns your algorithm.... I appreciate the clarification. I have other ideas about how this could be optimized for run time but the OP didn't provide enough real sample data. Cheers - L~R	[reply]
Re^2: Better way to search array for elements in other array? by ikegami (Patriarch) on Jan 24, 2011 at 20:37 UTC
First, you should use > instead of ge when dealing with numerical values Well, ">=" would be the numerical equivalent to the string comparison operator "ge", so `index(...) >= 0` would be perfectly fine. The reason the string comparison operators are not suitable for numerical comparisons is that they don't use different rules than the numerical comparison operators. `$ perl -E'say 2 >= 10 ? 1 :0' 0 $ perl -E'say 2 ge 10 ? 1 :0' 1 # Because 2 ge 1` [download]	[reply] [d/l] [select]
Re^3: Better way to search array for elements in other array? by Limbic~Region (Chancellor) on Jan 24, 2011 at 21:30 UTC
ikegami, Well, ">=" would be the numerical equivalent You took that out of context. I also changed the 0 to a -1 which made > appropriate. Regarding the explanation for string operators versus numerical operators, I thought a pointer to perlop sufficient. Cheers - L~R	[reply]
Re^4: Better way to search array for elements in other array? by ikegami (Patriarch) on Jan 24, 2011 at 23:07 UTC
Re^5: Better way to search array for elements in other array? by Limbic~Region (Chancellor) on Jan 24, 2011 at 23:23 UTC
Some notes below your chosen depth have not been shown here
Re: Better way to search array for elements in other array? by kennethk (Abbot) on Jan 24, 2011 at 19:59 UTC
Rather than using index and lc for the comparison, I would use regular expressions. In your case I'd want the `i` modifier for a case-insensitive search and the alternator `\|`. Perhaps your code might look something like: `@files = <FILE>; chomp(@files); my @strings = qw(ABC DEF GHI JKL); my $regex = join '\|', map "\Q$_\E", @strings; foreach my $file (@files) { my $data = slurp $file; if ($data =~ /$regex/i) { print " Match\n"; } }` [download] Note this is untested.	[reply] [d/l] [select]
Re^2: Better way to search array for elements in other array? by ikegami (Patriarch) on Jan 24, 2011 at 20:47 UTC
$regex needs to be string compared to the $regex from the previous pass of the loop, so I don't see the gain unless you use a regex instead of a regex pattern. `my @strings = qw(ABC DEF GHI JKL); my $regex = join '\|', map "\Q$_\E", @strings; $regex = qr/$regex/i; foreach my $file (@files) { my $data = slurp $file; if ($data =~ $regex) { print " Match\n"; } }` [download] There may be a gain from not using /i. `my @strings = qw(ABC DEF GHI JKL); my $regex = lc join '\|', map "\Q$_\E", @strings; $regex = qr/$regex/; foreach my $file (@files) { my $data = slurp $file; if (lc($data) =~ $regex) { print " Match\n"; } }` [download] It seems to disable 5.10's trie optimisation, for starters. $ perl -Mre=debug -e'qr/foo\|bar\|baz\|boo/' Compiling REx "foo\|bar\|baz\|boo" Final program: 1: TRIEC-EXACT[bf] (13) <foo> <bar> <baz> <boo> 13: END (0) stclass AHOCORASICKC-EXACT[bf] minlen 3 Freeing REx: "foo\|bar\|baz\|boo" $ perl -Mre=debug -e'qr/foo\|bar\|baz\|boo/i' Compiling REx "foo\|bar\|baz\|boo" Final program: 1: BRANCH (4) 2: EXACTF <foo> (13) 4: BRANCH (7) 5: EXACTF <bar> (13) 7: BRANCH (10) 8: EXACTF <baz> (13) 10: BRANCH (FAIL) 11: EXACTF <boo> (13) 13: END (0) minlen 3 Freeing REx: "foo\|bar\|baz\|boo" [download]	[reply] [d/l] [select]
Re^2: Better way to search array for elements in other array? by echo5 (Novice) on Jan 24, 2011 at 20:08 UTC
I will try both, just to broaden my skills. As a quick shot the $regex approach worked and eliminated making a loop pass for every element in the array in my example. Many thanks!	[reply]
Re: Better way to search array for elements in other array? by SuicideJunkie (Vicar) on Jan 24, 2011 at 20:47 UTC
What if you take one list of strings and convert it to a `HoHoHo...`? The top level key would be the first letter of each string. At each level, you have either the string to `eq` against, or a hash of choices for the next letter. `EG: my $tree = {f => {o => 'foo', i => {r => 'fire', n => 'fine', }, }, b => 'bar', }` [download] You would know instantly that 'monk' isn't in the second list because `!exists($tree->{m})`. You would know 'balloon' doesn't exist pretty quick because `exists($tree->{b}) and ref($tree->{b}) ne 'HASH' and $tree->{b} ne 'balloon'`. If the second list is something constant without too much duplication early in the string, such as a dictionary, then it should work really well.	[reply] [d/l] [select]
Re^2: Better way to search array for elements in other array? by Limbic~Region (Chancellor) on Jan 24, 2011 at 21:49 UTC
SuicideJunkie, Explain to me how that is going to efficiently tell me that: `"Have a Merry Christmas and a Happy New Year" contains "and"` [download] The trouble is that it can appear anywhere in the string so you either need to build the data structure for every starting point in the string or repeat the test for as many characters as are in the string (unless I am missing something). Cheers - L~R	[reply] [d/l]
Re^3: Better way to search array for elements in other array? by SuicideJunkie (Vicar) on Jan 25, 2011 at 16:16 UTC
I was under the impression that the array elements were entire strings... that the search was for stuff in the other array, rather than for stuff inside the strings in the other array. Given that, you're right that it is solving a completely different problem.	[reply]
Re^2: Better way to search array for elements in other array? by eff_i_g (Curate) on Jan 24, 2011 at 21:12 UTC
I saw code for this recently in Find Words a la Boggle. You can find the spot by searching for "dictionary tree" in the code.	[reply]