Re: Better way to search array for elements in other array?
by Limbic~Region (Chancellor) on Jan 24, 2011 at 19:53 UTC
|
echo5,
It may be advantageous to look at Regexp::Assemble or Regexp::Trie. Note that, thanks to demerphq, in perl 5.10 and beyond the regexp alternation optimization of a trie is built in.
In other words, take the smaller of the two lists and turn it into a single regular expression. This turns your algorithm from O(N * M) to O(N). If this still doesn't make sense, please let me know and I will provide a working example that you can tweak to see if it fits your needs.
Update: The code snippet if (index(lc($data),lc($string)) ge 0) { should likely just be if (index($lc_data, $lc_string) > -1) {. First, you should use > instead of ge when dealing with numerical values (see perlop). Second, repeatedly lowercasing the values is wasted time. Do it once and then compare the copy.
| [reply] [d/l] [select] |
|
|
This turns your algorithm from O(N * M) to O(N).
Not exactly. Suppose that L is the length of the strings you are trying to match. With the naive approach the RE engine can take advantage of Boyer-Moore, so its running time should be about O(N * M / L). With the trie approach the number of times that the re engine will start trying to match is O(N), however each attempted match that fails can cost up to O(L), so your running time is potentially as high as O(N * L). (Usually you won't hit this worst case though.)
If you have a lot of strings to match, this is potentially a big win. If you have a small number of long strings to match, this is probably not a good optimization.
| [reply] |
|
|
tilly,
I was very careful through out my post to say things like "may" and "to see if it fits". I should have said of the algorithm analysis....if you don't look too closely, this turns your algorithm.... I appreciate the clarification. I have other ideas about how this could be optimized for run time but the OP didn't provide enough real sample data.
| [reply] |
|
|
First, you should use > instead of ge when dealing with numerical values
Well, ">=" would be the numerical equivalent to the string comparison operator "ge", so index(...) >= 0 would be perfectly fine.
The reason the string comparison operators are not suitable for numerical comparisons is that they don't use different rules than the numerical comparison operators.
$ perl -E'say 2 >= 10 ? 1 :0'
0
$ perl -E'say 2 ge 10 ? 1 :0'
1 # Because 2 ge 1
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|
|
|
|
Re: Better way to search array for elements in other array?
by kennethk (Abbot) on Jan 24, 2011 at 19:59 UTC
|
Rather than using index and lc for the comparison, I would use regular expressions. In your case I'd want the i modifier for a case-insensitive search and the alternator |. Perhaps your code might look something like:
@files = <FILE>;
chomp(@files);
my @strings = qw(ABC DEF GHI JKL);
my $regex = join '|', map "\Q$_\E", @strings;
foreach my $file (@files) {
my $data = slurp $file;
if ($data =~ /$regex/i) {
print " Match\n";
}
}
Note this is untested. | [reply] [d/l] [select] |
|
|
my @strings = qw(ABC DEF GHI JKL);
my $regex = join '|', map "\Q$_\E", @strings;
$regex = qr/$regex/i;
foreach my $file (@files) {
my $data = slurp $file;
if ($data =~ $regex) {
print " Match\n";
}
}
There may be a gain from not using /i.
my @strings = qw(ABC DEF GHI JKL);
my $regex = lc join '|', map "\Q$_\E", @strings;
$regex = qr/$regex/;
foreach my $file (@files) {
my $data = slurp $file;
if (lc($data) =~ $regex) {
print " Match\n";
}
}
It seems to disable 5.10's trie optimisation, for starters.
$ perl -Mre=debug -e'qr/foo|bar|baz|boo/'
Compiling REx "foo|bar|baz|boo"
Final program:
1: TRIEC-EXACT[bf] (13)
<foo>
<bar>
<baz>
<boo>
13: END (0)
stclass AHOCORASICKC-EXACT[bf] minlen 3
Freeing REx: "foo|bar|baz|boo"
$ perl -Mre=debug -e'qr/foo|bar|baz|boo/i'
Compiling REx "foo|bar|baz|boo"
Final program:
1: BRANCH (4)
2: EXACTF <foo> (13)
4: BRANCH (7)
5: EXACTF <bar> (13)
7: BRANCH (10)
8: EXACTF <baz> (13)
10: BRANCH (FAIL)
11: EXACTF <boo> (13)
13: END (0)
minlen 3
Freeing REx: "foo|bar|baz|boo"
| [reply] [d/l] [select] |
|
|
I will try both, just to broaden my skills. As a quick shot the $regex approach worked and eliminated making a loop pass for every element in the array in my example.
Many thanks!
| [reply] |
Re: Better way to search array for elements in other array?
by SuicideJunkie (Vicar) on Jan 24, 2011 at 20:47 UTC
|
What if you take one list of strings and convert it to a HoHoHo...?
The top level key would be the first letter of each string. At each level, you have either the string to eq against, or a hash of choices for the next letter.
EG:
my $tree =
{f =>
{o => 'foo',
i =>
{r => 'fire',
n => 'fine',
},
},
b => 'bar',
}
You would know instantly that 'monk' isn't in the second list because !exists($tree->{m}). You would know 'balloon' doesn't exist pretty quick because exists($tree->{b}) and ref($tree->{b}) ne 'HASH' and $tree->{b} ne 'balloon'.
If the second list is something constant without too much duplication early in the string, such as a dictionary, then it should work really well. | [reply] [d/l] [select] |
|
|
SuicideJunkie,
Explain to me how that is going to efficiently tell me that:
"Have a Merry Christmas and a Happy New Year"
contains
"and"
The trouble is that it can appear anywhere in the string so you either need to build the data structure for every starting point in the string or repeat the test for as many characters as are in the string (unless I am missing something).
| [reply] [d/l] |
|
|
I was under the impression that the array elements were entire strings... that the search was for stuff in the other array, rather than for stuff inside the strings in the other array.
Given that, you're right that it is solving a completely different problem.
| [reply] |
|
|
I saw code for this recently in Find Words a la Boggle. You can find the spot by searching for "dictionary tree" in the code.
| [reply] |