Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Good evening

This is a piece of my code that is taking up way too much time. The contents of @data can be as large as 60 meg.

foreach my $record (@data) {
    foreach my $data (@m_info) {
        if ($record->[1] =~ /$data->[0]/) {
            $record->[1] = $data->[3];
            $record->[1] = join ("", $record->[1], "_", $data->[1]);
        }
    }
}
It needs to iterate through one array (@data) and, for each record, iterate through another array (@m_info) looking for a match.

I have thought about using a hash keyed by the record info in the @data array, but @m_info is an array of arrays.
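For what it's worth, the array-of-arrays shape of @m_info doesn't rule out a hash: the hash value can simply be the whole arrayref. A minimal sketch, assuming a made-up [key, suffix, unused, replacement] row layout inferred from the indices used in the code above:

```perl
use strict;
use warnings;

# Hypothetical @m_info rows: [key, suffix, unused, replacement].
my @m_info = (
    [ 'AAA', 'east', undef, 'alpha' ],
    [ 'BBB', 'west', undef, 'beta'  ],
);

# Key each row by its first element; the value is the whole arrayref.
my %m_info = map { $_->[0] => $_ } @m_info;

my $row = $m_info{'BBB'};
print "$row->[3]_$row->[1]\n";   # prints "beta_west"
```

Note that this only gives exact-key lookups; the regex match in the original loop is a substring test, which a plain hash cannot replace on its own.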

Your thoughts are greatly appreciated !

Replies are listed 'Best First'.
Re: More efficient Data Structure needed
by BrowserUk (Patriarch) on Dec 18, 2002 at 03:05 UTC

    This might work. It compiles cleanly, but I'd need your data to test it.

    #! perl -slw
    use strict;

    my (@m_info, @data);

    #! Build a hash keyed by the first elements of the info data.
    my %m_info = map { $_->[0] => $_ } @m_info;

    #! Build a (this|that|the other) regex from the search keys (method by [theDamian]).
    #! Updated to correct *my* errors when adapting [theDamian]'s code, as spotted by [Aristotle]++ below.
    my $regex = qr/(@{[ join '|', map { "\Q$_\E" } keys %m_info ]})/;

    for my $record (@data) {
        if ( $record->[1] =~ $regex ) {
            #! $1 contains the key that matched.
            #! The corresponding array can be looked up from the hash.
            my $info = $m_info{$1};
            #! do the rest of your stuff: $info->[3]...
        }
    }
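    To make the approach concrete, here is a self-contained run with invented sample data; the [pattern, suffix, unused, replacement] layout of @m_info and the [id, text] layout of @data are assumptions inferred from the original loops:

```perl
use strict;
use warnings;

# Invented sample data; adjust the indices to your real records.
my @m_info = (
    [ 'foo', 's1', undef, 'FOO' ],
    [ 'bar', 's2', undef, 'BAR' ],
);
my @data = (
    [ 1, 'xxfooxx' ],
    [ 2, 'barbar'  ],
    [ 3, 'nothing' ],
);

my %m_info = map { $_->[0] => $_ } @m_info;
my $regex  = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

for my $record (@data) {
    if ( $record->[1] =~ $regex ) {
        my $info = $m_info{$1};                  # row whose key matched
        $record->[1] = "$info->[3]_$info->[1]";  # same rewrite as the OP's join
    }
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "BAR_s2", "nothing"
```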

    Examine what is said, not who speaks.

      Very close, except for a few little mistakes: "(\Q$_)" must be "(\Q$_\E)", or else the closing paren will be escaped. But you don't want parens there anyway; it has to be one single capture group:

      my $rx = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

      Another, significantly more involved method that I came up with the first time I encountered such a problem:
      my $rx = qr/@{[ join '|', map "(\Q$_->[0]\E)", @m_info ]}/;
      for my $record (@data) {
          if (my @match = $record->[1] =~ $rx) {
              my ($m) = grep defined $match[$_], 0 .. $#match;
              $record->[1] = join "_", $m_info[$m]->[3], $m_info[$m]->[1];
          }
      }
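      A runnable sketch of that capture-group trick with hypothetical sample data (same assumed [pattern, suffix, unused, replacement] layout as the original code); the index of the defined capture identifies the matching @m_info row:

```perl
use strict;
use warnings;

# Hypothetical rows: [pattern, suffix, unused, replacement].
my @m_info = (
    [ 'foo', 's1', undef, 'FOO' ],
    [ 'bar', 's2', undef, 'BAR' ],
);
my @data = ( [ 1, 'say bar twice' ] );

# One capture group per pattern, in @m_info order.
my $rx = qr/@{[ join '|', map "(\Q$_->[0]\E)", @m_info ]}/;

for my $record (@data) {
    if ( my @match = $record->[1] =~ $rx ) {
        # Only the group that matched is defined; its index is the row number.
        my ($m) = grep defined $match[$_], 0 .. $#match;
        $record->[1] = join "_", $m_info[$m][3], $m_info[$m][1];
    }
}

print "$data[0][1]\n";   # prints "BAR_s2"
```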
      The hash method is definitely preferable, though. The trade-offs between the two: with a large @m_info, the plain array takes less memory, while the hash %m_info provides much faster lookups than the grep loop. So long as memory is not a constraint, the hash is the better choice, as the code is both faster and more readable.

      Makeshifts last the longest.

Re: More efficient Data Structure needed
by pg (Canon) on Dec 18, 2002 at 02:11 UTC
    There is a lot of waste here, and plenty of improvement you can make. At the very least, add a last statement so you stop scanning @m_info once a match is found:
    foreach my $record (@data) {
        foreach my $data (@m_info) {
            if ($record->[1] =~ /$data->[0]/) {
                $record->[1] = $data->[3];
                $record->[1] = join ("", $record->[1], "_", $data->[1]);
                last; # jump out of the inner loop once a match is found
            }
        }
    }
    Another suggestion: sort both arrays first, then you only need to go through each of them once. The cost becomes additive rather than multiplicative.
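    If the keys only ever need exact equality (not the substring matching the original regex implies), the sorted-merge idea can be sketched like this, with invented data:

```perl
use strict;
use warnings;

# Invented data: each @data record is [id, key]; each @m_info row is
# [key, suffix, unused, replacement]. Exact key equality is assumed.
my @data   = ( [ 1, 'gamma' ], [ 2, 'alpha' ], [ 3, 'beta' ] );
my @m_info = ( [ 'gamma', 's3', undef, 'G' ], [ 'beta', 's2', undef, 'B' ] );

# Sort both once, then walk them together with a single cursor.
@data   = sort { $a->[1] cmp $b->[1] } @data;
@m_info = sort { $a->[0] cmp $b->[0] } @m_info;

my $i = 0;
for my $record (@data) {
    # Advance the @m_info cursor until its key is >= the record's key.
    $i++ while $i < @m_info and $m_info[$i][0] lt $record->[1];
    last if $i >= @m_info;
    if ( $m_info[$i][0] eq $record->[1] ) {
        $record->[1] = "$m_info[$i][3]_$m_info[$i][1]";
    }
}

print "$_->[1]\n" for @data;   # prints "alpha", "B_s2", "G_s3"
```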

    Well, of course the sort itself takes time. You have to make a decision based on the nature of your keys.

    In this kind of case, reading a university textbook on algorithms and data structures would help you more than a Perl book would. I am serious.
Re: More efficient Data Structure needed
by dws (Chancellor) on Dec 18, 2002 at 05:50 UTC
    The contents of @data are as large as 60 meg

    Assuming you mean there can be 60 million entries in @data, you may be suffering from memory thrash while you're rewriting your data records. Is there any way that you can do the record rewriting while you're building up @data?
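    One way to sketch that: apply the rewrite as each line is read, so the records never need a second pass. The comma-separated layout and the in-memory filehandle standing in for the real file are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical match table: [pattern, suffix, unused, replacement].
my @m_info = ( [ 'foo', 's1', undef, 'FOO' ] );
my %m_info = map { $_->[0] => $_ } @m_info;
my $regex  = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

# An in-memory filehandle stands in for the real 60 meg input file.
open my $fh, '<', \"1,xxfooxx\n2,none\n" or die $!;

my @data;
while ( my $line = <$fh> ) {
    chomp $line;
    my @record = split /,/, $line;
    # Rewrite the field while building the record, not in a later pass.
    if ( $record[1] =~ $regex ) {
        my $info = $m_info{$1};
        $record[1] = "$info->[3]_$info->[1]";
    }
    push @data, \@record;
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "none"
```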

Re: More efficient Data Structure needed
by djantzen (Priest) on Dec 18, 2002 at 02:15 UTC

    A lookup hash is a good idea, as hash lookups are faster than iterating through an array. Also, you can then do a normal string equality test (eq) rather than compiling a regular expression and firing up the regex engine. Finally, you can use normal string interpolation rather than join, since you're doing nothing fancy there.

    It looks like @data is likewise two-dimensional, so assuming that, here's something that might work for you:

    # key each array by its first entry
    my %lookup = map { $_->[0] => $_ } @m_info;

    foreach my $record (@data) {
        my $data = $lookup{$record->[1]};
        $record->[1] = "$data->[3]_$data->[1]" if $data;
    }

      The problem with that is that he isn't doing a one-for-one comparison between the two things; he's checking to see whether one contains the other (hence the regex). Which makes the whole problem rather harder.

      if ($record->[1] =~ /$data->[0]/) {

      Examine what is said, not who speaks.

        Actually, that isn't made explicitly clear in the original post, although I agree the absence of anchors certainly indicates that you're correct.

        Tie::Hash::Regex would work in that case, but unfortunately all the efficiency of the hash would be lost as that module simply iterates over the keys in the hash doing a regex match :^\

        If he only needs substring matching, then index is faster and safer than regular expressions. He wouldn't have to worry about escaping characters or compiling the regex.
        if (index($record->[1], $data->[0]) >= 0) {
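        A runnable sketch of the index-based variant, with invented sample data in the same assumed layout as the original loops:

```perl
use strict;
use warnings;

# Invented data: [pattern, suffix, unused, replacement] rows and [id, text] records.
my @m_info = ( [ 'foo', 's1', undef, 'FOO' ] );
my @data   = ( [ 1, 'xxfooxx' ], [ 2, 'none' ] );

for my $record (@data) {
    for my $data (@m_info) {
        # index() does a plain substring search: no escaping, no regex engine.
        if ( index( $record->[1], $data->[0] ) >= 0 ) {
            $record->[1] = "$data->[3]_$data->[1]";
            last;
        }
    }
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "none"
```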
Re: More efficient Data Structure needed
by Monky Python (Scribe) on Dec 18, 2002 at 07:43 UTC
    Hi,
    you should change the inner if to:
    if ($record->[1] =~ /$data->[0]/) {
        $record->[1] = join ("_", $data->[3], $data->[1]);
        last;
    }
    MP