Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Good evening

This is a piece of my code that is taking up way too much time. The contents of @data can be as large as 60 meg.

foreach my $record (@data) {
    foreach my $data (@m_info) {
        if ($record->[1] =~ /$data->[0]/) {
            $record->[1] = $data->[3];
            $record->[1] = join ("", $record->[1], "_", $data->[1]);
        }
    }
}
It needs to iterate through one array (@data) and, for each record, iterate through another array (@m_info) looking for a match.

I have thought about using a hash keyed by the record info in the @data array, but @m_info is an array of arrays.
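For what it's worth, the array-of-arrays shape of @m_info doesn't rule out a hash: the hash value can simply be the whole arrayref. A minimal sketch, assuming a made-up [key, suffix, unused, replacement] row layout inferred from the indices used in the code above:

```perl
use strict;
use warnings;

# Hypothetical @m_info rows: [key, suffix, unused, replacement].
my @m_info = (
    [ 'AAA', 'east', undef, 'alpha' ],
    [ 'BBB', 'west', undef, 'beta'  ],
);

# Key each row by its first element; the value is the whole arrayref.
my %m_info = map { $_->[0] => $_ } @m_info;

my $row = $m_info{'BBB'};
print "$row->[3]_$row->[1]\n";   # prints "beta_west"
```

Note that this only gives exact-key lookups; the regex match in the original loop is a substring test, which a plain hash cannot replace on its own.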

Your thoughts are greatly appreciated !

Replies are listed 'Best First'.
Re: More efficient Data Structure needed
by BrowserUk (Patriarch) on Dec 18, 2002 at 03:05 UTC

    This might work. It compiles cleanly, but I'd need your data to test it.

    #! perl -slw
    use strict;

    my (@m_info, @data);

    #! Build a hash keyed by the first elements of the info data.
    my %m_info = map { $_->[0] => $_ } @m_info;

    #! Build a (this|that|the other) regex from the search keys (method by [theDamian]).
    #! Updated to correct *my* errors when adapting [theDamian]'s code, as spotted by [Aristotle]++ below.
    my $regex = qr/(@{[ join '|', map { "\Q$_\E" } keys %m_info ]})/;

    for my $record (@data) {
        if ( $record->[1] =~ $regex ) {
            #! $1 contains the key that matched.
            #! The corresponding array can be looked up from the hash.
            my $info = $m_info{$1};
            #! do the rest of your stuff: $info->[3]...
        }
    }
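    To make the approach concrete, here is a self-contained run with invented sample data; the [pattern, suffix, unused, replacement] layout of @m_info and the [id, text] layout of @data are assumptions inferred from the original loops:

```perl
use strict;
use warnings;

# Invented sample data; adjust the indices to your real records.
my @m_info = (
    [ 'foo', 's1', undef, 'FOO' ],
    [ 'bar', 's2', undef, 'BAR' ],
);
my @data = (
    [ 1, 'xxfooxx' ],
    [ 2, 'barbar'  ],
    [ 3, 'nothing' ],
);

my %m_info = map { $_->[0] => $_ } @m_info;
my $regex  = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

for my $record (@data) {
    if ( $record->[1] =~ $regex ) {
        my $info = $m_info{$1};                  # row whose key matched
        $record->[1] = "$info->[3]_$info->[1]";  # same rewrite as the OP's join
    }
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "BAR_s2", "nothing"
```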

    Examine what is said, not who speaks.

      Very close, except for a few little mistakes: "(\Q$_)" must be "(\Q$_\E)", or else the closing paren will be escaped. But you don't want parens there anyway; it has to be one single capture group:

      my $rx = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

      Another, significantly more involved method that I came up with the first time I encountered such a problem:
      my $rx = qr/@{[ join '|', map "(\Q$_->[0]\E)", @m_info ]}/;
      for my $record (@data) {
          if (my @match = $record->[1] =~ $rx) {
              my ($m) = grep defined $match[$_], 0 .. $#match;
              $record->[1] = join "_", $m_info[$m]->[3], $m_info[$m]->[1];
          }
      }
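      A runnable sketch of that capture-group trick with hypothetical sample data (same assumed [pattern, suffix, unused, replacement] layout as the original code); the index of the defined capture identifies the matching @m_info row:

```perl
use strict;
use warnings;

# Hypothetical rows: [pattern, suffix, unused, replacement].
my @m_info = (
    [ 'foo', 's1', undef, 'FOO' ],
    [ 'bar', 's2', undef, 'BAR' ],
);
my @data = ( [ 1, 'say bar twice' ] );

# One capture group per pattern, in @m_info order.
my $rx = qr/@{[ join '|', map "(\Q$_->[0]\E)", @m_info ]}/;

for my $record (@data) {
    if ( my @match = $record->[1] =~ $rx ) {
        # Only the group that matched is defined; its index is the row number.
        my ($m) = grep defined $match[$_], 0 .. $#match;
        $record->[1] = join "_", $m_info[$m][3], $m_info[$m][1];
    }
}

print "$data[0][1]\n";   # prints "BAR_s2"
```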
      The hash method is definitely preferable, though. The trade-offs between the two: with a large @m_info, the plain array takes less memory, while the hash %m_info provides much faster lookups than the grep loop. So long as memory is not a constraint, the hash is the better choice, as the code is both faster and more readable.

      Makeshifts last the longest.

Re: More efficient Data Structure needed
by pg (Canon) on Dec 18, 2002 at 02:11 UTC
    There is a lot of waste here, and plenty of improvement you can make. At the very least, add a last statement so you stop scanning @m_info once a match is found:
    foreach my $record (@data) {
        foreach my $data (@m_info) {
            if ($record->[1] =~ /$data->[0]/) {
                $record->[1] = $data->[3];
                $record->[1] = join ("", $record->[1], "_", $data->[1]);
                last; # jump out of the inner loop once a match is found
            }
        }
    }
    Another suggestion: sort both arrays first, then you only need to go through each of them once. The cost becomes additive rather than multiplicative.
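    If the keys only ever need exact equality (not the substring matching the original regex implies), the sorted-merge idea can be sketched like this, with invented data:

```perl
use strict;
use warnings;

# Invented data: each @data record is [id, key]; each @m_info row is
# [key, suffix, unused, replacement]. Exact key equality is assumed.
my @data   = ( [ 1, 'gamma' ], [ 2, 'alpha' ], [ 3, 'beta' ] );
my @m_info = ( [ 'gamma', 's3', undef, 'G' ], [ 'beta', 's2', undef, 'B' ] );

# Sort both once, then walk them together with a single cursor.
@data   = sort { $a->[1] cmp $b->[1] } @data;
@m_info = sort { $a->[0] cmp $b->[0] } @m_info;

my $i = 0;
for my $record (@data) {
    # Advance the @m_info cursor until its key is >= the record's key.
    $i++ while $i < @m_info and $m_info[$i][0] lt $record->[1];
    last if $i >= @m_info;
    if ( $m_info[$i][0] eq $record->[1] ) {
        $record->[1] = "$m_info[$i][3]_$m_info[$i][1]";
    }
}

print "$_->[1]\n" for @data;   # prints "alpha", "B_s2", "G_s3"
```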

    Well, of course the sort itself takes time. You have to make a decision based on the nature of your keys.

    In this kind of case, reading a university textbook on algorithms and data structures would help you more than a Perl book would. I am serious.
Re: More efficient Data Structure needed
by dws (Chancellor) on Dec 18, 2002 at 05:50 UTC
    The contents of @data are as large as 60 meg

    Assuming you mean there can be 60 million entries in @data, you may be suffering from memory thrash while you're rewriting your data records. Is there any way that you can do the record rewriting while you're building up @data?
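    One way to sketch that: apply the rewrite as each line is read, so the records never need a second pass. The comma-separated layout and the in-memory filehandle standing in for the real file are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical match table: [pattern, suffix, unused, replacement].
my @m_info = ( [ 'foo', 's1', undef, 'FOO' ] );
my %m_info = map { $_->[0] => $_ } @m_info;
my $regex  = qr/(@{[ join '|', map quotemeta, keys %m_info ]})/;

# An in-memory filehandle stands in for the real 60 meg input file.
open my $fh, '<', \"1,xxfooxx\n2,none\n" or die $!;

my @data;
while ( my $line = <$fh> ) {
    chomp $line;
    my @record = split /,/, $line;
    # Rewrite the field while building the record, not in a later pass.
    if ( $record[1] =~ $regex ) {
        my $info = $m_info{$1};
        $record[1] = "$info->[3]_$info->[1]";
    }
    push @data, \@record;
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "none"
```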

Re: More efficient Data Structure needed
by djantzen (Priest) on Dec 18, 2002 at 02:15 UTC

    A lookup hash is a good idea, as hash lookups are faster than iterating through an array. Also, you can then do a normal string equality test (eq) rather than compiling a regular expression and firing up the regex engine. Finally, you can use normal string interpolation rather than join, since you're doing nothing fancy there.

    It looks like @data is likewise two-dimensional, so assuming that, here's something that might work for you:

    # key each array by its first entry
    my %lookup = map { $_->[0] => $_ } @m_info;

    foreach my $record (@data) {
        my $data = $lookup{$record->[1]};
        $record->[1] = "$data->[3]_$data->[1]" if $data;
    }

      The problem with that is that he isn't doing a one-for-one comparison between the two things; he's checking to see whether one contains the other (hence the regex). Which makes the whole problem rather harder.

      if ($record->[1] =~ /$data->[0]/) {

      Examine what is said, not who speaks.

        Actually, that isn't made explicitly clear in the original post, although I agree the absence of anchors certainly indicates that you're correct.

        Tie::Hash::Regex would work in that case, but unfortunately all the efficiency of the hash would be lost as that module simply iterates over the keys in the hash doing a regex match :^\

        If he only needs substring matching, then index is faster and safer than regular expressions. He wouldn't have to worry about escaping characters or compiling the regex.
        if (index($record->[1], $data->[0]) >= 0) {
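        A runnable sketch of the index-based variant, with invented sample data in the same assumed layout as the original loops:

```perl
use strict;
use warnings;

# Invented data: [pattern, suffix, unused, replacement] rows and [id, text] records.
my @m_info = ( [ 'foo', 's1', undef, 'FOO' ] );
my @data   = ( [ 1, 'xxfooxx' ], [ 2, 'none' ] );

for my $record (@data) {
    for my $data (@m_info) {
        # index() does a plain substring search: no escaping, no regex engine.
        if ( index( $record->[1], $data->[0] ) >= 0 ) {
            $record->[1] = "$data->[3]_$data->[1]";
            last;
        }
    }
}

print "$_->[1]\n" for @data;   # prints "FOO_s1", "none"
```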
Re: More efficient Data Structure needed
by Monky Python (Scribe) on Dec 18, 2002 at 07:43 UTC
    Hi,
    you should change the inner if to:
    if ($record->[1] =~ /$data->[0]/) {
        $record->[1] = join ("_", $data->[3], $data->[1]);
        last;
    }
    MP