THRAK has asked for the wisdom of the Perl Monks concerning the following question:

I need to iterate over an array and find any entries that match the beginning of entries in a second array, then combine each such entry with the entry before it. From the Cookbook I know how to trim an array down for exact matches (4.7, pg. 104), but not for the partial matches I need to find and move. I've been able to accomplish this with the following code, but I have to believe there are far more efficient ways of doing it. I've tried to reduce the iterations by purging the moved entries, which helps, but I've been unable to do the same with the search-and-destroy list. Basically I'm looking for a way to do this more efficiently, because as the arrays grow, the time this takes will go up dramatically.

Thanks.
#!/usr/local/bin/perl -w
use strict;

my @text_remove = ('[TEXT-2]', '[TEXT-4]', '[TEXT-7]');
my @xlate_data = ('[TEXT-1] Data Test ',
                  '[TEXT-2] Second line tacked onto #1',
                  '[TEXT-3] Line #3',
                  '[TEXT-4] 4th line goes with #3',
                  '[TEXT-5] five-five-FIVE',
                  '[TEXT-6] Five was a Zappa reference...',
                  '[TEXT-7] This is a longer entry. But really not much different. I should be part of #6');

my $rm_pos = 0;
foreach my $move (@text_remove) {
    print "O: $rm_pos ~ $move\n";
    my $xlate_pos = 0;
    foreach my $data (@xlate_data) {
        if ($data =~ /^\Q$move\E/) {
            print "I: $rm_pos ~ $move ~ $data\n\n";
            $data =~ s/^\Q$move\E\s*/ /;
            $xlate_data[$xlate_pos - 1] .= $data;
            splice(@xlate_data, $xlate_pos, 1); # remove entry to shorten loops
        }
        $xlate_pos++;
        # splice(@text_remove, $rm_pos, 1); # remove token entry to shorten loops
    }
    $rm_pos++;
}

foreach (@xlate_data) { print "NOW: $_\n"; }


-THRAK
www.polarlava.com

Replies are listed 'Best First'.
Re: Reducing Array Iterations
by blakem (Monsignor) on Aug 16, 2001 at 01:55 UTC
    Try replacing the main loop in your code with:
    my $patt = join('|', map { "(\Q$_\E)" } @text_remove);
    my @tmp;
    s/^$patt//o ? $tmp[-1] .= $_ : push(@tmp, $_) for @xlate_data;
    @xlate_data = @tmp;
    Sorry, I golfed it a bit....

    Update: Ok, slight explanation....

    We generate a pattern that will match any of the elements in @text_remove. In general, this is better than looping over each pattern individually; for one thing, it allows you to use the /o modifier to avoid recompiling the regex over and over again.

    We then loop through the @xlate_data array looking for this pattern. If we find it, we lop off whatever matched and glue what's left onto the last field in our temporary array (that's what $tmp[-1] .= $_ does). On the other hand, without a match we simply push the value onto the temporary array.

    Finally we overwrite @xlate_data with the @tmp array. This probably isn't how I would have coded it, but since the answer overwrites the data in the original script, I did it here as well. (I also munged @xlate_data in place, since I knew I was going to trash it anyway.)
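For anyone who wants to run it, here is a sketch that wraps the one-pass loop in a complete script using the sample data from the root node (the wrapper and print loop are mine, not part of the original answer):

```perl
#!/usr/bin/perl
# Sketch: the joined-pattern one-pass approach, made self-contained
# with the root node's sample data so it can be run directly.
use strict;
use warnings;

my @text_remove = ('[TEXT-2]', '[TEXT-4]', '[TEXT-7]');
my @xlate_data = (
    '[TEXT-1] Data Test ',
    '[TEXT-2] Second line tacked onto #1',
    '[TEXT-3] Line #3',
    '[TEXT-4] 4th line goes with #3',
    '[TEXT-5] five-five-FIVE',
    '[TEXT-6] Five was a Zappa reference...',
    '[TEXT-7] This is a longer entry. But really not much different. I should be part of #6',
);

# one alternation that matches any of the removal keys
my $patt = join '|', map { "(\Q$_\E)" } @text_remove;

# single pass: strip a matching key and glue the remainder onto the
# previously kept line; otherwise keep the line as-is
my @tmp;
s/^$patt//o ? ($tmp[-1] .= $_) : push(@tmp, $_) for @xlate_data;
@xlate_data = @tmp;

print "NOW: $_\n" for @xlate_data;
```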

    -Blake

      This is definitely cleaner than my approach, which is why I asked the question (and why I will use it). But I ran some benchmarks on just this section of code (derived from a 260K file) and it's not any faster. This test case is much larger than most of what I would be passing through it, so it's not terribly significant, but I thought your approach would be more efficient. Kind of interesting, at least to me. Thanks for your assistance.

      Original:
      Time: 27 wallclock secs (25.88 usr + 0.00 sys = 25.88 CPU)
      Time: 26 wallclock secs (25.75 usr + 0.00 sys = 25.75 CPU)
      Time: 27 wallclock secs (25.73 usr + 0.00 sys = 25.73 CPU)
      Time: 27 wallclock secs (25.88 usr + 0.00 sys = 25.88 CPU)
      Time: 26 wallclock secs (25.75 usr + 0.00 sys = 25.75 CPU)

      New:
      Time: 26 wallclock secs (25.97 usr + 0.00 sys = 25.97 CPU)
      Time: 26 wallclock secs (26.13 usr + 0.00 sys = 26.13 CPU)
      Time: 26 wallclock secs (26.39 usr + 0.00 sys = 26.39 CPU)
      Time: 27 wallclock secs (26.30 usr + 0.00 sys = 26.30 CPU)
      Time: 27 wallclock secs (26.31 usr + 0.00 sys = 26.31 CPU)
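      For reference, a sketch of how such a comparison might be scripted with the core Benchmark module; the synthetic data and iteration count below are made up for illustration and stand in for the real 260K file, so the absolute numbers won't match the ones above:

```perl
#!/usr/bin/perl
# Sketch: benchmarking the nested-loop vs. joined-pattern approaches
# with the core Benchmark module on made-up data.
use strict;
use warnings;
use Benchmark qw(timethese);

my @text_remove = ('[TEXT-2]', '[TEXT-4]', '[TEXT-7]');
my @base = map { "[TEXT-$_] line $_" } 1 .. 500;

# nested-loop approach: scan the data once per removal key
sub nested {
    my @data = @base;
    for my $move (@text_remove) {
        for my $i (0 .. $#data) {
            if ($data[$i] =~ /^\Q$move\E/) {
                (my $rest = $data[$i]) =~ s/^\Q$move\E\s*/ /;
                $data[$i - 1] .= $rest;
                splice @data, $i, 1;
                last;    # each key occurs once in this data
            }
        }
    }
    return @data;
}

# joined-pattern approach: one alternation, one pass
my $patt = join '|', map { quotemeta } @text_remove;
sub joined {
    my @tmp;
    for (@base) {
        my $copy = $_;
        if ($copy =~ s/^(?:$patt)\s*/ /) { $tmp[-1] .= $copy }
        else                             { push @tmp, $copy }
    }
    return @tmp;
}

timethese(500, { nested => \&nested, joined => \&joined });
```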

      -THRAK
      www.polarlava.com
        That's probably because the joined pattern is really overkill in this situation. With the gigantic pattern, we wind up examining each row for each pattern, which requires an inordinate amount of backtracking. If you know that all your tags at the front will be similar (like the ones in the sample data are), you wind up doing a lot more work than necessary. See if the code below runs any faster...

        my @tmp;
        my %text_remove = map { $_ => 1 } @text_remove;
        for (@xlate_data) {
            s/^(\[TEXT\-\d+\])//;
            $text_remove{$1} ? $tmp[-1] .= "$_" : push(@tmp, "$1$_");
        }
        @xlate_data = @tmp;

        Here, we grab the tag off the front and compare it with a hash of the "special" tags. This should be much quicker than running a gigantic pattern against the data.

        -Blake

Re: Reducing Array Iterations
by arturo (Vicar) on Aug 16, 2001 at 02:40 UTC

    Hmm ... if you know that the key parts of your data are at the beginning of each element, and that every line begins with some sort of tag along the lines your sample data has, the following trick might work reasonably well (provided you have a short list of keys to remove). No claims about efficiency are made off the top of my head, but you'll iterate through the array only once with this:

    # build a regex to match just your keys
    my $remove = "^(?:" . join("|", map { quotemeta $_ } @text_remove) . ") ";
    # print $remove out if you're not getting what you expect
    # strip the patterns from the beginning of each line.
    @xlate_data = map { s/$remove//; $_; } @xlate_data;

    OK, that's a start ... the funky bits have been *removed* from the start of each line that matches. Now, how to pack the array up? I don't like the idea of modifying the array in place; so why don't we just join the elements of the array into a string and use a regex on that string to remove the delimiter in front of an element that *doesn't* begin with a "[" character?

    # use whatever delimiter makes sense here. I'm assuming
    # you don't have newlines at the ends of the elements already.
    my $full_string = join "\n", @xlate_data;
    # since you know all the [ are gone from
    # the elements that matched, and that they will be present
    # in the ones that didn't match, a newline followed by
    # something other than a [ is a line you want to join.
    # so strip \n's not followed by a [
    $full_string =~ s/\n([^[])/$1/g;
    # back to an array
    @xlate_data = split /\n/, $full_string;

    Yeah, that seems kinda wacky to me too =) But something like that might do the trick, as long as your data is well-behaved.
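    Putting the two fragments above together, a runnable sketch might look like this (sample data copied from the root node; note that lines glued this way keep no separating space unless the previous line already ends in one):

```perl
#!/usr/bin/perl
# Sketch: strip the keys, then join/regex/split to glue continuation
# lines onto the line before them. Data is from the root node.
use strict;
use warnings;

my @text_remove = ('[TEXT-2]', '[TEXT-4]', '[TEXT-7]');
my @xlate_data = (
    '[TEXT-1] Data Test ',
    '[TEXT-2] Second line tacked onto #1',
    '[TEXT-3] Line #3',
    '[TEXT-4] 4th line goes with #3',
    '[TEXT-5] five-five-FIVE',
    '[TEXT-6] Five was a Zappa reference...',
    '[TEXT-7] This is a longer entry. But really not much different. I should be part of #6',
);

# build a regex matching any removal key plus its trailing space
my $remove = "^(?:" . join("|", map { quotemeta $_ } @text_remove) . ") ";

# strip the keys from the lines that carry them
@xlate_data = map { s/$remove//; $_ } @xlate_data;

# join into one string; a newline followed by anything other than
# '[' marks a stripped line, so glue it onto the line before it
my $full_string = join "\n", @xlate_data;
$full_string =~ s/\n([^[])/$1/g;
@xlate_data = split /\n/, $full_string;

print "NOW: $_\n" for @xlate_data;
```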

    HTH!

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
      arturo, I unfortunately have some other things I need to work on now, but I will ponder your input further and let you know if I find anything interesting/useful. Thank you.

      -THRAK
      www.polarlava.com
Re: Reducing Array Iterations
by I0 (Priest) on Aug 16, 2001 at 02:34 UTC
    my @text_remove = ('[TEXT-2]', '[TEXT-4]', '[TEXT-7]');
    my @xlate_data = ('[TEXT-1] Data Test ',
                      '[TEXT-2] Second line tacked onto #1',
                      '[TEXT-3] Line #3',
                      '[TEXT-4] 4th line goes with #3',
                      '[TEXT-5] five-five-FIVE',
                      '[TEXT-6] Five was a Zappa reference...',
                      '[TEXT-7] This is a longer entry. But really not much different. I should be part of #6');
    my %text_remove;
    @text_remove{@text_remove} = @text_remove;
    foreach my $xlate_pos (1 .. $#xlate_data) {
        my $data = $xlate_data[$xlate_pos];
        if ($text_remove{ ($data =~ /^(\[[\w-]+\])/)[0] }) {
            $data =~ s/^\[[\w-]+\]\s*/ /;
            $xlate_data[$xlate_pos - 1] .= $data;
            $xlate_data[$xlate_pos] = '';
        }
    }
    @xlate_data = grep { $_ } @xlate_data;
    foreach (@xlate_data) { print "NOW: $_\n"; }