coldy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have an array
@array = ("x y.g z 123", "a b.f c 456", "d b.c w 321")
Each string in @array is always of the same form but with different letters and numbers. If I split the strings by white space, then the important elements are the second ones - ie "y.g", "b.f" and "b.c".

More specifically, what is important is the first component of these strings, ie "y", "b" and "b"

Now I need to filter @array based on these components mentioned above. The filtering is according to another array

@include = ("b","q")
so since $array[0] does not contain a "b" or "q" after splitting and looking at the second element it is removed.

I can do this with a series of splits, and testing the components exist in a test hash

my %testHash = map { $_,1} @include; foreach my $tmp (@array){ my @tokens = split /\s+/,$tmp; my ($test,$rubbish) = split /\./, $tokens[1]; push (@keep, $tmp) if (exists $testHash{$test}); }
But it is very slow when repeated on many different arrays. Is there are more efficient way to use consecutive splitting and a map function? Maybe somebody can see a more elegant way of solving this problem? Many thanks, Chris.

Replies are listed 'Best First'.
Re: filter an array with consecutive splits
by desemondo (Hermit) on Jan 04, 2010 at 02:52 UTC
    Great question Chris!

    This sounds like a good candidate for profiling. I myself am fairly new to profiling... and found the following with Devel::SmallProf. I populated @array with 1400 elements, I don't know if that's enough to get meaningful results. However I did see a difference below...

    If you're able to, run your real data through one of the profiling modules. Using real data yeilds more meaningful results when you do profiling. Then you'll have a better handle on what is slowing things down, and what might be possible to write differently.

    First hash of your code to get a working version...
    use strict; use warnings; my @array = ("x y.g z 123", "a b.f c 456","a b.f c 456" ); #added lo +ts more on the runs through smallprof. my @include = ("b","q"); my @keep; my %testHash = map { $_,1} @include; for my $tmp (@array){ my @tokens = split /\s+/,$tmp; my ($test,$rubbish) = split /\./, $tokens[1],2; if (exists $testHash{$test}) { push (@keep, $tmp); }; } print keys %testHash , "\n"; print "@keep";
    Results
    ================ SmallProf version 2.02 ================ Profile of main.pl + Page 1 =============================================================== +== count wall tm cpu time line 0 0.00000 0.00000 1:use strict; 0 0.00000 0.00000 2:use warnings; 0 0.00000 0.00000 3: 1 0.00091 0.00000 4:my @array = ("x y.g z 123", "a b +.f c 456", 1 0.00000 0.00000 5:my @include = ("b","q"); 1 0.00000 0.00000 6:my @keep; 0 0.00000 0.00000 7: 0 0.00000 0.00000 8: 1 0.00001 0.00000 9:my %testHash = map { $_,1} @includ +e; 0 0.00000 0.00000 10: 1 0.00000 0.00000 11:for my $tmp (@array){ 1404 0.00507 0.07800 12: my @tokens = split /\s+/,$ +tmp; 1404 0.00034 0.03200 13: my ($test,$rubbish) = split +/\./, 1404 0.00282 0.01500 14: if (exists $testHash{$test +}) { 0 0.00000 0.00000 15: push (@keep, $tmp); 0 0.00000 0.00000 16: }; 0 0.00000 0.00000 17:} 0 0.00000 0.00000 18: 1 0.00013 0.00000 19:print keys %testHash , "\n"; 1 0.10416 0.00000 20:print "@keep";
    From this it appears that split is responsible for the majority of time required for this loop. I also found out that split has an optional 4th argument LIMIT which is a positive integer starting from 1 for the number of fields to return. I suspect split was/is uslessly returning the extra characters and simply throughing them into the void. This however is only beneficial if you're dealing with millions of elements and not 1400 as per this profile run. Otherwise it'll save barely a couple of milliseconds...

    Modified version with my ($test,$rubbish) = split /\./, $tokens[1],2; having it's LIMIT set to 2.

    ================ SmallProf version 2.02 ================ Profile of main.pl + Page 1 =============================================================== +== count wall tm cpu time line 0 0.00000 0.00000 1:use strict; 0 0.00000 0.00000 2:use warnings; 0 0.00000 0.00000 3: 1 0.00092 0.00000 4:my @array = ("x y.g z 123", "a b +.f c 456", 1 0.00000 0.00000 5:my @include = ("b","q"); 1 0.00000 0.00000 6:my @keep; 0 0.00000 0.00000 7: 0 0.00000 0.00000 8: 1 0.00001 0.00000 9:my %testHash = map { $_,1} @includ +e; 0 0.00000 0.00000 10: 1 0.00000 0.00000 11:for my $tmp (@array){ 1404 0.00347 0.06300 12: my @tokens = split /\s+/,$ +tmp; 1404 0.00007 0.03000 13: my ($test,$rubbish) = split +/\./, 1404 0.00046 0.00000 14: if (exists $testHash{$test +}) { 0 0.00000 0.00000 15: push (@keep, $tmp); 0 0.00000 0.00000 16: }; 0 0.00000 0.00000 17:} 0 0.00000 0.00000 18: 1 0.00019 0.00000 19:print keys %testHash , "\n"; 1 0.11720 0.00000 20:print "@keep";
      Just for grins, I took a stab at performance profiling of several solutions. I took out the final print statements. And it could be that my transcription contains some errors. But here are the results that I got with 4 runs:
      #!usr/bin/perl -w use strict; use Benchmark; timethese (1000000, { jwkrahn => q{ my @array = ( "x y.g z 123", "a b.f c 456", "d b.c w 321" ); my @include = ( "b", "q" ); my @keep = map { my $item = $_; my $key = ( split )[ 1 ]; map {$key =~ +/\Q$_/ ? $item : ()} @include } @array; }, AnomalousMonk => q{ my @array = ('x y.g z 123', 'a b.f c 456', 'd b.c w 321'); my @include = qw(b q); my ($include) = map qr{ \b [$_] \b }xms, join '', @include; my $second = qr{ \b [[:lower:]] \b }xms; my $element = qr{ $include \. $second }xms; my @keep = grep m{ $element }xms, @array; }, Marshall => q{ my @array = ("x y.g z 123", "a b.f c 456", "d b.c w 321"); my @include = ("b","q"); my %include = map { $_ => 1 } @include; my @keep; foreach my $row (@array) { my $important_letter = ($row =~ m/\s+([a-z])\./)[0]; push (@keep, $row) if ($include{$important_letter}); } }, desemondo => q{ @array = ("x y.g z 123", "a b.f c 456", "d b.c w 321"); @include = ("b","q"); my %testHash = map { $_,1} @include; foreach my $tmp (@array) { my @tokens = split /\s+/,$tmp; my ($test,$rubbish) = split /\./, $tokens[1]; push (@keep, $tmp) if (exists $testHash{$test}); } } } ); __END__ Benchmark: timing 1000000 iterations of AnomalousMonk, Marshall, desemondo, jwkrahn... AnomalousMonk: 27 wallclock secs (26.42 usr +0.16 sys = 26.58 CPU)@37625.10/s Marshall: 17 wallclock secs (17.38 usr + 0.00 sys = 17.38 CPU)@57553.96/s desemondo: 40 wallclock secs (40.94 usr + 0.11 sys = 41.05 CPU)@24362.32/s jwkrahn: 77 wallclock secs (76.48 usr + 0.00 sys = 76.48 CPU)@13074.63/s Benchmark: timing 1000000 iterations of AnomalousMonk, Marshall, desemondo, jwkrahn... AnomalousMonk: 27 wallclock secs (26.45 usr + 0.09 sys = 26.55 CPU)@37669.04/s Marshall: 18 wallclock secs (17.36 usr + 0.00 sys = 17.36 CPU)@57607.01/s desemondo: 44 wallclock secs (44.05 usr + 0.14 sys = 44.19 CPU)@ 22631.09/s jwkrahn: 97 wallclock secs (97.92 usr + 0.02 sys = 97.94 CPU)@10210.65/s Benchmark: timing 1000000 iterations of AnomalousMonk, Marshall, desemondo, jwkrahn... AnomalousMonk: 27 wallclock secs (26.75 usr + 0.11 sys = 26.86 CPU)@37230.08/s Marshall: 18 wallclock secs (17.64 usr + 0.00 sys = 17.64 CPU)@56689.34/s desemondo: 47 wallclock secs (48.14 usr + 0.09 sys = 48.23 CPU)@20732.26/s jwkrahn: 106 wallclock secs (104.81 usr + 0.00 sys = 104.81 CPU)@9540.80/s Benchmark: timing 1000000 iterations of AnomalousMonk, Marshall, desemondo, jwkrahn... AnomalousMonk: 28 wallclock secs (27.27 usr + 0.08 sys = 27.34 CPU)@36569.76/s Marshall: 21 wallclock secs (20.39 usr + 0.00 sys = 20.39 CPU)@49043.65/s desemondo: 55 wallclock secs (54.77 usr + 0.11 sys = 54.88 CPU)@18223.23/s jwkrahn: 114 wallclock secs (115.05 usr + 0.00 sys = 115.05 CPU)@8692.10/s
      Devel::SmallProf should be considered deprecated (*) as it has been superseded by the new and much accurate and better Devel::NYTProf.

      (*) unless somebody shows me some profiling scenario that is still handled better by Devel::SmallProf (or Devel::FastProf) than by Devel::NYTProf

Re: filter an array with consecutive splits
by jwkrahn (Abbot) on Jan 04, 2010 at 02:56 UTC

    This may work better:

    $ perl -le' my @array = ( "x y.g z 123", "a b.f c 456", "d b.c w 321" ); my @include = ( "b", "q" ); my @keep = map { my $item = $_; my $key = ( split )[ 1 ]; map $key =~ +/\Q$_/ ? $item : (), @include } @array; print for @keep; ' a b.f c 456 d b.c w 321
Re: filter an array with consecutive splits
by AnomalousMonk (Archbishop) on Jan 04, 2010 at 03:02 UTC

    Assuming that the second character of an  a.b element is always a single lower case alpha, here's one solution:

    >perl -wMstrict -le "my @array = ('x y.g z 123', 'a b.f c 456', 'd b.c w 321'); my @include = qw(b q); my ($include) = map qr{ \b [$_] \b }xms, join '', @include; my $second = qr{ \b [[:lower:]] \b }xms; my $element = qr{ $include \. $second }xms; my @keep = grep m{ $element }xms, @array ; print qq{'$_'} for @keep; " 'a b.f c 456' 'd b.c w 321'

    Updates:

    • Changed final  print slightly.
    • Changed initialization of  $include scalar.
    • Changed definitions of  $include and  $second regexes to insure they can only match single alphas. This could be even more tightly defined by changing the second  \b boundary assertion in  $include to be a positive look-ahead to a  . (dot), and the first  \b boundary assertion in  $second to be a positive look-behind to a  . (dot).

Re: filter an array with consecutive splits
by Marshall (Canon) on Jan 04, 2010 at 03:12 UTC
    Perhaps?
    #!usr/bin/perl -w use strict; my @array = ("x y.g z 123", "a b.f c 456", "d b.c w 321"); my @include = ("b","q"); my %include = map { $_ => 1 } @include; my @keep; foreach my $row (@array) { my $important_letter = ($row =~ m/\s+([a-z])\./)[0]; push (@keep, $row) if ($include{$important_letter}); } foreach (@keep) { print "keep: $_\n"; } __END__ keep: a b.f c 456 keep: d b.c w 321
    I would keep things simple. You need some kind of looping statement. Then something to figure out whether this is a "good" row or not? Normally regex is faster than split. I used a slice to avoid $1. Then there is an "if" and a push if this row is "ok". The regex can become fancier if need be, or sometimes less fancy is ok too...($row =~ m/\s+(\w)\./)[0];