wollmers has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

as this part takes ~15% of an fast alignment algorithm, it would be nice to have a faster way.

my $Y = [ qw( a b a) ]; my $YPos; my $index; for ( $index = 0 ; $index <= $#$Y ; $index++ ) { push ( @{ $YPos->{$Y->[$index]} }, $index ); } # now $YPos should contain my $result = { a => [ 0, 2 ], b => [ 1 ], };

TIA (Thanks In Advance)

Helmut Wollmersdorfer

Replies are listed 'Best First'.
Re: Faster indexing an array
by CountZero (Bishop) on Sep 19, 2014 at 21:06 UTC
    Replacing your C-style loop by
    push @{$YPos->{$_}}, $index++ for @$Y;
    should be a bit faster as it eliminates one step of indexing.

    As an aside, why do you use scalars and references rather than arrays and hashes?

    Update:

    I ran your code on an array of 10 million elements. It took 27 seconds. My code finished in 20 seconds. Using arrays and hashes instead of scalars and references made it run in 19 seconds, another 5% saved!

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      Thanks a lot.

      Davidos solution is faster than mine, and LanX' and CountZeros is the fastest.

      For some reason Devel::NYTProf gives better results for C-style loops over 'for my $i (...)'.

      The hashref is a relict from having this part in an sub, then refactored down, then inlined, now 1-lined. See here the same logic in Algorithm::Diff:

      sub _withPositionsOfInInterval { my $aCollection = shift; # array ref my $start = shift; my $end = shift; my $keyGen = shift; my %d; my $index; for ( $index = $start ; $index <= $end ; $index++ ) { my $element = $aCollection->[$index]; my $key = &$keyGen( $element, @_ ); if ( exists( $d{$key} ) ) { unshift ( @{ $d{$key} }, $index ); } else { $d{$key} = [$index]; } } return wantarray ? %d : \%d; }

      Having an arrayref comes from the two input-parameters, sequences X, Y or sometimes also called A, B in the descriptions of diff/LCS/align-algorithms.

      Helmut Wollmersdorfer

        The if(exists...) test isn't necessary and can be optimized away!

        DB<119> unshift @{ $d{key} }, 'bla' => 1 DB<120> \%d => { key => ["bla"] }

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

        you didn't tell us that you need only an interval out of the array, the fastest approach with my version of Perl is iterating over an interval or array slice

        DB<173> use Time::HiRes qw/time/ DB<174> @x=();$t=time; for (my $i=$start; $i<=$end;$i++) { push @x, +$a[$i]}; print time-$t 0.21531081199646 DB<175> @y=(); $t=time; push @y, $_ for @a[$start..$end]; print time +-$t 0.1353440284729 DB<176> @z=();$t=time; push @z, $a[$_] for $start..$end; print time- +$t 0.142512083053589

        (you are welcome to do proper benchmarking =)

        but optimizing or inlining &$keyGen( $element, @_ ) should lead to much more efficiency, since 25% of 15% doesn't count much!!!

        update

        of course you could directly iterate over the values of a hash slice of an array slice... :)

        DB<180> %h=(); $i=$start; $t=time; push @$_, $i++ for @h{@a[$start.. +$end]}; print time-$t => 1 0.427901983261108

        BTW: @a=0..1e6;$start=1e5;$end=2*$start;

        Cheers Rolf

        (addicted to the Perl Programming Language and ☆☆☆☆ :)

      I for one, tend to use scalars and refs because that makes using the variables more consistent (arrows for everybody!). It makes for less effort to pass around the refs to helper functions. I'll still use a direct array variable for something very local and/or temporary, but in general it'll be refs.

      These days, I often go all the way to a $universal hashref, which makes it almost trivial to save and restore the program state.

      PS:
      Such a $universal hashref technique is not recommended for anyone who has issues with spelling key names consistently and accurately.

        (1) I suggest using %hash and @array for local use, and if you need to pass it around, take a ref to it... I'll often have a routine that creates a %hash and then does a "return \%hash;". The reasoning is that the sigils are something like built-in hungarian notation, and you might as well take advantage of them when you can. (By the way: Conway recommends using a "_ref" suffix on references, but my take is that's either too heavy-handed or doesn't go far enough. If you're going to say "_ref" you might as well specify which it is: "_aref" or "_href". If all you care about is whether it's a reference or a scalar, then use plural or singular names: "$items" or "$item".

        (2) But the real reason I'm writing is this remark: "These days, I often go all the way to a $universal hashref, which makes it almost trivial to save and restore the program state."

        (a) 'tis true that hash field names get autovivified and hence defeat "use strict", but if you have a known set of allowed names, you can strictify them yourself, using the "lock_keys" routine from Hash::Util.

        my %uni = map{ $_ => undef } @allowed_names; Hash::Util::lock_keys( %hash );
        (b) but you're reinventing perl objects. You might as well just do objects... then if you're using something moose-like, like Mouse or Moo, the "allowed names" get turned into a list of "has" lines.
Re: Faster indexing an array
by davido (Cardinal) on Sep 19, 2014 at 21:03 UTC

    I'll let you do the benchmarking. However, I suspect that using a range-based for loop will be faster than a C-style for loop, because it pushes more work out of Perl and into perl.

    push @{ $z->{$y->[$_]} }, $_ for 0 .. $#$y;

    If that's still not fast enough, rewrite the alignment algorithm with Inline::C. :) You're currently chasing after possibly halving the runtime of a segment of code that consumes 15% of a tight algorithm. Let's say you could completely optimize out this tranform. That's still only a 15% savings. If, instead, you're able to cut it in half, that's a 7.5% savings overall. Rewrite the entire algorithm in C using Inline::C, and you might be able to shave half off the algorithm's runtime, rather than half off of a small portion of the algorithm's runtime.


    Dave

Re: Faster indexing an array
by LanX (Saint) on Sep 19, 2014 at 21:05 UTC
    don't know if it's that much faster but it's IMHO better readable (i.e. more "perlish") =)

    DB<112> @Y=qw(a b a) => ("a", "b", "a") DB<113> %Ypos=(); $i=0; => 0 DB<114> push @{ $Ypos{$_} },$i++ for @Y => "" DB<115> \%Ypos => { a => [0, 2], b => [1] }

    please note that the debugger-shell has problems to handle lexical variables, that's why I omitted them.

    And I avoided dereferencing, but I don't think it'll cost you much reintroducing it. (if needed)

    But you should add my declarations in productive code.

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    PS: from 5.14 on you can also write

    push $Ypos{$_},$i++ for @Y

Re: Faster indexing an array
by johngg (Canon) on Sep 19, 2014 at 21:32 UTC

    Others beat me to it but here's a benchmark.

    use strict; use warnings; use 5.014; use Benchmark qw{ cmpthese }; my @letters = ( q{a} .. q{z} ); my $raTest = [ qw{ a b a } ]; my $raReal = [ map { $letters[ rand @letters ] } 1 .. 100000 ]; my $rhRes1 = wollmers( $raTest ); my $rhRes2 = johngg( $raTest ); say do { local $" = q{, }; qq{$_ => [ @{ $rhRes1->{ $_ } } ]} } for sort keys %$rhRes1; say do { local $" = q{, }; qq{$_ => [ @{ $rhRes1->{ $_ } } ]} } for sort keys %$rhRes2; cmpthese( -10, { johngg => sub { my $ret = johngg( $raReal ) }, wollmers => sub { my $ret = wollmers( $raReal ) }, } ); sub wollmers { my $aref = shift; my $index; my $rhRet; for ( $index = 0; $index <= $#$aref; $index ++ ) { push @{ $rhRet->{ $aref->[ $index ] } }, $index; } return $rhRet; } sub johngg { my $aref = shift; my $idx = 0; my $rhRet; push @{ $rhRet->{ $_ } }, $idx ++ for @{ $aref }; return $rhRet; }

    The output.

    a => [ 0, 2 ] b => [ 1 ] a => [ 0, 2 ] b => [ 1 ] Rate wollmers johngg wollmers 17.6/s -- -26% johngg 24.0/s 36% --

    A modest increase in speed so perhaps the Inline::C suggestion will be required.

    Cheers,

    JohnGG

Re: Faster indexing an array
by LanX (Saint) on Sep 19, 2014 at 22:29 UTC
    > takes ~15% of an fast alignment algorithm

    you should show us more of the whole algorithm.

    I'm pretty sure our solutions for this isolated part can't be done faster in pure Perl.

    But probably you just need another general tactic.

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

      I need something similar as result as sdiff() of Algorithm::Diff provides.

      $sequence1 = [ qw( a b ) ]; $sequence2 = [ qw( b c ) ]; # the longest common subsequence of it $LCS_index = [[ 1, 1 ]]; # aligned $result = [ [ 'a', '' ], [ 'b', 'b' ], [ '', 'c' ], ]; $stringified = [ 'ab_', '_bc', ];

      The two most popular of the fastest algorithms for LCS are Hunt/McIllroy (used in Algorithm::Diff, Algorithm::LCS from BackPAN written in XS) and Meyers/Ukkonen (used in GNU-diff, String::Similarity).

      What I implemented is an improved Hunt/McIllroy from AFROZA BEGUM, A GREEDY APPROACH FOR COMPUTING LONGEST COMMON SUBSEQUENCES, Journal of Prime Research in Mathematics Vol. 4(2008), 165-170. It beats A::D::sdiff(). To be fair A::D provides more functionality, which I also try to strip down for comparison. In the end I would like to modify the XS of A::LCS. A::LCS processes 0.8 Mio/s in comparison to 14 thousand/s A::D::sdiff() of length=10, edit-distance=4. But making the aligned hunks via perl from the LCS of A::LCS slows down to 35 thousand/s.

      A::LCS-aligned         35714.29/s (n=50000)
      lcs_greedy_aligned:  22831.05/s (n=50000)
      A::D::sdiff:               14204.55/s (n=50000)
      

      Here my code (dirty as it is work in progress):

      sub lcs_greedy { my $self = shift; my $X = shift; my $Y = shift; my $YPos; my $index = 0; push @{ $YPos->{$_} },$index++ for @$Y; my $Xmatches; for ( $index = 0 ; $index <= $#$X ; $index++ ) { if ( exists( $YPos->{$X->[$index]} ) ) { push ( @$Xmatches , $index ); } } my $Xcurrent = -1; my $Ycurrent = -1; my $m = $#$Xmatches; my $n = $#$Y; my @L = (); # LCS my $R = 0; # records the position of last selected symbol my $i = 0; my $Pi; my $Pi1; my $hunk; for ($i = 0; $i <= $m; $i++) { $hunk = []; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; # Position in Y +of ith symbol $Pi1 = ($i < $m && defined $YPos->{$X->[$Xmatches->[$i+1]]}->[0]) ? $YPos->{$X->[$Xmatches->[$i+1]]}->[0] : -1; # Position in Y of + i + 1st symbol #print STDERR '$i: ',$i,' $Pi: ',$Pi,' $Pi1: ',$Pi1,' $R: ',$R,"\n"; while ($Pi1 < $R && $Pi1 > -1) { #print STDERR '$Pi1 < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $Pi1 = $YPos->{$X->[$Xmatches->[$i+1]]}->[0] // -1; } while ($Pi < $R && $Pi < $n+1) { #print STDERR '$Pi < $R',"\n"; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $Pi = $YPos->{$X->[$Xmatches->[$i]]}->[0] // $n+1; } if ($Pi > $Pi1 && $Pi1 > $R) { $hunk = [$Xmatches->[$i+1],$Pi1]; shift @{$YPos->{$X->[$Xmatches->[$i+1]]}}; $R = $Pi1; $i = $i+1; } elsif ($Pi < $n+1) { $hunk = [$Xmatches->[$i],$Pi]; shift @{$YPos->{$X->[$Xmatches->[$i]]}}; $R = $Pi; } if (scalar @$hunk) { while ($Xcurrent+1 < $hunk->[0] || $Ycurrent+1 < $hunk->[1] ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 < $hunk->[0]) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 < $hunk->[1]) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } $Xcurrent = $hunk->[0]; $Ycurrent = $hunk->[1]; #push @L,$hunk; # indices push @L,[$X->[$Xcurrent],$Y->[$Ycurrent]]; # elements } } while ($Xcurrent+1 <= $#$X || $Ycurrent+1 <= $#$Y ) { my $Xtemp = ''; my $Ytemp = ''; if ($Xcurrent+1 <= $#$X) { #$Xtemp = $Xcurrent+1; $Xtemp = $X->[$Xcurrent+1]; $Xcurrent++; } if ($Ycurrent+1 <= $#$Y) { #$Ytemp = $Ycurrent+1; $Ytemp = $Y->[$Ycurrent+1]; $Ycurrent++; } push @L,[$Xtemp,$Ytemp]; } return \@L; }