Re: search for particular elements of hash with multiple values

by karlgoethebier (Abbot)
on Apr 19, 2017 at 17:58 UTC


in reply to search for particular elements of hash with multiple values

MCE::Loop might also be an option:

#!/usr/bin/env perl
use warnings;
use strict;
use MCE::Loop;
use feature qw(say);

my $cpus = MCE::Util->get_ncpu() || 4;

MCE::Loop::init { max_workers => $cpus, };

my %barcode_hash = (
    1 => [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_CCCC', 0 ],
    2 => [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ],
    3 => [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_BBBB', 0 ],
    4 => [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ],
);

my $barcode_pair_35 = 'TTTT_AAAA';

mce_loop {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    for (@$chunk_ref) {
        if ( $barcode_hash{$_}[2] eq $barcode_pair_35 ) {
            say qq(Found $barcode_hash{$_}[2] at $_);
        }
    }
} keys %barcode_hash;

__END__

Update:

Ok, some benchmarking.

Playing around with $size might be worth the effort. Your mileage may vary. I hope I quoted haukex correctly and jumped to the right conclusions.

#!/usr/bin/env perl
use MCE::Loop;
use Benchmark qw( :hireswallclock cmpthese timethese );
use strict;
use warnings;
use feature qw(say);

my $size = 10000;
say $size;

my $cpus = MCE::Util->get_ncpu() || 4;

MCE::Loop::init { max_workers => $cpus, chunk_size => $size };

my $data = [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_CCCC', 0 ];

our %barcode_hash = map { $_ => $data } 1 .. 99998;

$barcode_hash{99999} =
    [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ];
$barcode_hash{100000} =
    [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ];

our $barcode_pair_35 = 'TTTT_AAAA';

my $results = timethese(
    -10,
    {
        'karl'   => 'karl',
        'haukex' => 'haukex',
    }
);

cmpthese($results);

sub haukex {
    our %barcode_hash;
    our $barcode_pair_35;
    for my $key ( sort keys %barcode_hash ) {
        1 if $barcode_hash{$key}[2] eq $barcode_pair_35;
    }
}

sub karl {
    our %barcode_hash;
    our $barcode_pair_35;
    mce_loop {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        for (@$chunk_ref) {
            1 if ( $barcode_hash{$_}[2] eq $barcode_pair_35 );
        }
    } keys %barcode_hash;
}

__END__

haukex  6.74/s    --  -47%
karl    12.8/s   90%    --

Update 2: Shit! If I omit the sort I lose...

Update 3: A slightly different picture with 1,000,000 keys and with the key count ($max) computed before benchmarking:

my $size = 10000;
my $cpus = MCE::Util->get_ncpu() || 4;

MCE::Loop::init { max_workers => $cpus, chunk_size => $size };

our $max = scalar keys %barcode_hash;

sub haukex {
    our %barcode_hash;
    our $barcode_pair_35;
    our $max;
    for ( 1 .. $max ) {
        1 if $barcode_hash{$_}[2] eq $barcode_pair_35;
    }
}

sub karl {
    our %barcode_hash;
    our $barcode_pair_35;
    our $max;
    mce_loop {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        for (@$chunk_ref) {
            1 if ( $barcode_hash{$_}[2] eq $barcode_pair_35 );
        }
    } 1 .. $max;
}

haukex  2.29/s    --  -38%
karl    3.67/s   60%    --

Update 4: It's worth installing Sereal::Decoder.
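A quick way to check whether the Sereal modules are available is a small sketch like this (mine, not part of the benchmark above; MCE prefers Sereal::Encoder/Decoder 3.015+ and falls back to Storable otherwise):

#!/usr/bin/env perl
use strict;
use warnings;

# Report whether the Sereal serializer modules are installed.
for my $mod (qw( Sereal::Encoder Sereal::Decoder )) {
    if ( eval "require $mod; 1" ) {
        printf "%s %s is installed\n", $mod, $mod->VERSION;
    }
    else {
        print "$mod is not installed - MCE will fall back to Storable\n";
    }
}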

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Furthermore I consider that Donald Trump must be impeached as soon as possible

Replies are listed 'Best First'.
Re^2: search for particular elements of hash with multiple values
by marioroy (Prior) on Apr 20, 2017 at 21:54 UTC

    Update: This post was made to showcase another way to benchmark when parallel workers are involved. Upon further review, the OP wants to increment $bc_pair_num, which isn't done here. I will come back later and update the code. Also, sorting is not required, since Perl has two fast ordered-hash implementations. I will try Hash::Ordered and MCE::Shared::Ordhash (constructed as non-shared). These modules are fast.

    Update: The OP does not mention sorting, so please disregard the mention of the two ordered-hash implementations. As haukex said, sorting is helpful during debugging sessions. This is true.

    Update: Posted a non-benchmark version, which increments the $bc_pair_num field.

    Greetings,

    Sometimes workers do not stop immediately the way you want them to with various benchmark modules. The following is another way, based on karlgoethebier's example. For an array (mce_loop), the manager process chunks the input and sends the next chunk via IPC. For a sequence (mce_loop_s) with the bounds_only option, workers compute the begin and end boundaries of the next chunk themselves, so the run has less overhead.
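    A minimal sketch of the two calling styles (an illustration only, not the demo script below), showing what $chunk_ref holds in each case:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use MCE::Loop;

    MCE::Loop::init { max_workers => 4, bounds_only => 1 };

    # mce_loop: the manager chunks the list and sends each chunk via IPC;
    # $chunk_ref holds the actual elements of the chunk.
    mce_loop {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        MCE->print( "chunk $chunk_id holds " . scalar(@$chunk_ref) . " elements\n" );
    } 1 .. 100;

    # mce_loop_s with bounds_only: workers receive only the begin and end
    # boundaries of the next range and generate the values themselves.
    mce_loop_s {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        MCE->print( "chunk $chunk_id spans $chunk_ref->[0] .. $chunk_ref->[1]\n" );
    } 1, 100;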

    If you must set the chunk_size option, do not go over 8000 when processing an array; Perl performance degrades beyond that. Better yet, simply comment out the chunk_size option or do not set it at all. There is no easy formula for choosing chunk_size, but the default chunk_size => 'auto' for the MCE models does a good job in most cases.
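    For example (a sketch of just the init call), you can make that default explicit and let MCE pick:

    use MCE::Loop;

    # 'auto' is the default chunk_size for the MCE models; listing it here
    # only makes the choice explicit.
    MCE::Loop::init {
        max_workers => MCE::Util->get_ncpu() || 4,
        chunk_size  => 'auto',
    };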

    For arrays, check whether Perl has Sereal::Decoder and Sereal::Encoder 3.015 or later installed. MCE will use Sereal 3.015+ for serialization if available; otherwise it defaults to Storable. The results are mind-boggling. The reason is that for arrays, MCE involves the manager process, which chunks the array elements and sends them via IPC. Even with that overhead, MCE runs faster. Note that this requires the our keyword (rather than my) on %barcode_hash and $barcode_pair_35. Thank you, karlgoethebier, for this enlightenment.

    Results: run 50 times for 100000 keys: $max set to 1e5.

    $ perl demo.pl haukex 50
    duration (haukex): 1.236 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 0.825 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 0.313 seconds
    found match: yes

    Results: run 50 times for 1 million keys: $max set to 1e6.

    $ perl demo.pl haukex 50
    duration (haukex): 17.388 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 7.633 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 2.858 seconds
    found match: yes

    Demo script.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use feature qw( say );

    use MCE::Loop;
    use Time::HiRes qw( time );

    sub usage {
        warn "usage: $0 ( haukex | karlary | karlseq ) [ count ]\n\n";
        exit 1;
    }

    my $func  = shift || usage();
    my $count = shift || 50;

    usage() unless main->can($func);

    my $cpus = MCE::Util->get_ncpu() || 4;
    my $max  = 100000;

    MCE::Loop::init {
        max_workers => $cpus,
        chunk_size  => 8000,   # <-- do not go over 8000
        bounds_only => 1       # <-- applies to sequence
    };

    my $data = [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_CCCC', 0 ];

    our %barcode_hash = map { $_ => $data } 1 .. $max - 2;

    $barcode_hash{ ($max - 1) } =
        [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ];
    $barcode_hash{ ($max) } =
        [ 'AGCTCGTTGTTCGATCCA', 'GAGAGATAGATGATAGTG', 'TTTT_AAAA', 0 ];

    our $barcode_pair_35 = 'TTTT_AAAA';

    {
        no strict 'refs';
        my $start = time;
        my $ret;

        $ret = $func->() for 1 .. $count;

        printf "duration ($func): %0.03f seconds\n", time - $start;
        printf "found match: %s\n", $ret ? 'yes' : 'no';
    }

    exit 0;

    sub haukex {
        # serial code
        my $ret = 0;
        for ( 1 .. $max ) {
            $ret = 1, last if $barcode_hash{$_}[2] eq $barcode_pair_35;
        }
        return $ret;
    }

    sub karlary {
        # workers receive next array chunk
        my @ret = mce_loop {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;
            for ( @$chunk_ref ) {
                MCE->gather(1), MCE->abort(), last
                    if ( $barcode_hash{$_}[2] eq $barcode_pair_35 );
            }
        } 1 .. $max;   # <-- for array 1 .. $max

        return @ret ? 1 : 0;
    }

    sub karlseq {
        # workers receive next sequence 'begin' and 'end' boundaries
        my @ret = mce_loop_s {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;
            for ( $chunk_ref->[0] .. $chunk_ref->[1] ) {
                MCE->gather(1), MCE->abort(), last
                    if ( $barcode_hash{$_}[2] eq $barcode_pair_35 );
            }
        } 1, $max;     # <-- for sequence 1, $max

        return @ret ? 1 : 0;
    }

    For the MCE bits, I used MCE->gather and MCE->abort. The abort method is helpful because it stops all workers from processing further chunks, ending the job early.

    Update: Results from a Windows 7 VM configured with 4 cores and Strawberry Perl. I think Perl makes extra copies there, so spawning workers takes extra time.

    Results: run 50 times for 100000 keys: $max set to 1e5.

    $ perl demo.pl haukex 50
    duration (haukex): 1.232 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 1.482 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 0.858 seconds
    found match: yes

    Results: run 50 times for 1 million keys: $max set to 1e6.

    $ perl demo.pl haukex 50
    duration (haukex): 20.108 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 16.770 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 11.419 seconds
    found match: yes

    Update: Also tested Perl in the Cygwin environment. Here, it seems workers are spawned instantly after the initial creation, as seen in the Task Manager.

    Results: run 50 times for 100000 keys: $max set to 1e5.

    $ perl demo.pl haukex 50
    duration (haukex): 1.607 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 1.529 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 0.749 seconds
    found match: yes

    Results: run 50 times for 1 million keys: $max set to 1e6.

    $ perl demo.pl haukex 50
    duration (haukex): 25.194 seconds
    found match: yes

    $ perl demo.pl karlary 50
    duration (karlary): 14.446 seconds
    found match: yes

    $ perl demo.pl karlseq 50
    duration (karlseq): 7.051 seconds
    found match: yes

    Regards, Mario.

      To display the location, I changed the code accordingly and ran it once. Running serially seems fast enough: Perl completes in less than one second for 1 million keys on an Intel i7 Haswell chip running at 2.6 GHz. Nevertheless, this post was written to demonstrate workers sending output to the manager process for STDOUT via MCE->print(...).

      Results: run 1 and 2 times for 1 million keys: $max set to 1e6.

      $ perl demo2.pl haukex 1
      Found at 999999
      Found at 1000000
      duration (haukex): 0.336 seconds

      $ perl demo2.pl karlary 1
      Found at 999999
      Found at 1000000
      duration (karlary): 0.231 seconds

      $ perl demo2.pl karlseq 1
      Found at 999999
      Found at 1000000
      duration (karlseq): 0.122 seconds

      $ perl demo2.pl haukex 2
      Found at 999999
      Found at 1000000
      Found at 999999
      Found at 1000000
      duration (haukex): 0.665 seconds

      $ perl demo2.pl karlary 2
      Found at 999999
      Found at 1000000
      Found at 999999
      Found at 1000000
      duration (karlary): 0.477 seconds

      $ perl demo2.pl karlseq 2
      Found at 999999
      Found at 1000000
      Found at 999999
      Found at 1000000
      duration (karlseq): 0.253 seconds

      Below, workers report the location.

      ...

      {
          no strict 'refs';
          my $start = time;

          $func->() for 1 .. $count;

          printf "duration ($func): %0.03f seconds\n", time - $start;
      }

      ...

      sub haukex {
          # serial code
          for my $key ( 1 .. $max ) {
              print "Found at $key\n"
                  if ( $barcode_hash{$key}[2] eq $barcode_pair_35 );
          }
          return;
      }

      sub karlary {
          # workers receive next array chunk
          mce_loop {
              my ( $mce, $chunk_ref, $chunk_id ) = @_;
              for my $key ( @$chunk_ref ) {
                  MCE->print("Found at $key\n")
                      if ( $barcode_hash{$key}[2] eq $barcode_pair_35 );
              }
          } 1 .. $max;   # <-- for array 1 .. $max

          return;
      }

      sub karlseq {
          # workers receive next sequence 'begin' and 'end' boundaries
          mce_loop_s {
              my ( $mce, $chunk_ref, $chunk_id ) = @_;
              for my $key ( $chunk_ref->[0] .. $chunk_ref->[1] ) {
                  MCE->print("Found at $key\n")
                      if ( $barcode_hash{$key}[2] eq $barcode_pair_35 );
              }
          } 1, $max;     # <-- for sequence 1, $max

          return;
      }
