Hi rajaman,

I am appending below input and output file formats...

Great! I made two demonstrations, both driven entirely by a two-level hash. The serial code, based on ikegami's demonstration, may be fast enough for your use case. The parallel demonstration may run twice as fast or more. Preserving gather order is not necessary, since the results are merged into a hash and sorted at the end. Be sure to have Sereal installed for maximum performance.
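To illustrate why gather order does not matter, here is a minimal sketch without MCE. Two hypothetical per-worker hashes are merged in both orders and give the same result, because only the keys carry data. The helper names (merge_into, render) and the sample keys are made up for illustration.

```perl
use strict;
use warnings;

# Merge one worker's 2-level hash into a destination hash.
# Same idea as the gather routine in the parallel demonstration.
sub merge_into {
    my ($dest, $src) = @_;
    for my $k1 (keys %{$src}) {
        for my $k2 (keys %{ $src->{$k1} }) {
            $dest->{$k1}{$k2} = undef;   # the value is irrelevant; the key is the data
        }
    }
    return $dest;
}

# Flatten a 2-level hash to "key=>a|b" lines so two results can be compared.
sub render {
    my ($h) = @_;
    return join "\n",
        map { "$_=>" . join('|', sort keys %{ $h->{$_} }) } sort keys %{$h};
}

# Hypothetical partial results from two workers (made-up sample data).
my %worker_a = ( D1 => { aspirin => undef }, D2 => { fever => undef } );
my %worker_b = ( D1 => { ibuprofen => undef, aspirin => undef } );

my $ab = render( merge_into( merge_into( {}, \%worker_a ), \%worker_b ) );
my $ba = render( merge_into( merge_into( {}, \%worker_b ), \%worker_a ) );

print "$ab\n";
print $ab eq $ba ? "same result in either order\n" : "results differ\n";
```

Duplicate keys such as aspirin collapse on their own, so no extra de-duplication step is needed after merging.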

Both demonstrations produce the same output.

Serial Code

#!/usr/bin/perl

use strict;
use warnings;

use Sort::Naturally qw(nsort);

# This program reads an abstract sentence file and produces
# output with the following format ...

# Exit after printing usage; otherwise the script would continue
# with undefined file names.
if ($#ARGV != 1) {
    print "usage: $0 <inputfile> <outputfile>\n";
    exit 1;
}

my $inputfile1 = $ARGV[0];
my $outputfile = $ARGV[1];

my %hashunique;

open RF, "<", $inputfile1 or die "Can't open $inputfile1: $!";

local $/ = '';  # blank line, paragraph break

while (<RF>) {
    my @lines = split /\n/, $_;
    # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

    # Line 0 of each record is skipped; so are '--' separator lines.
    for my $i (1..$#lines) {
        next if $lines[$i] eq '--';
        while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
            $hashunique{"D$1"}{$2} = undef;
        }
    }
}

close RF;

# Results.

open WF, ">", $outputfile or die "Can't open $outputfile: $!";

foreach my $k (nsort keys %hashunique) {
    $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
    print WF "$k=>$hashunique{$k}\n";
}

close WF;
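As a small worked example of the extraction step, the sketch below runs the same pattern over one made-up record. The record contents are an assumption, since the actual input format is not reproduced in this post. It uses the core sort rather than nsort to stay dependency-free, so keys come out in plain string order.

```perl
use strict;
use warnings;

# Hypothetical record (made-up data): line 0 is skipped, as in the
# demonstrations, and '--' is a separator line.
my $record = join "\n",
    'PMID 000',
    'Patient took D*12*aspirin*D for D*7*fever*D.',
    '--',
    'Later D*12*ibuprofen*D replaced D*12*aspirin*D.';

my %hashunique;
my @lines = split /\n/, $record;

for my $i (1 .. $#lines) {
    next if $lines[$i] eq '--';
    # $1 is the identifier, $2 the term; repeated pairs collapse into one key.
    while ($lines[$i] =~ m/\bD\*(.*?)\*(.*?)\*D\b/g) {
        $hashunique{"D$1"}{$2} = undef;
    }
}

# Core sort is string-wise ("D12" before "D7"); nsort would order D7 first.
my @out = map { "$_=>" . join('|', sort keys %{ $hashunique{$_} }) }
          sort keys %hashunique;

print "$_\n" for @out;
```

The term aspirin appears twice under D12 in the sample but shows up once in the output, which is the point of keying the second level by term.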

Parallel Code

#!/usr/bin/perl

use strict;
use warnings;

use Sort::Naturally qw(nsort);
use MCE;

# This program reads an abstract sentence file and produces
# output with the following format ...

# Exit after printing usage; otherwise the script would continue
# with undefined file names.
if ($#ARGV != 1) {
    print "usage: $0 <inputfile> <outputfile>\n";
    exit 1;
}

my $inputfile1 = $ARGV[0];
my $outputfile = $ARGV[1];

unless (-e $inputfile1) {
    die "Can't open $inputfile1: No such file or directory";
}

# Gather routine for the manager process.

my %hashunique;

sub gather {
    my ($hashref) = @_;
    for my $k1 (keys %{$hashref}) {
        for my $k2 (keys %{$hashref->{$k1}}) {
            $hashunique{$k1}{$k2} = undef;
        }
    }
}

# The user function for MCE workers. Workers open a file handle to
# a scalar ref due to using MCE option use_slurpio => 1.

sub user_func {
    my ($mce, $slurp_ref, $chunk_id) = @_;
    my %localunique;

    open RF, '<', $slurp_ref;

    # A shared-hash is not necessary. The gist of it all is batching
    # to a local hash. Otherwise, a shared-hash inside a loop involves
    # high IPC overhead.

    local $/ = '';  # blank line, paragraph break

    # in the event worker receives 2 or more records
    while (<RF>) {
        my @lines = split /\n/, $_;
        # my ($indexofdashinarray) = grep { $lines[$_] =~ /\-\-/ } 0..$#lines;

        for my $i (1..$#lines) {
            next if $lines[$i] eq '--';
            while ($lines[$i] =~ m/(?:\b)D\*(.*?)\*(.*?)\*D(?:\b)/g) {
                $localunique{"D$1"}{$2} = undef;
            }
        }
    }

    close RF;

    # Call gather outside the loop.
    MCE->gather(\%localunique);
}

# Am using the core MCE API. Workers read the input file directly and
# sequentially, one worker at a time.

my $mce = MCE->new(
    max_workers => 4,
    input_data  => $inputfile1,
    chunk_size  => 1 * 1024 * 1024,  # 1 MiB
    RS          => '',  # important, blank line, paragraph break
    gather      => \&gather,
    user_func   => \&user_func,
    use_slurpio => 1
);

$mce->run();

# Results.

open WF, ">", $outputfile or die "Can't open $outputfile: $!";

foreach my $k (nsort keys %hashunique) {
    $hashunique{$k} = join '|', sort(keys %{$hashunique{$k}});
    print WF "$k=>$hashunique{$k}\n";
}

close WF;
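For readers unfamiliar with opening a file handle on a scalar ref, here is a minimal core-Perl sketch of what the worker does with its chunk. It does not use MCE; the $chunk string is a made-up stand-in for the data a worker might receive through use_slurpio => 1.

```perl
use strict;
use warnings;

# Hypothetical chunk (made-up data) holding three blank-line-separated records.
my $chunk     = "rec1 line1\nrec1 line2\n\nrec2 line1\n\n\nrec3 line1\n";
my $slurp_ref = \$chunk;

my @records;
{
    local $/ = '';    # paragraph mode: one or more blank lines end a record

    # Opening a reference to a scalar gives an in-memory file handle.
    open my $fh, '<', $slurp_ref or die "Can't open in-memory handle: $!";

    while (my $rec = <$fh>) {
        chomp $rec;   # in paragraph mode, chomp strips all trailing newlines
        push @records, $rec;
    }
    close $fh;
}

printf "%d records\n", scalar @records;
```

Because the chunk is already in memory, the worker can loop over whole records without touching the disk again, then send a single batched result to the manager.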

Regards, Mario


In reply to Re^3: Parallel-processing the code by marioroy
in thread Parallel-processing the code by rajaman
