Multithreading a large file split to multiple files

by 10isitch (Initiate)
on May 14, 2018 at 21:22 UTC

10isitch has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that splits a large 15 GB file into several smaller files. It only runs on a single CPU core and takes over half an hour. Can I make it run on multiple cores so it runs faster? Here's an abbreviated version. Normally it would create 13 smaller files.

use strict;

print "\nSTARTING $0...\n";

if (@ARGV != 3) {
    print "Usage: <BNH scenario file> <Equity scenario output> <IR scenario output>\n";
    exit -1;
}

my $infile = $ARGV[0];
my $eqfile = $ARGV[1];
my $irfile = $ARGV[2];

my (@arr, @arr2, $w1, $w2);

unless (open(INFILE, "< $infile")) { print "Error: Can not open $infile\n";  exit -1 };
unless (open(EQFILE, "> $eqfile")) { print "Error: Can not open $eqfile!\n"; exit -1 };
unless (open(IRFILE, "> $irfile")) { print "Error: Can not open $irfile!\n"; exit -1 };

while (<INFILE>) {
    chomp $_;
    if ($_ =~ /ScenSet/) { print EQFILE "$_\n"; print IRFILE "$_\n"; next; }
    @arr = split /\,/;
    if ($arr[1]) {
        print EQFILE ",equity,Base Scenario,0,\n";
        print IRFILE ",irate,Base Scenario,0,\n";
        next;
    }
    if ($arr[2]) {
        my @arrn = split(/\_/, $arr[2]);
        print EQFILE ",,eq_".$arrn[1].",1,\n";
        print IRFILE ",,ir_".$arrn[1].",1,\n";
        next;
    }
    if ($arr[6]) { $w1 = 0; $w2 = 0; }

    # riskfactor contains the string "Index" will be identified as equity riskfactor
    if ($_ =~ /Index/ && $_ !~ /Credit/) { $w1 = 1; }

    # Assumption for IR - there will be only riskfactor USDSWAP and USDTREA
    if ($_ =~ /USDSWAP/ || $_ =~ /USDTREA/) { $w2 = 1; }

    if ($w1) { print EQFILE "$_\n"; }
    if ($w2) { print IRFILE "$_\n"; }
}

close INFILE;
close EQFILE;
close IRFILE;

Replies are listed 'Best First'.
Re: Multithreading a large file split to multiple files
by BrowserUk (Patriarch) on May 14, 2018 at 22:07 UTC
    Can I make it run on multiple cores so it runs faster?

    Short answer: no.

    The logic of your code dictates that the records in the input file are read in strict first-to-last sequence. Thus, any overhead from switching threads or sharing state is additional time on top of that required for processing.

    Even the code towards the end of the loop is dependent on state changes earlier in that loop.

    And with 15GB of input, there isn't even any mileage in accumulating output in memory to avoid disk thrash.

    It's doubtful if even MCE can help you with this.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      I agree that multiple cores will not help, because the sequential read of the input file is a blocking point.

      I am not so sure about output buffering. I really don't know in this situation, but depending upon the file system and other factors, such as the intelligence of the disk controller, increasing the write buffer size could make a difference.
      PerlIO::buffersize
      Just an idea to try. I would benchmark 64K against the standard size (which I guess is probably 4K) and see if there is any significant difference.
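      Something like this is what I have in mind (untested sketch, assuming the CPAN module PerlIO::buffersize is installed; 65536 is just the 64K figure above, not a measured optimum, and the lexical handle names are only illustrative):

      use PerlIO::buffersize;   # provides the :buffersize(N) layer

      # Open the output handles with a 64K write buffer instead of the default.
      open(my $eqfh, '>:buffersize(65536)', $eqfile)
          or die "Error: Can not open $eqfile!\n";
      open(my $irfh, '>:buffersize(65536)', $irfile)
          or die "Error: Can not open $irfile!\n";

      Then benchmark a run against the unmodified script on the same input to see whether the larger buffer actually buys anything.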

        Thank you for the suggestion. I'll try that out.
Re: Multithreading a large file split to multiple files
by marioroy (Prior) on May 15, 2018 at 02:17 UTC

    Hi 10isitch,

    Welcome to the monastery! Without a test sample of the input data, the MCE solution may or may not work, depending on whether or not you are processing multi-line records. Highlight: there is no data passing between the manager process and workers for input and output, which saves on IPC overhead. IO is read and written sequentially, not randomly.

    use strict;
    use warnings;

    use MCE;
    use IO::Handle;   # for autoflush

    if (@ARGV != 3) {
        print "Usage: <BNH scenario file> <Equity scenario output> <IR scenario output>\n";
        exit -1;
    }

    print "\nSTARTING $0...\n";

    my ($infile, $eqfile, $irfile) = @ARGV;

    unless (-e $infile) {
        print "Error: Cannot open $infile!\n";
        exit -1;
    }
    unless (open(EQFILE, ">", $eqfile)) {
        print "Error: Cannot open $eqfile!\n";
        exit -1;
    }
    unless (open(IRFILE, ">", $irfile)) {
        print "Error: Cannot open $irfile!\n";
        exit -1;
    }

    # Must enable autoflush whenever workers
    # write directly to output file handles.
    EQFILE->autoflush(1);
    IRFILE->autoflush(1);

    # The user function for MCE workers.
    # Workers open a file handle to a scalar ref
    # due to using MCE option use_slurpio => 1.
    sub user_func {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my ($eqbuf, $irbuf) = ('', '');
        my ($w1, $w2);

        open INFILE, "<", $slurp_ref;

        # The gist of it all is concatenation to buffer
        # string(s) while inside the loop.
        while (<INFILE>) {
            chomp $_;
            if ($_ =~ /ScenSet/) {
                $eqbuf .= "$_\n";
                $irbuf .= "$_\n";
                next;
            }
            my @arr = split /\,/;
            if ($arr[1]) {
                $eqbuf .= ",equity,Base Scenario,0,\n";
                $irbuf .= ",irate,Base Scenario,0,\n";
                next;
            }
            if ($arr[2]) {
                my @arrn = split(/\_/, $arr[2]);
                $eqbuf .= ",,eq_".$arrn[1].",1,\n";
                $irbuf .= ",,ir_".$arrn[1].",1,\n";
                next;
            }
            if ($arr[6]) { $w1 = $w2 = 0; }

            # Riskfactor contains the string "Index" will be identified
            # as equity riskfactor
            if ($_ =~ /Index/ && $_ !~ /Credit/) { $w1 = 1; }

            # Assumption for IR - there will be only riskfactor
            # USDSWAP and USDTREA
            if ($_ =~ /USDSWAP/ || $_ =~ /USDTREA/) { $w2 = 1; }

            if ($w1) { $eqbuf .= "$_\n"; }
            if ($w2) { $irbuf .= "$_\n"; }
        }

        close INFILE;

        # Workers write directly to output files sequentially
        # and orderly, one worker at a time inside the MCE::relay
        # block. Call this one time only and outside the loop.
        MCE::relay {
            print EQFILE $eqbuf if length($eqbuf);
            print IRFILE $irbuf if length($irbuf);
        };

        return;
    }

    # Using the core MCE API. Workers read the input file
    # directly and sequentially, one worker at a time.
    MCE->new(
        max_workers => 3,
        input_data  => $infile,
        chunk_size  => 2 * 1024 * 1024,   # 2 MiB
        use_slurpio => 1,
        init_relay  => 0,                 # loads MCE::Relay
        user_func   => \&user_func,
    )->run();

    close EQFILE;
    close IRFILE;

    Regards, Mario

Re: Multithreading a large file split to multiple files
by pwagyi (Monk) on May 15, 2018 at 01:38 UTC
    Can you post pseudocode on how splitting logic works?
Re: Multithreading a large file split to multiple files
by cavac (Parson) on May 16, 2018 at 12:43 UTC

    While not an answer to multithreading, you might look into some other performance improvements as well. I don't know the format of your source file, but there might be some things to look for:

    • Do you need to split /,/ all the way, or do you only need the first few columns? Limiting the split might speed things up.
    • Are some of the things you check for by regex actually on a fixed position in the string? substr() might be faster.
    • Are you really limited by the CPU? You might be slowed down by IO. Putting source and destination on different drives might help.
    • The statement print EQFILE ",,eq_".$arrn[1].",1,\n" performs a useless string concatenation. Use commas instead of dots between the string parts. Also, the first fixed string can use single quotes; this removes the need to run the "check if the string contains variables that we need to replace" part of string generation (see the sketch after this list).
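    To make the first and last points concrete, a rough sketch (untested; the field count of 8 is only a guess based on the $arr[6] check in the original code):

    # Limit the split so only the first few fields are separated; anything
    # past the limit stays unsplit in the last element.
    my @arr = split /,/, $_, 8;

    # Pass print a list instead of concatenating, and single-quote the
    # constant parts so Perl doesn't scan them for variables to interpolate.
    print EQFILE ',,eq_', $arrn[1], ",1,\n";
    print IRFILE ',,ir_', $arrn[1], ",1,\n";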
    "For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it get's food on the table."
      One more possible optimization: when you want to check for the presence of constant strings (e.g. Index, USDSWAP) in the input, you can use if (index($_, 'USDSWAP') != -1) { ... } instead of a regex.
