JulioRD has asked for the wisdom of the Perl Monks concerning the following question:

Hello and thanks in advance for any help. I am a Perl amateur and have written a script that needs optimization. The script uses a control file containing the accounts to be processed. It then goes through several .csv files and, whenever it finds one or more matches, writes them to the corresponding worksheets.

The first couple of files take 1-2 minutes each to create, but then performance degrades: files 7 and 8 take 5 minutes each, and so on. The sample data is about 40K rows per csv file and control file. Can you please let me know what optimization could be done to speed up the process, or at least make it more consistent?

Business need

We're trying to migrate data from a legacy core banking application to a cloud-based one. The cloud solution has an interface that imports data from Excel files, and those files must stay under a certain size. Our testing shows that about 500 accounts per Excel file is a good size to import into the cloud solution.

use warnings;
use strict;
use diagnostics;
use Excel::Writer::XLSX;
use Text::CSV;

# Control File
my $AccountControlFile = "bamboo_AccountControlFile.txt";
my @limit_accounts_per_file = 500;
my @printoutput = 0;

# Loan transaction file
#my $LoanTransactionsFileName = "bamboo.loan_transactions.csv";
my $LoanTransactionsFileName = "bamboo.loan_transactions.csv";
my $LoanAccountFileName = "bamboo.loan_accounts.csv";
my $LoanAccountPaymentCalendarFileName = "bamboo.repayment_calendar.csv";

# csv class initialization
my $TransaccionesPrestamoCSV = Text::CSV->new({ sep_char => ',' });
my $CuentasPrestamoCSV = Text::CSV->new({ sep_char => ',' });
my $CalendarioDePagosCSV = Text::CSV->new({ sep_char => ',' });

if ( $printoutput[0] == 1 ) { print "before file handle \n"; }

# Use different files
# bamboo control file
open (my $bambooControlFileHandle, '<:encoding(UTF-8)', $AccountControlFile)
    or die "Could not open file '$AccountControlFile' $!";
# loan transaction
open (my $LoanTransactionFileHandle, '<:encoding(UTF-8)', $LoanTransactionsFileName)
    or die "Could not open file '$LoanTransactionsFileName' $!";
# loan account
open (my $LoanAccountFileHandle, '<:encoding(UTF-8)', $LoanAccountFileName)
    or die "Could not open file '$LoanAccountFileName' $!";
# payment calendar
open (my $LoanPaymentCalendarFileHandle, '<:encoding(UTF-8)', $LoanAccountPaymentCalendarFileName)
    or die "Could not open file '$LoanAccountPaymentCalendarFileName' $!";

if ( $printoutput[0] == 1 ) { print "after file handle \n"; }

# counters everywhere
my @counter_tran_prestamo = 0;
my @counter_cuentas_prestamo = 0;
my @counter_calendario_pago = 0;
my @prev_account = '0';
my @counter_accounts_per_file = 0;
my @counter_files = 0;
my @found_calendario_pago = 0;
my @found_trans = 0;
my @file_seek_tran_prestamo;
my @file_seek_cuentas_prestamo;
my @file_seek_calendario_pago;

my $CuentasPrestamoWS;
my $TransaccionesPrestamoWS;
my $CalendarioPagosWS;
my $workbook;

while ( my $AccountInProcess = <$bambooControlFileHandle> ) {
    #if ( $printoutput[0] == 1 ) { print "Account" . $AccountInProcess; }

    # Create another excel file if not yet writing a file or the max number of accounts has been reached
    if ( $prev_account[0] eq "0" or $counter_accounts_per_file[0] >= $limit_accounts_per_file[0] ) {
        # Another excel file
        $counter_files[0] += 1;
        # new excel file name
        my @filename = 'ExcelDoc' . $counter_files[0] . '.xlsx';
        $workbook = Excel::Writer::XLSX->new( @filename );
        # creating the worksheets
        $CuentasPrestamoWS       = $workbook->add_worksheet( 'Cuentas de Prestamo' );
        $TransaccionesPrestamoWS = $workbook->add_worksheet( 'Transacciones de Prestamo' );
        $CalendarioPagosWS       = $workbook->add_worksheet( 'Calendario de Pagos' );
        $counter_accounts_per_file[0] = 0;
        # first account was processed
        $prev_account[0] = "1";
        # Reset counters
        $counter_cuentas_prestamo[0] = 0;
        $counter_tran_prestamo[0]    = 0;
        $counter_calendario_pago[0]  = 0;
        if ( $printoutput[0] == 1 ) { print "File:" . @filename; print "\n"; }
    }

    # clear any white space
    chomp $AccountInProcess;

    # go through each loan
    while ( my $row = $CuentasPrestamoCSV->getline( $LoanAccountFileHandle ) ) {
        # Found a match for the loan
        if ( $row->[0] eq $AccountInProcess ) {
            # write the row from the csv file to excel
            $CuentasPrestamoWS->write_row( $counter_cuentas_prestamo[0], 0, $row );
            # keep track of line to write in excel
            $counter_cuentas_prestamo[0] += 1;
            # Found one account to process
            $counter_accounts_per_file[0] += 1;
            last;
        }
    }

    # Go through all the payment calendar rows
    while ( my $row = $CalendarioDePagosCSV->getline( $LoanPaymentCalendarFileHandle ) ) {
        # found a match with the account in process
        if ( $row->[0] eq $AccountInProcess ) {
            # write the row into the excel sheet
            $CalendarioPagosWS->write_row( $counter_calendario_pago[0], 0, $row );
            # keep track of the row in the excel sheet
            $counter_calendario_pago[0] += 1;
            if ( $printoutput[0] == 1 ) { @found_calendario_pago = 1; print "did find!!!!!!!!"; }
            # keep track that we found one line
            $found_calendario_pago[0] = 1;
            # keep track of line location. Go back a line if we switch from found to not found.
            $file_seek_calendario_pago[0] = tell $LoanPaymentCalendarFileHandle;
        }
        # did not find an account match with the loan payment calendar. But were we in a block of "found accounts"?
        elsif ( $found_calendario_pago[0] == 1 ) {
            if ( $printoutput[0] == 1 ) { print @found_calendario_pago . "\n"; print "did not find??? \n"; }
            # go back to the previous line
            seek $LoanPaymentCalendarFileHandle, -$file_seek_calendario_pago[0], 1;
            # flag off for calendario de pago found
            $found_calendario_pago[0] = 0;
            # get out of loop and process the next one
            last;
        }
    }

    # Go through all the rows of the loan transaction file
    while ( my $row = $TransaccionesPrestamoCSV->getline( $LoanTransactionFileHandle ) ) {
        # if the row of the csv file is the same as the loan account from the control file
        if ( $row->[0] eq $AccountInProcess ) { #and $row->[0] ne undef and $row->[0] ne "\n") {
            # write row in worksheet
            $TransaccionesPrestamoWS->write_row( $counter_tran_prestamo[0], 0, $row );
            # move to next row on excel sheet for tran prestamos
            $counter_tran_prestamo[0] += 1;
            # found the transaction; if the next one is not found, stop the filehandle on that line
            $found_trans[0] = 1;
            # File is ordered by account number: whole block of loan transactions.
            # Need the position of the filehandle to go back a line if the next line is a different account.
            $file_seek_tran_prestamo[0] = tell $LoanTransactionFileHandle;
        }
        # account not found, so go back a line
        elsif ( $found_trans[0] == 1 ) {
            # reset flag to not found for next account in control file
            $found_trans[0] = 0;
            # go back a line, leave filehandle in same position
            seek $LoanTransactionFileHandle, -$file_seek_tran_prestamo[0], 1;
            # get out of loop
            last;
        }
    } # end while loan payments
} # End while for excel files

close $LoanTransactionFileHandle;
close $LoanAccountFileHandle;
print "final";

Replies are listed 'Best First'.
Re: Optimization of script
by ww (Archbishop) on Aug 18, 2016 at 15:40 UTC

    Without studying your code (220 lines is a lot to ask of freebie help; 20 lines illustrating the same problem are much preferred), you might want to try profiling your code -- Devel::DProf, Devel::NYTProf, etc. -- though your problem may be in the access time for data stored (somewhere... where?) by your "legacy...application." Also, you're running a lot of counters, each of which eats up cycles and could, as they and your export files get larger, slow your process.

    And (afterthought!) if your data is in CSV files, why not import the old data into a DB -- MySQL, SQLite, etc. -- and let the database engine do the counting and sorting? If necessary (and you may be able to export more-or-less directly to Excel files), the worst (IMO) that requires is re-exporting to CSVs and then exporting/storing those as Excel files.

    Hope this helps; I know it's not a careful review of your code nor an Rx for optimizing it, but as I said above, a post with 200+ lines of code (and no data) is a lot with which to burden the Monks. Please see On asking for help & How do I post a question effectively?.


    ++$anecdote ne $data

      Thanks ww. I appreciate you taking the time to comment on the post. I do understand it is a lot of code, but since I'm just starting to learn Perl, there are a lot of things I'm not familiar with, and I did not want to leave anything out in case there's something obvious I have not done. Apologies for the lengthy post (and now the long response, lol). Guess I was more focused on the solution than on communicating it well on the forum. Will do better next time.

      Thanks for the info about the profiler and other options. This is a one-time script for a very specific task, a one-and-done type of deal, so I need to weigh all the options and see how much more effort to dedicate. Thanks again from the Dominican Republic. Come visit.

Re: Optimization of script
by dcmertens (Scribe) on Aug 18, 2016 at 16:00 UTC

    Per your request, I am going to focus on how to leverage Perl to make this script run quickly. There are a lot of little stylistic things that would make the code both shorter and more legible. I'm not the style police, though, and won't offer stylistic advice unless you would like to have it.

    Looking at your implementation, your code only makes sense if you are working on accounts that have already been sorted by account number. If that's not the case, then I can't see how this would work at all. Furthermore, the control file would contain a subset of accounts to transfer (otherwise you wouldn't be looping through rows in LoanAccountFileHandle) and each account is associated with only one loan, or the control file has multiple lines repeating the same account number, one for each loan. This seems like a brittle approach, but who cares, it's a one-time transition script, and presumably you have a robust method for creating the control file.

    With those assumptions in place, it looks to me like your algorithm should scale roughly linearly with the number of lines processed. So why would it slow down? The only thing I can come up with is that you're hitting memory issues on your machine. This could arise if either the Excel writer or the CSV reader is leaking memory (not that they're intrinsically leaky; it's just my guess). When your script is running, what does the RAM consumption look like?

      Just thinking aloud here: couldn't you split your master control file into a number of smaller control files, and then run this script on them separately? That way you get what you want without having to figure out what's really going wrong. :-)
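
      Something like this minimal, untested sketch (the input file name is borrowed from the original post; the chunk size and output names are just illustrative):

      use strict;
      use warnings;

      # Split the master control file into chunks of 500 accounts each.
      my $chunk_size = 500;
      my ( $count, $chunk, $out ) = ( 0, 0, undef );

      open my $in, '<', 'bamboo_AccountControlFile.txt' or die "Could not open control file: $!";
      while ( my $line = <$in> ) {
          if ( $count % $chunk_size == 0 ) {
              close $out if $out;          # finish the previous chunk, if any
              $chunk++;
              open $out, '>', "control_part$chunk.txt" or die "Could not open output: $!";
          }
          print $out $line;
          $count++;
      }
      close $out if $out;
      close $in;

      Then run the existing script once per control_part*.txt file.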
Re: Optimization of script
by Cristoforo (Curate) on Aug 18, 2016 at 20:48 UTC
    Like ww said, the post is pretty long to try to analyze, and I can't comment on the correctness of the worksheets you're creating.

    I'm assuming the accounts in the control file are in the same order as in the csv files, and that the records matching the 'AccountInProcess' in the csv files are one after the other (if there is more than 1 record in the csv file matching the 'AccountInProcess').

    I think the seek statements are wrong, and they may be the reason for the slowdown. They are seeking backwards more than 1 line.

    seek $LoanPaymentCalendarFileHandle, -$file_seek_calendario_pago[0], 1;

    I think you need

    seek $LoanPaymentCalendarFileHandle, $file_seek_calendario_pago[0], 0;
    This version of seek positions the file to start reading on the line immediately after the last successful match, which is what you want. Your seek moves it way back to near the beginning of the file, so you are effectively rereading the whole file from the beginning each time.
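
    To illustrate the pattern (an untested sketch; the handle and variable names are made up, and it assumes each account's records sit in one contiguous, sorted block):

    my $found     = 0;
    my $resume_at = 0;
    while ( my $row = $csv->getline($fh) ) {
        if ( $row->[0] eq $account ) {
            $found = 1;
            # ... write the row to the worksheet ...
            $resume_at = tell $fh;      # absolute byte offset just past this line
        }
        elsif ($found) {
            seek $fh, $resume_at, 0;    # whence 0 = absolute position, not relative
            last;                       # the scan for the next account resumes here
        }
    }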

    There are a few items in your program that could be improved.

    # counters everywhere
    my @counter_tran_prestamo = 0;
    my @counter_cuentas_prestamo = 0;
    my @counter_calendario_pago = 0;
    my @prev_account = '0';
    my @counter_accounts_per_file = 0;
    my @counter_files = 0;
    my @found_calendario_pago = 0;
    my @found_trans = 0;
    my @file_seek_tran_prestamo;
    my @file_seek_cuentas_prestamo;
    my @file_seek_calendario_pago;
    These variables could all be declared as scalars instead of declared as arrays. You don't use them as arrays in your program anyway.

    So you could say, for instance, $found_calendario_pago = 0; instead of $found_calendario_pago[0] = 0;. (The same for all the other counters and flags.)

    Declare them like this instead:

    my $counter_tran_prestamo = 0;
    my $counter_cuentas_prestamo = 0;
    my $counter_calendario_pago = 0;
    my $prev_account = '0';
    my $counter_accounts_per_file = 0;
    my $counter_files = 0;
    my $found_calendario_pago = 0;
    my $found_trans = 0;
    my $file_seek_tran_prestamo;
    my $file_seek_cuentas_prestamo;
    my $file_seek_calendario_pago;
Re: Optimization of script
by GotToBTru (Prior) on Aug 18, 2016 at 21:37 UTC

    First of all, I think ww is exactly right in this post with the suggestion to use a database. With proper indexing, matching rows between tables will run very quickly indeed.
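
    For example, here is a rough, untested sketch of that idea using DBI with DBD::SQLite (the database, table, and column names are invented; it loads one CSV and keys it by the first column):

    use strict;
    use warnings;
    use DBI;

    # Load one CSV into an indexed SQLite table, then look accounts up
    # instantly instead of rescanning the file for every account.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=migration.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS transactions (account TEXT, line TEXT)');
    my $ins = $dbh->prepare('INSERT INTO transactions (account, line) VALUES (?, ?)');

    open my $fh, '<', 'bamboo.loan_transactions.csv' or die "open: $!";
    while ( my $line = <$fh> ) {
        my ($account) = split /,/, $line, 2;   # first column is the account number
        $ins->execute( $account, $line );
    }
    close $fh;

    $dbh->do('CREATE INDEX IF NOT EXISTS idx_account ON transactions (account)');
    $dbh->commit;

    # Now each control-file account is a single indexed lookup:
    my $sel = $dbh->prepare('SELECT line FROM transactions WHERE account = ?');
    $sel->execute('12345');                    # hypothetical account number
    while ( my ($line) = $sel->fetchrow_array ) {
        print $line;
    }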

    I get the idea that the approach is to try to move through all 3 data files, keeping careful track of your progress so that you don't lose your place. You might find it useful to write a subroutine that handles the searching through each file, keeping a pointer to the last place something was found.

    The following is crude, and incomplete, but I hope gives some idea of what I'm suggesting. The hash %output uses the input file names as a key, and contains a reference to an array which will hold the lines that need to be written out to the corresponding output file. The hash %filepointers is used to keep track of the last spot in the input files where an account was found.

    use strict;
    use warnings;

    open my $controlFileHandle, '<', 'control.csv';
    my ($file1, $file2, $file3) = qw/file1.csv file2.csv file3.csv/;
    my (%output, %filepointers, $ofh, $infile, $line, $account, $accountline);

    while ($accountline = <$controlFileHandle>) {
        $account = (split /,/, $accountline, 2)[0];
        lookForAccountInFile($account, $file1);
        lookForAccountInFile($account, $file2);
        lookForAccountInFile($account, $file3);
    }

    foreach $infile (keys %output) {
        open $ofh, '>', "new_$infile";
        foreach $line (@{$output{$infile}}) {
            print $ofh $line;
        }
        close $ofh;
    }

    sub lookForAccountInFile {
        my ($account, $file) = @_;
        open my $ifh, '<', $file;
        if (defined $filepointers{$file}) {
            seek $ifh, $filepointers{$file}, 0;
        }
        my $found = 0;
        while (my $line = <$ifh>) {
            last if ($line eq "\n");
            my $la = (split /,/, $line, 2)[0];
            last if ($la > $account);
            if ($la == $account) {
                push @{$output{$file}}, $line;
                $found = 1;
            }
            $filepointers{$file} = tell($ifh);
            last if ($found);
        }
    }

    Updated to compile and run. I will show my test files below. To answer specifically how this approach helps with optimizing: 1) using a subroutine means you need only fine-tune the code once, and then gain the benefit as many times as you use it; 2) it reduces considerably the number of arrays and other variables. I did not use the CSV module since it appeared you were only using the very first column in each record. If you need to explore the records more in depth, you will absolutely want to incorporate that module; it handles all sorts of special cases that would trip up this approach. One of your requirements that I did not implement was the account limit per file. But if I did everything, you'd have no opportunity to learn! ;)

    Data:

    control.csv:
    1,control,record,1
    2,control,khrecord,2
    5,control,recordi,5
    7,control,record,7

    file1.csv:
    1,file,1,record,1
    2,file,1,record,2
    3,file,1,record,3

    file2.csv:
    2,record,1,file,2
    4,record,2,file,2
    5,record,3,file,2
    6,record,4,file,2
    7,record,5,file,2

    file3.csv:
    4,file,3,record,4
    5,file,3,record,5
    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

Re: Optimization of script
by Marshall (Canon) on Aug 19, 2016 at 08:49 UTC
    I liked Cristoforo's comments++ about the seek statements.
    =========== code extract ==========

            $file_seek_calendario_pago[0] = tell $LoanPaymentCalendarFileHandle;
        }
        # did not find an account match with the loan payment calendar. But were we in a block of "found accounts"?
        elsif ( $found_calendario_pago[0] == 1 ) {
            if ( $printoutput[0] == 1 ) { print @found_calendario_pago . "\n"; print "did not find??? \n"; }
            # go back to the previous line
            seek $LoanPaymentCalendarFileHandle, -$file_seek_calendario_pago[0], 1;

    ========= code extract =========

            $file_seek_tran_prestamo[0] = tell $LoanTransactionFileHandle;
        }
        # account not found, so go back a line
        elsif ( $found_trans[0] == 1 ) {
            # reset flag to not found for next account in control file
            $found_trans[0] = 0;
            # go back a line, leave filehandle in same position
            seek $LoanTransactionFileHandle, -$file_seek_tran_prestamo[0], 1;
            # get out of loop
            last;
    First, "tell" operates on bytes, not characters. These files are opened in UTF8 mode.
    seek $LoanTransactionFileHandle, -$file_seek_tran_prestamo[0], 1;
    This statement says to move backwards from the current position by however many bytes the handle was into the file when the "tell" measurement was taken. Since the current position is only a line or so past that point, I suspect that means seeking to the beginning of the file, or near to it!
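
    To make the arithmetic concrete (the offsets are invented for illustration):

    my $saved = tell $fh;    # say we are 1_000_000 bytes into the file
    my $row   = <$fh>;       # read one more line; now at roughly 1_000_050
    seek $fh, -$saved, 1;    # whence 1 = relative: 1_000_050 - 1_000_000 = 50
                             # i.e. back near the very start of the file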

    File operations like "seek" can be "expensive" in execution time if buffers are flushed and an actual disk operation has to be performed -- seeking this far backwards will definitely do that. Then a whole bunch of I/O is required to get back to where you were.

    I would certainly consider keeping the 2 files that you use "seek" upon in memory, if possible. If they are just a few hundred MB, that is certainly feasible even on a 32-bit machine. Of course you could build your own cache so that the seek isn't necessary, but that takes more coding that perhaps won't be used again.
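
    A rough, untested sketch of the in-memory idea (the file name is borrowed from the original post): slurp each CSV once into a hash keyed by account number, so every later lookup is a hash access instead of a file scan with seeks:

    use strict;
    use warnings;
    use Text::CSV;

    # Read the whole CSV once, grouping rows by account number (column 0).
    my $csv = Text::CSV->new( { sep_char => ',' } );
    my %rows_by_account;

    open my $fh, '<:encoding(UTF-8)', 'bamboo.loan_transactions.csv' or die "open: $!";
    while ( my $row = $csv->getline($fh) ) {
        push @{ $rows_by_account{ $row->[0] } }, $row;
    }
    close $fh;

    # Later, for each account in the control file, no file scan at all:
    for my $row ( @{ $rows_by_account{'12345'} // [] } ) {   # hypothetical account
        # ... write $row to the worksheet ...
    }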

    On the other hand, if you have something that you are sure produces the right result and it can complete over the weekend, then maybe what you have is "good enough" for this one-time job? Consider throwing hardware at it: a 64-bit machine with a lot of memory. Just a thought.

    Update: this is odd: elsif ( $found_trans[0] == 1 ); perhaps != 0 is right? There is no way to "seek back one line of text". You can only seek back some number of bytes, which in the case of UTF-8 probably doesn't mean an even number of characters. If the file were an array of lines in memory, then of course this would be much easier: move back an index number, or perhaps unshift the line so that it can be "shifted back out again" in certain while constructs.
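
    A tiny, untested sketch of that unshift idea (names invented; same contiguous-block assumption as the original script):

    open my $fh, '<:encoding(UTF-8)', 'bamboo.loan_transactions.csv' or die "open: $!";
    my @lines = <$fh>;    # whole file in memory, read once
    close $fh;

    my $wanted = '12345';    # current control-file account (illustration)
    my $found  = 0;
    while ( my $line = shift @lines ) {
        my ($acct) = split /,/, $line, 2;
        if ( $acct eq $wanted ) {
            $found = 1;
            # ... process the matching line ...
        }
        elsif ($found) {
            unshift @lines, $line;   # put the overshoot line back for the next account
            last;
        }
    }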