sainky has asked for the wisdom of the Perl Monks concerning the following question:

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question: Hello All, I have two excel files containing 58000 and 30000 entries respectively. I want to find out how many entries of file_1 are present in file_2. I have written below program but it is taking days to execute, Can anyone have better solution?

#Modules to use use Cwd 'abs_path'; use Win32::OLE; use Win32::OLE qw(in with); use Win32::OLE::Const "Microsoft Excel"; $Win32::OLE::Warn = 3; my $start_run = time(); print"[USAGE]: <demoscript.pl> <.map file path> <Ouput excel path>\n"; print"[INFO]: Executing demoscript.pl V1.0.1.\n"; #Global Variables my @longNames; my $SystemPortTracingReport = $ARGV[0]; my $CLEA_Family = $ARGV[1]; my $GlobalB_HighFeature = $ARGV[2]; my $Excel =Win32::OLE->new('Excel.Application'); # Open Excel file my $Book = $Excel->Workbooks->Open($SystemPortTracingReport); my $Book2 = $Excel->Workbooks->Open($CLEA_Family); #Make Excel visible $Excel->{Visible} = 1; my $Sheet = $Book->Worksheets("GlobalB_HighFeature"); my $Sheet2 = $Book2->Worksheets("GlobalB_HighFeature_SerialDataR"); my $longNamecol; my $Signalcolumn; for($i=1; $i<30;$i++) { if($Sheet->Cells($startingrow, $i)->{'Value'} eq "Long Name") { $longNamecol = $i; } } for($i=1; $i<30;$i++) { if($Sheet2->Cells($startingrow, $i)->{'Value'} eq "Long Name") { $Signalcolumn = $i; } } print"\n$longNamecol\n"; print"\n$Signalcolumn\n"; my $emptyline=0; for(my $k=2; $k<521754; $k++) { if($Sheet->Cells($k,$longNamecol)->{'Value'} eq "") { $emptyline++; } else { push(@longNames, $Sheet->Cells($k,$longNamecol)->{'Value'} +); # my $temp = $Sheet->Cells($k,$longNamecol)->{'Value'}; # print"\n $temp"; $count++; } } my $data; my $length; $length = scalar(@longNames); print"Length $length"; print"Searching...........\n"; $Book->Save; #Or $Book->SaveAs("C:\\file_name.xls"); $Book->Close; #or $Excel->Quit; for(my $m=0; $m<$length; $m++) { for(my $l=2; $l<2990; $l++) { if($Sheet2->Cells($l,$Signalcolumn)->{'Value'} eq "") { next; } else { if($Sheet2->Cells($l,$Signalcolumn)->{'Value'} eq $longNam +es[$m]) { $Sheet2->Cells($l,100)->{'Value'} = "matched"; print"\n***********Matched*************\n"; } } } } print"\n**********finished!**************\n"; #Save Excel file $Book2->Save; #Or $Book->SaveAs("C:\\file_name.xls"); $Book2->Close; #or $Excel->Quit; my $end_run = time(); my $run_time = $end_run - $start_run; print"\n[INFO]Timetaken: $run_time [Secs]";

Replies are listed 'Best First'.
Re: Xls reader (updated)
by haukex (Archbishop) on Jan 02, 2019 at 09:31 UTC

    Two things I see right off the bat: First, you've got nested loops (for(my $m=0; $m<$length; $m++) { for(my $l=2; $l<2990; $l++) { ...) for a simple string equality search, where the outer loop could be removed and replaced by a much faster hash lookup. Second, inside the loop, you're making a ton of OLE calls with $Sheet2->Cells($l,$Signalcolumn)->{'Value'} - it'd be better if you cache this in a local variable. Something like this (untested):

    my %longNames = map { ($_=>1) } @longNames; for my $l (2..2990) { my $v = $Sheet2->Cells($l, $Signalcolumn)->{'Value'}; next unless length $v; if ( $longNames{$v} ) { $Sheet2->Cells($l, 100)->{'Value'} = "matched"; print"\n***********Matched*************\n"; } }

    Update: Now tested, and I can confirm it runs much faster. Note that you really should Use strict and warnings, you've got a couple of undeclared variables like $startingrow.

Re: Xls reader
by Corion (Patriarch) on Jan 02, 2019 at 09:36 UTC

    In addition to haukex good comment, look at this line:

    for(my $k=2; $k<521754; $k++)

    In the case that your Excel sheet does not have 521754 rows, consider using $Sheet2->UsedRange->Rows instead:

    Update: See the corrections by haukex below!

    my $used_rows = $Sheet2->UsedRange->Rows(); for my $k (2..$used_rows) { }
      $Sheet2->UsedRange->Rows

      Unfortunately I don't think that works, since Rows still returns an object, and in numeric context returns the memory address. Also, AFAIK, if the sheet has empty rows and columns at the beginning, the UsedRange can begin somewhere other than A1 - what I've been using is something like this:

      my $UsedRows = 0+$Sheet->UsedRange->{Rows}->{Count}; my $FirstRow = 0+$Sheet->UsedRange->Rows(1)->{Row}; my $LastRow = $FirstRow + $UsedRows - 1; my $UsedCols = 0+$Sheet->UsedRange->{Columns}->{Count}; my $FirstCol = 0+$Sheet->UsedRange->Columns(1)->{Column}; my $LastCol = $FirstCol + $UsedCols - 1;
Re: Xls reader
by sainky (Initiate) on Jan 02, 2019 at 12:40 UTC
    @haukex,@Corion: Thank you so much :) It worked and it is much faster now, I am using strict and warnings now :)