ravi45722 has asked for the wisdom of the Perl Monks concerning the following question:

I wrote the code to run sequentially at first, and it takes 186 wall-clock seconds to read all the files. To reduce the time I forked and split the load across two processes. After creating the processes it takes 261 wall-clock seconds. What mistake am I making? I thought that creating processes and running them in parallel would reduce the execution time, but it increased. How?

sub SMSBcastCDR {
    # doing operation on files
}

sub SMSCDR {
    # doing operation on files
}

LINKS:
foreach my $linkarray (1 .. 2) {
    $pm->start and next LINKS;    # do the fork
    if ($first == 1) {
        my @cdr_list1 = `ls $cdr_directory/SMSBcastCDR_*_${bcat_cdrdate}_*.log`;
        print "cdrs_file1 = @cdr_list1\n";
        SMSBcastCDR(@cdr_list1);
        $first++;
    }
    if ($first == 2) {
        my @smsc_cdr_list = `ls $smscdr_directory/SMSCDR_P*_$cdrdate*.log`;
        SMSCDR(@smsc_cdr_list);
    }
    $pm->finish;    # do the exit in the child process
}
$pm->wait_all_children;

Replies are listed 'Best First'.
Re: Problem in creating process
by BrowserUk (Patriarch) on Nov 23, 2015 at 09:42 UTC
    I thought that creating processes and running them in parallel would reduce the execution time, but it increased. How?

    Take out a paper telephone directory -- if you still have one, otherwise a dictionary will do -- then get a friend to help you, and both of you try to look up a different number (or word) at the same time!

    That's what happens when two processes both try to use a single disk drive at the same time.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Good example, I understand now. Is there any other way to make it a little bit faster?

Re: Problem in creating process
by hippo (Archbishop) on Nov 23, 2015 at 09:59 UTC
    I thought that creating processes and running them in parallel would reduce the execution time, but it increased. How?

    In addition to BrowserUK's excellent analogy you might consider that parallelisation has overheads. You might also overextend the RAM into swapping when running in parallel which might not happen with sequential processing.

    In any case, did you not profile the code to determine where the bottlenecks lie before trying to optimise? Without analysis, remedial action is pure guesswork.
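
    For a first cut, even something as simple as bracketing each phase with Time::HiRes timers shows which sub dominates the runtime (Devel::NYTProf gives a far more detailed picture). The sketch below is illustrative only; the two subs are stubs standing in for the OP's real SMSBcastCDR() and SMSCDR():

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Stubs standing in for the OP's real subs -- replace with the real work.
        sub SMSBcastCDR { sleep 1 }
        sub SMSCDR      { sleep 2 }

        my $t0 = [gettimeofday];
        SMSBcastCDR();
        printf "SMSBcastCDR took %.2f s\n", tv_interval($t0);

        $t0 = [gettimeofday];
        SMSCDR();
        printf "SMSCDR      took %.2f s\n", tv_interval($t0);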

      did you not profile the code to determine where the bottlenecks lie before trying to optimise?

      Given the code the OP has supplied, there is little extra information that might be gleaned from profiling.



        It could be that the machine has only one CPU available, or that the whole process is IO-bound, or that the programs are starved for RAM. "Profiling" at this high level, i.e. watching a task monitor to see what is maxed out, could at least help to decide how to optimize or, in the IO-bound case, to stop optimizing and buy different hardware.

        Given the code the OP has supplied

        That's the pertinent point. The OP has not supplied the full code, so we can only guess as to what the rest of it is doing. Further, the fact that the reported durations are wall-clock times means that any other process running on the machine could also be skewing the figures.

        That's not to say that your suggestion that the processes are IO-bound is wrong or even unlikely, and I too suspect that this will turn out to be the limiting factor in the OP's case. But it is just a guess at this point, even if an educated one.

        I believe that hippo is suggesting that he should have profiled the original. The parallel approach appears to be a 'guess'.
        Bill
Re: Problem in creating process
by BrowserUk (Patriarch) on Nov 23, 2015 at 10:21 UTC

    The trick to optimising IO-bound processes is to: a) serialise your disk usage; and b) overlap processing of one file with the reading of the next.
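
    For instance, the following is only a sketch (the glob pattern and process_lines() are placeholders, not the OP's code): one reader thread keeps the disk busy sequentially while the main thread does the CPU work on the file that has already been read.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my @files = glob 'SMSCDR_P*.log';   # placeholder file set
        my $q     = Thread::Queue->new;     # real code would bound this queue

        # Reader thread: slurp the files one at a time, in order, and queue them.
        my $reader = threads->create( sub {
            for my $file (@files) {
                open my $fh, '<', $file or die "open $file: $!";
                my $content = do { local $/; <$fh> };   # slurp whole file
                close $fh;
                $q->enqueue( [ $file, $content ] );
            }
            $q->enqueue( undef );                       # end-of-work marker
        } );

        # Main thread: process the current file while the next one is being read.
        while ( defined( my $item = $q->dequeue ) ) {
            my ( $file, $content ) = @$item;
            process_lines( $file, split /\n/, $content );
        }
        $reader->join;

        sub process_lines {
            my ( $file, @lines ) = @_;
            # placeholder for the real per-record accumulation
            print "$file: ", scalar @lines, " lines\n";
        }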

    To devise an optimal strategy you need to know some basic information:

    1. How many files of each type? (Average numbers per run is fine.)
    2. How big are those files? (Again average sizes in bytes and lines would be most useful.)

    With that information and a suitable strategy, it may be possible to reduce the overall processing time significantly.



      Here is the data for one file of each type. The SMSBcastCDR files always contain less data, but we can't say how many files there will be of each type; it depends entirely on the traffic. Yesterday, though, there were 92 files for SMSBcastCDR and 46 files for SMSCDR.

      "SMSCDR_POSTPAID_151106160000_10.xx.x.xx_RS10.log" [dos] 1182594L, 497460662C
      "SMSBcastCDR_6020151056461598_2015110615_10.xx.x.xx_Submit_uni_bihar.log" [dos] 443437L, 72184936C

        How big (in records) are these files?


Re: Problem in creating process
by Laurent_R (Canon) on Nov 23, 2015 at 08:34 UTC
    For just two independent processes, I would probably use the shell to launch the two processes in the background.

    As for your question, I do not see any reason why it should take longer with parallel processes, but we do not know what SMSCDR and SMSBcastCDR are doing.

    Update: spell correction: s/to independent/two independent/

      Here is the code for those functions. They open the files and build some hashes to upload into the DB. SMSBcastCDR always completes its work first (due to less data). Before uploading the values into the DB, I wait for both processes to complete their work.

      sub SMSBcastCDR {
          my @array = @_;
          foreach my $cdr_file (@array) {
              chomp $cdr_file;
              print "file name=$cdr_file\n";
              open (FPO, "$cdr_file") or die "Could not open $cdr_file\n";
              while (my $line = <FPO>) {
                  #Tue Jun 09 2009 23:00:07.225|user2|KAuser2|2222|919846231229|1520095992858641|TextSMS|2009-Jun-09 23:00:07.225||150|Submitted||231229|0|Success|
                  chomp ($line);
                  my ($cdr_gen_time, $system_id, $msg_submission_time, $msg_delivery_time, $record_type, $last_failure_reason)
                      = (split /\|/, $line)[0, 1, 7, 8, 10, 14];
                  #2009-Jun-09 23:00:07.225
                  my ($year, $mon_txt, $day, $hr) = (split /[-:\s]+/, $msg_submission_time)[0, 1, 2, 3];
                  my $statsdate = "$hr:$year-$month_name{$mon_txt}-$day";
                  if ($record_type eq "Flood_Control_Check") {
                      #Mon Jul 19 2010 15:36:15.682
                      $msg_submission_time = $cdr_gen_time;
                      my ($mon_txt, $day, $year, $hr) = (split /[-:\s]+/, $cdr_gen_time)[1, 2, 3, 4];
                      $statsdate = "$hr:$year-$month_name{$mon_txt}-$day";
                  }
                  #print "$record_type:$statsdate\n";
                  if ($msg_submission_time ne "") {
                      push (@date_list, $statsdate) unless $seen{$statsdate}++;
                  }
                  else {
                      my ($txtdate, $time) = split(/\ /, $msg_delivery_time);
                      my ($year, $mon_txt, $day) = split(/\-/, $txtdate);
                      my $mon = $month_name{$mon_txt};
                      my ($hr, $min, $sec) = split(/\:/, $time);
                      $statsdate = "$hr:$year-$mon-$day";
                      if ($msg_delivery_time ne "") {
                          push (@date_list, $statsdate) unless $seen{$statsdate}++;
                      }
                  }
                  $hour_info{$statsdate}{$hr} = 0;
                  $system_id_info{$statsdate}{$system_id} = 0;
                  if ($record_type eq "Submitted") {
                      $sub_data{$statsdate}{$system_id}++;
                      $sub_data{$statsdate}{'all'}++;
                  }
                  elsif ($record_type eq "DND_Check" || $record_type eq "Filter_Check"
                      || $record_type eq "Special_Character_Check" || $record_type eq "Message_Length_Check"
                      || $record_type eq "Invalid_Message" || $record_type eq "NID_Check"
                      || $record_type eq "Mobile_No_Check" || $record_type eq "Flood_Control_Check") {
                      push (@rej_error_list, $last_failure_reason) unless $rej_seen{$last_failure_reason}++;
                      $rej_data{$statsdate}{$system_id}++;
                      $rej_data{$statsdate}{'all'}++;
                      $bcast_errcnt{$statsdate}{$system_id}{$last_failure_reason}++;
                      $bcast_errcnt{$statsdate}{'all'}{$last_failure_reason}++;
                  }
                  elsif ($record_type eq "SubmitResp") {
                      if ($last_failure_reason ne "Success") {
                          push (@rej_error_list, $last_failure_reason) unless $rej_seen{$last_failure_reason}++;
                          $smsc_rej_data{$statsdate}{$system_id}++;
                          $smsc_rej_data{$statsdate}{'all'}++;
                          $bcast_errcnt{$statsdate}{$system_id}{$last_failure_reason}++;
                          $bcast_errcnt{$statsdate}{'all'}{$last_failure_reason}++;
                      }
                  }
              }
              close (FPO);
              `mv $cdr_file $bcastdir`;
          }
      }
      sub SMSCDR {
          my @array = @_;
          foreach my $cdr_file (@array) {
              chomp $cdr_file;
              print "File:::::: $cdr_file", $/;
              open (FPS, "$cdr_file") or die "Could not open $cdr_file\n";
              while (my $line = <FPS>) {
                  chomp ($line);
                  my (
                      $msg_submission_time,    #5
                      $status,                 #8
                      $record_type,            #10
                      $last_failure_reason,    #12
                      $orig_interface,         #13
                      $dcs,                    #15
                      $segment_number,         #23
                      $system_id,              #56
                  ) = (split /\|/, $line)[4, 7, 9, 11, 12, 14, 22, 55];
                  #Wed Jun 10 16:58:58 2009
                  my ($mon_txt, $day, $hr, $year) = (split /[-:\s]+/, $msg_submission_time)[1, 2, 3, 6];
                  my $statsdate = "$hr:$year-$month_name{$mon_txt}-$day";
                  #print "statsdate :$statsdate\n";
                  push (@date_list, $statsdate) unless $seen{$statsdate}++;
                  $hour_info{$statsdate}{$hr} = 0;
                  if (($orig_interface eq "SMPP") && ($record_type eq "Submit")) {
                      my ($segment_no, $x) = split (/\//, $segment_number);
                      if ((lc($segment_no) eq "none") || ($dcs eq "8") || ($dcs eq "16")
                          || (($dcs eq "ASCII") && ($segment_no eq "1"))) {
                          $system_id_info{$statsdate}{$system_id} = 0;
                          if ($sms_sub_data{$statsdate}{$system_id} eq "") {
                              $sms_sub_data{$statsdate}{$system_id} = 0;
                              $del_data{$statsdate}{$system_id}     = 0;
                              $undel_data{$statsdate}{$system_id}   = 0;
                              $exp_data{$statsdate}{$system_id}     = 0;
                          }
                          $sms_sub_data{$statsdate}{$system_id}++;
                          $sms_sub_data{$statsdate}{'all'}++;
                          if ($status eq "Delivered") {
                              $del_data{$statsdate}{$system_id}++;
                              $del_data{$statsdate}{'all'}++;
                              if ($last_failure_reason eq "SMSC_no_error") {
                                  $last_failure_reason = "First_Attempt_Delivered";
                              }
                              else {
                                  $last_failure_reason = "Delivered_After_Retry";
                              }
                              $del_errcnt{$statsdate}{$system_id}{$last_failure_reason}++;
                              $del_errcnt{$statsdate}{'all'}{$last_failure_reason}++;
                          }
                          elsif ($status eq "Expired") {
                              $exp_data{$statsdate}{$system_id}++;
                              $exp_data{$statsdate}{'all'}++;
                              push (@exp_rej_error_list, $last_failure_reason) unless $exp_seen{$last_failure_reason}++;
                              $exp_errcnt{$statsdate}{$system_id}{$last_failure_reason}++;
                              $exp_errcnt{$statsdate}{'all'}{$last_failure_reason}++;
                          }
                          else {
                              $undel_data{$statsdate}{$system_id}++;
                              $undel_data{$statsdate}{'all'}++;
                              push (@undel_rej_error_list, $last_failure_reason) unless $undel_seen{$last_failure_reason}++;
                              $undel_errcnt{$statsdate}{$system_id}{$last_failure_reason}++;
                              $undel_errcnt{$statsdate}{'all'}{$last_failure_reason}++;
                          }
                      }
                  }
              }
              close (FPS);
              `mv $cdr_file $smscdr_archive_directory`;
          }
      }

        Where do these hashes used in these subroutines come from? %hour_info, %system_id_info, %sub_data, %rej_data, %bcast_errcnt, %smsc_rej_data, %sms_sub_data, %del_data, %undel_data, %del_errcnt, %exp_data, %exp_errcnt etc.

        I've probably missed a few and there are arrays as well.

        Are they globals? What do you do with the data you've accumulated in them? Is the data accumulated and reset on a per file basis? Or accumulated across all the files of the given type?
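
        One point worth spelling out because of the fork: if those hashes are package globals, each Parallel::ForkManager child fills its own copy and the parent never sees the result. A minimal sketch of the module's data-retrieval mechanism (the hash names and counts here are made up, not the OP's; it needs a reasonably recent Parallel::ForkManager):

            use strict;
            use warnings;
            use Parallel::ForkManager;

            my $pm = Parallel::ForkManager->new(2);
            my %totals;    # parent-side accumulation

            # The parent collects whatever reference each child passes to finish().
            $pm->run_on_finish( sub {
                my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
                return unless $data;    # child sent nothing back
                $totals{$_} += $data->{$_} for keys %$data;
            } );

            for my $task ( 'bcast', 'smsc' ) {
                $pm->start and next;               # parent keeps looping
                my %counts = ( $task => 42 );      # child's own accumulation
                $pm->finish( 0, \%counts );        # ship it back to the parent
            }
            $pm->wait_all_children;
            print "$_ => $totals{$_}\n" for sort keys %totals;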


Re: Problem in creating process
by BrowserUk (Patriarch) on Nov 23, 2015 at 15:38 UTC

    On the basis of the information provided so far (notwithstanding that some of it is still missing), the single simplest, cheapest and most effective thing you could do to speed up this processing would be to buy two SSDs and add them to your machine. Arrange for each of the two sets of files to be written (or copied) to a different SSD, and then run two separate processes, one for each set of files.

    Done correctly, that would reduce your overall elapsed time to the runtime of the larger of the two datasets. I.e. if they were approximately equal, it would roughly halve your processing time.

