taj_ritesh has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I have a script that generates customized patterns and counts the number of occurrences of a selected pattern.

I am running it on a file of about 15000 pages, and the problem is that it runs for a very long time and never completes.

Can you please check my code and help me fix it so that it runs quickly and completes?

#!/usr/bin/perl
###############################################
sub get_first_elements_of_string {
    my @a = (split (' ' ,"$_[0]"));
    return $a[0];
}
###############################################
sub array_search {
    my ($elem, @arr) = @_;
    my $flag = -2;
    $mn = 0;
    foreach $n (@arr) {
        if ($n eq $elem) {
            $flag = $mn;
            last;
        }
        $mn++;
    }
    return $flag;
}
###############################################
sub get_data_path_report {
    my @spep = (m/ (\S+).*Endpoint: (\S+).*/msg);
    my $s_p = $spep[0];
    my $e_p = $spep[1];
    my @data_path = (m/ .*?Endpoint: .*?$s_p(.*?)$e_p.*?data arrival time/msg);
    my $data_path_length = @data_path;
    my @A = ();
    if($data_path_length == 1) {
        print "\npath:\t $s_p -----to------$e_p-----";
        print "size of data_path: $data_path_length\n";
        foreach $ele (@data_path) {
            print "ele------- $ele\n";
            my @data_path_elements = (split ('\n',$ele));
            my $l = @data_path_elements;
            print "----- Size of data path elements : $l\n";
            shift(@data_path_elements);
            foreach $x (@data_path_elements) {
                #@arr = (split (' ',$x));
                my $c = &get_first_elements_of_string($x);
                print "split--ele: $x\n";
                print "------\t$c\n";
                push (@A,$c);
            }
        }
        print "DATA-PATH-Elements: @A";
        #shift(@A);
        print "#### $#A\n";
        return @A;
    }
}
###############################################
sub get_patterns {
    my $sp = $_[0];
    my $ep = $_[0];
    foreach my $k (2 .. $#_) {
        if($_[$k] == -2) { next; }
        if($_[$k] == $ep +1) {
            $ep = $_[$k];
        }
        else {
            $sp = $_[$k];
            $ep = $_[$k];
        }
    }
}
###############################################
#open(fh, "timing_report_1.txt");
open(fh, "tim_icc_dec12b");
$/ = "Startpoint:";
my @result = ();
while (<fh>) {
    my @a = ();
    @a = &get_data_path_report ($_);
    my %seen = ();
    push (@result, @a);
    @result = grep { !$seen{$_}++ } @result;
}
shift(@result);
my $U_L = @result;
print "\n\nUNIQUE CELLS: @result ===== $U_L\n";
###############################################
my $k = 0;
foreach $h (@result) {
    print "====== $k -----> $h\n";
    $k++;
}
close(fh);
###############################################
my $u_l = @result;
my $i = 0;
my $max_score = -1;
my @score_board_matrix = ();
while($i < $u_l) {
    my $score_board_column = "";
    my $j = 0;
    while($j < $u_l) {
        $score = 0;
        open(fh, "tim_icc_dec12b");
        $/ = "Startpoint:";
        while(<fh>) {
            my @d = &get_data_path_report ($_);
            if($#d >= 1) {
                my $ss = &array_search("$result[$i]", @d);
                my $ee = &array_search("$result[$j]", @d);
                print " !!!! ### $ss ----- $result[$i] <----> $ee ----- $result[$j] #### !!!!!\n";
                if($ee == $ss+1) {
                    $score++;
                }
            }
        }
        close(fh);
        $score_board_column = $score_board_column." ".$score;
        if($score > $max_score) {
            $max_score = $score;
        }
        print "\n---------------------------$result[$i] $result[$j]----\t$score------> $score_board_column\n";
        $j++;
    }
    $i++;
    push (@score_board_matrix, $score_board_column);
    print "\n";
}
###############################################
print "######################\n@score_board_matrix\nMAX_SCORE: $max_score\n";
my $row = 0;
my @array_indexes = ();
foreach $column (@score_board_matrix) {
    my @re = split(" ",$column);
    my $index = &array_search($max_score,@re);
    print "\n----- $column------> $#re ------->$row,$index----> $result[$row]<------>$result[$index]\n";
    push (@array_indexes, "$index");
    $row++;
}
print "%%%%%%%%%%%%%%%%%%\n";
print "@array_indexes";
my $sp = $array_indexes[0];
my $ep = $array_indexes[0];
print "## $sp ---- $ep\n";
shift(@array_indexes);
print @array_indexes;
foreach $k (@array_indexes) {
    if($k == -2) { next; }
    if($k == $ep +1) {
        $ep = $k;
        next;
    }
    else {
        print "PATTERN: $sp ---- $ep";
        $e = $sp;
        while ($e <= $ep) {
            print "$result[$e]-->";
            $e++;
        }
        $sp = $k;
        $ep = $k;
    }
}

Let me add some detail to make the issue clearer. The code gets stuck at the following section:

my $u_l = @result;
my $i = 0;
my $max_score = -1;
my @score_board_matrix = ();
while($i < $u_l) {
    my $score_board_column = "";
    my $j = 0;
    while($j < $u_l) {
        $score = 0;
        open(fh, "tim_icc_dec12b");
        $/ = "Startpoint:";
        while(<fh>) {
            my @d = &get_data_path_report ($_);
            if($#d >= 1) {
                my $ss = &array_search("$result[$i]", @d);
                my $ee = &array_search("$result[$j]", @d);
                print " !!!! ### $ss ----- $result[$i] <----> $ee ----- $result[$j] #### !!!!!\n";
                if($ee == $ss+1) {
                    $score++;
                }
            }
        }
        close(fh);
        $score_board_column = $score_board_column." ".$score;
        if($score > $max_score) {
            $max_score = $score;
        }
        print "\n---------------------------$result[$i] $result[$j]----\t$score------> $score_board_column\n";
        $j++;
    }
    $i++;
    push (@score_board_matrix, $score_board_column);
    print "\n";
}
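To clarify what this section is supposed to compute: for every ordered pair of unique cells, it counts in how many paths the second cell is the line immediately after the first. Done in memory it would be roughly equivalent to the untested sketch below. The sketch assumes @paths already holds one array reference of first-column names per "Startpoint:" record, collected in a single pass over the report (see the extraction sketch further down); my current code does not do that, instead it reopens and rescans the whole report file for every (i, j) pair, which I suspect is where all the time goes.

# Untested sketch, in the spirit of the loop above but done in memory.
# Assumes @result holds the unique cell names and @paths holds one array
# reference of first-column names per "Startpoint:" block.
my @score_board_matrix;
my $max_score = -1;
for my $i (0 .. $#result) {
    my @row;
    for my $j (0 .. $#result) {
        my $score = 0;
        for my $path (@paths) {
            # first index of each cell in this path, -2 if absent
            my ($ss, $ee) = (-2, -2);
            for my $p (0 .. $#$path) {
                $ss = $p if $ss == -2 and $path->[$p] eq $result[$i];
                $ee = $p if $ee == -2 and $path->[$p] eq $result[$j];
            }
            $score++ if $ee == $ss + 1;   # $result[$j] directly follows $result[$i]
        }
        push @row, $score;
        $max_score = $score if $score > $max_score;
    }
    push @score_board_matrix, join ' ', @row;
}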

One sample record from the target data file is as follows:

Startpoint: sdram_clk (clock source 'SDRAM_CLK')
Endpoint: sd_DQ_out[6] (output port clocked by SD_DDR_CLK)
Path Group: COMBO
Path Type: max
Point                                              Fanout   Cap        Trans      Incr       Path
----------------------------------------------------------------------------------------------------
clock SDRAM_CLK (fall edge)                                                        3.750000   3.750000
sdram_clk (in)                                                         0.184922   0.065438 &  3.815438 f
sdram_clk (net)                                    17       0.124019              0.000000   3.815438 f
I_SDRAM_TOP/sdram_clk (SDRAM_TOP)                                                  0.000000   3.815438 f
I_SDRAM_TOP/sdram_clk (net)                                 0.124019              0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk (SDRAM_IF)                                        0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk (net)                      0.124019              0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/I (bufbd7)                       0.187810   0.013919 &  3.829357 f
I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/Z (bufbd7)                       0.233113   0.210904 &  4.040261 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (net)     45       0.175550              0.000000   4.040261 f
I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/S (mx02d4)                      0.234098   0.003310 &  4.043571 f
I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/Z (mx02d4)                      0.999121   0.776377   4.819948 f
I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6] (net)          1        0.475020              0.000000   4.819948 f
I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6] (SDRAM_IF)                                     0.000000   4.819948 f
I_SDRAM_TOP/sd_DQ_out[6] (net)                              0.475020              0.000000   4.819948 f
I_SDRAM_TOP/sd_DQ_out[6] (SDRAM_TOP)                                               0.000000   4.819948 f
sd_DQ_out[6] (net)                                          0.475020              0.000000   4.819948 f
sd_DQ_out[6] (out)                                                     0.999121   0.010237 &  4.830185 f
data arrival time                                                                             4.830185
clock SD_DDR_CLK (rise edge)                                                       7.500000   7.500000
clock network delay (ideal)                                                        1.598546   9.098545
clock uncertainty                                                                 -0.100000   8.998545
output external delay                                                             -2.000000   6.998545
data required time                                                                            6.998545
----------------------------------------------------------------------------------------------------
data required time                                                                            6.998545
data arrival time                                                                            -4.830185
----------------------------------------------------------------------------------------------------
slack (MET)                                                                                   2.168359

What I am trying to do is scan each path from "Endpoint:" down to "data arrival time", take the first column of every line in that section as one pattern, expand that pattern over a +/- 2 line range, and then scan each resulting pattern across the whole file (which contains about 1000 such paths like the example above) to find out how many times each pattern is repeated.
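Roughly, the extraction step I have in mind looks like the untested sketch below. It assumes the layout of the sample record above (a line of dashes just before the first path element and "data arrival time" just after the last one); the file name tim_icc_dec12b and the record separator are the same ones my script already uses.

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch: collect, for each "Startpoint:" record, the first column
# of every line between the dashed separator and "data arrival time".
my @paths;                                    # one array ref of names per path
open my $fh, '<', 'tim_icc_dec12b' or die "cannot open report: $!";
local $/ = "Startpoint:";                     # one timing path per record
while (my $record = <$fh>) {
    next unless $record =~ /Endpoint:.*?\n-+\n(.*?)\n\s*data arrival time/s;
    my @names = map  { (split ' ', $_, 2)[0] }   # first column of each line
                grep { /\S/ }                    # skip blank lines
                split /\n/, $1;
    push @paths, \@names if @names;
}
close $fh;
print scalar(@paths), " paths collected\n";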

I have tried the suggested workaround, but it is not working. Can you please help me with a correct set of code to resolve this issue?

Re: How to manage a big file
by Athanasius (Archbishop) on Apr 23, 2014 at 07:56 UTC

    Hello taj_ritesh,

    As Anonymous Monk has pointed out, your get_patterns sub operates on local variables only, and therefore accomplishes nothing.

    Here is another problem: You declare a variable as my @resilt = ();, but then attempt to access it as @result. If you had included:

    use strict;

    at the head of your script, Perl would have flagged this mistake for you (along with at least six other cases in which a variable is used without first being declared).
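    For illustration only, here is a tiny sketch (the variable names just mirror the typo discussed above) of the diagnostic strict produces:

    use strict;
    use warnings;

    my @resilt = ();       # the name that was declared
    push @result, 'x';     # the name actually used later
    # perl refuses to compile, with something like:
    #   Global symbol "@result" requires explicit package name at ... line 6.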

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to manage a big file
by AnomalousMonk (Archbishop) on Apr 23, 2014 at 08:24 UTC

    Some very discursive comments after a very brief inspection...

    my @resilt = ();
    while (<fh>) {
        my @a = ();
        @a = &get_data_path_report ($_);
        my %seen = ();
        push (@result, @a);
        @result = grep { !$seen{$_}++ } @result;
    }
    shift(@result);

    You seem to have used lexicals pretty consistently, but then undermined their use by not enabling strictures — and warnings for good measure. See warnings and strict. Add these two lines at the very start of your program
        use warnings;
        use strict;
    and then fix all the errors and warnings.

    push (@result, @a);
    @result = grep { !$seen{$_}++ } @result;

    The push statement is redundant. The statement
        @result = grep { !$seen{$_}++ } @a;
    would have the same effect, and a further simplification would be to use List::MoreUtils::uniq as in
        @result = uniq get_data_path_report();
    alone, all other statements in the while-loop being needless. The statement
        use List::MoreUtils qw(uniq);
    must be added at the start of the script to import uniq. (The function  get_data_path_report() does not need to have  $_ passed to it because the function takes no arguments — as far as I can see by quick inspection.)

    The function

    sub get_first_elements_of_string {
        my @a = (split (' ' ,"$_[0]"));
        return $a[0];
    }
    could be re-written (untested)
    sub get_first_elements_of_string {
        my ($first) = split ' ', $_[0], 2;
        return $first;
    }
    (see split for the LIMIT parameter) which will not change its effect, but may improve its performance.

    Updates:

    1. Also WRT the  while (<fh>) { ... } loop: As it stands in the OP, this loop reads and processes the entire file, but then only keeps the final record read for further processing. Is this what you intended? Did you perhaps intend something like
          push @result, uniq get_data_path_report();
      instead?
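    If that is what was intended, the whole loop might reduce to something like the following untested sketch. It assumes your get_data_path_report sub is in scope and has been cleaned up so that it compiles under strictures:

        use strict;
        use warnings;
        use List::MoreUtils qw(uniq);

        # Untested sketch of the accumulate-then-dedupe pattern discussed above.
        open my $fh, '<', 'tim_icc_dec12b' or die "cannot open report: $!";
        local $/ = "Startpoint:";

        my @result;
        while (<$fh>) {
            push @result, get_data_path_report($_);  # collect every record's names
        }
        close $fh;

        @result = uniq @result;                      # de-duplicate once, at the end
        print scalar(@result), " unique cells\n";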

Re: How to manage a big file
by Lennotoecom (Pilgrim) on Apr 23, 2014 at 07:47 UTC
    please add this to your question:
    1. A few lines of an original file with the essential data changed, so we won't steal anything.
    2. What exactly do you want to extract out of it.
    3. How exactly do you want to store the results.
    Thank you.
    UPDATE
    So you start with a file containing blocks like this
    [the same Startpoint/Endpoint timing-path block quoted in the original post above]
    from which you want to extract this
    clock SDRAM_CLK
    sdram_clk
    sdram_clk
    I_SDRAM_TOP/sdram_clk
    I_SDRAM_TOP/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/I
    I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/Z
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16
    I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/S
    I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/Z
    I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6]
    I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6]
    I_SDRAM_TOP/sd_DQ_out[6]
    I_SDRAM_TOP/sd_DQ_out[6]
    sd_DQ_out[6]
    sd_DQ_out[6]
    which is a sequence that you want to find throughout your entire file and count? Did I get that right?

      Now each line is a potential pattern, and we need to make four more patterns from it by adding the next two and the previous two lines. For example:

      Pattern (1): I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (consider this as the seed pattern)

      Pattern (2):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 ( Seed +2)

      Pattern (3):- I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (seed +1)

      Pattern (4):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6 (seed -1)

      Pattern (5):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6 I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out6 (seed -2)

      Now these five patterns need to be scanned for in the master file, and the occurrence count of each pattern printed.
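      To make the intent concrete, here is a rough, untested sketch of how I picture building the five patterns from one seed line and counting them. The names @names (the first-column list of one path), $seed_index (the index of the seed line in it) and @paths (the name lists of all paths in the master file) are placeholders for data collected elsewhere.

      # Untested sketch. @names is one path's first-column name list, $seed_index
      # the index of the seed line, @paths the name lists of every path.
      sub seed_patterns {
          my ($names, $s) = @_;
          my @patterns;
          # seed alone, then seed with previous 2, previous 1, next 1, next 2 lines
          for my $span ([$s, $s], [$s - 2, $s], [$s - 1, $s], [$s, $s + 1], [$s, $s + 2]) {
              my ($lo, $hi) = @$span;
              next if $lo < 0 or $hi > $#$names;      # span falls off the path
              push @patterns, join ' ', @{$names}[$lo .. $hi];
          }
          return @patterns;
      }

      my @patterns = seed_patterns(\@names, $seed_index);
      my %count    = map { $_ => 0 } @patterns;
      for my $path (@paths) {
          my $joined = ' ' . join(' ', @$path) . ' ';
          for my $pattern (@patterns) {
              # whole-token match of the consecutive run; each path counted at most once
              $count{$pattern}++ if index($joined, " $pattern ") >= 0;
          }
      }
      print "$_ : $count{$_}\n" for @patterns;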

        while thinking about your last addition,
        consider this
        open IN,'<1';
        while (<IN>) {
            $a = 1 if /Endpoint/;
            $b = 1 if /-/ and $a;
            if($a and $b){
                if(/^\s+(.+)\s\(/){
                    $h{$1}{'x'} = ++$x if ! exists $h{$1};
                    $r .= $h{$1}{'x'}.".";
                }
            }
            $a = 0, $b = 0, ++$r{$r}, $r = '' if /data arrival time/ and $b;
        }
        close IN;
        foreach (sort keys %r) {
            print "$_: $r{$_}\n";
        }
        algorithm:
        find /Endpoint/
        find /-/
        assign every line pattern a unique code from 1 towards infinity
        build the string of codes for the whole path and count it as one occurrence
        find /data arrival time/
        which will result in something like this:
        1.2.2.3.3.4.4.5.6.7.8.9.10.10.11.11.12.12.: 8
        which means I copied your example file eight times in my '1' file
        update
        regarding your patterns:
        they are quite confusing
        why did you pick the middle line to form your pattern?
        Please consider this snippet above, maybe it will suit your needs as well?
        update 2
        have you tried this solution? Has it worked? Please tell.
Re: How to manage a big file
by Anonymous Monk on Apr 23, 2014 at 06:48 UTC

    Can you please check my code and help me fix it so that it runs quickly

    Not really, the purpose of the program is unclear .... you have included sub get_patterns but it does nothing

    It would help if you could explain what problem the program is supposed to solve, and using what steps ... maybe provide short representative sample input ... things like that

Re: How to manage a pattern matching & counting with big data file
by Anonymous Monk on Apr 23, 2014 at 14:05 UTC
    I believe also that you are passing an entire array around to array_search() instead of passing a reference to it . . .
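    Something along these lines (untested sketch, keeping the original's -2 "not found" convention) would avoid copying the whole list into @_ on every call:

    # Untested sketch: take an array reference instead of a flattened list,
    # so the caller's array is not copied on every call.
    sub array_search {
        my ($elem, $arr) = @_;           # $arr is an array reference
        for my $n (0 .. $#$arr) {
            return $n if $arr->[$n] eq $elem;
        }
        return -2;                       # same "not found" value as the original
    }

    # called as, e.g.:
    # my $ss = array_search($result[$i], \@d);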