taj_ritesh has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I have a script that generates customized patterns and counts the number of occurrences of a selected pattern.

I am running it on a file of about 15000 pages, and the problem is that it runs for a very long time and never completes.

Can you please check my code and help me fix it so that it runs quickly and completes?

#!/usr/bin/perl
###############################################
sub get_first_elements_of_string {
    my @a = (split (' ' ,"$_[0]"));
    return $a[0];
}
###############################################
sub array_search {
    my ($elem, @arr) = @_;
    my $flag = -2;
    $mn = 0;
    foreach $n (@arr) {
        if ($n eq $elem) {
            $flag = $mn;
            last;
        }
        $mn++;
    }
    return $flag;
}
###############################################
sub get_data_path_report {
    my @spep = (m/ (\S+).*Endpoint: (\S+).*/msg);
    my $s_p = $spep[0];
    my $e_p = $spep[1];
    my @data_path = (m/ .*?Endpoint: .*?$s_p(.*?)$e_p.*?data arrival time/msg);
    my $data_path_length = @data_path;
    my @A = ();
    if($data_path_length == 1) {
        print "\npath:\t $s_p -----to------$e_p-----";
        print "size of data_path: $data_path_length\n";
        foreach $ele (@data_path) {
            print "ele------- $ele\n";
            my @data_path_elements = (split ('\n',$ele));
            my $l = @data_path_elements;
            print "----- Size of data path elements : $l\n";
            shift(@data_path_elements);
            foreach $x (@data_path_elements) {
                #@arr = (split (' ',$x));
                my $c = &get_first_elements_of_string($x);
                print "split--ele: $x\n";
                print "------\t$c\n";
                push (@A,$c);
            }
        }
        print "DATA-PATH-Elements: @A";
        #shift(@A);
        print "#### $#A\n";
        return @A;
    }
}
###############################################
sub get_patterns {
    my $sp = $_[0];
    my $ep = $_[0];
    foreach my $k (2 .. $#_) {
        if($_[$k] == -2) { next; }
        if($_[$k] == $ep +1) {
            $ep = $_[$k];
        }
        else {
            $sp = $_[$k];
            $ep = $_[$k];
        }
    }
}
###############################################
#open(fh, "timing_report_1.txt");
open(fh, "tim_icc_dec12b");
$/ = "Startpoint:";
my @result = ();
while (<fh>) {
    my @a = ();
    @a = &get_data_path_report ($_);
    my %seen = ();
    push (@result, @a);
    @result = grep { !$seen{$_}++ } @result;
}
shift(@result);
my $U_L = @result;
print "\n\nUNIQUE CELLS: @result ===== $U_L\n";
###############################################
my $k = 0;
foreach $h (@result) {
    print "====== $k -----> $h\n";
    $k++;
}
close(fh);
###############################################
my $u_l = @result;
my $i = 0;
my $max_score = -1;
my @score_board_matrix = ();
while($i < $u_l) {
    my $score_board_column = "";
    my $j = 0;
    while($j < $u_l) {
        $score = 0;
        open(fh, "tim_icc_dec12b");
        $/ = "Startpoint:";
        while(<fh>) {
            my @d = &get_data_path_report ($_);
            if($#d >= 1) {
                my $ss = &array_search("$result[$i]", @d);
                my $ee = &array_search("$result[$j]", @d);
                print " !!!! ### $ss ----- $result[$i] <----> $ee ----- $result[$j] #### !!!!!\n";
                if($ee == $ss+1) {
                    $score++;
                }
            }
        }
        close(fh);
        $score_board_column = $score_board_column." ".$score;
        if($score > $max_score) {
            $max_score = $score;
        }
        print "\n---------------------------$result[$i] $result[$j]----\t$score------> $score_board_column\n";
        $j++;
    }
    $i++;
    push (@score_board_matrix, $score_board_column);
    print "\n";
}
###############################################
print "######################\n@score_board_matrix\nMAX_SCORE: $max_score\n";
my $row = 0;
my @array_indexes = ();
foreach $column (@score_board_matrix) {
    my @re = split(" ",$column);
    my $index = &array_search($max_score,@re);
    print "\n----- $column------> $#re ------->$row,$index----> $result[$row]<------>$result[$index]\n";
    push (@array_indexes, "$index");
    $row++;
}
print "%%%%%%%%%%%%%%%%%%\n";
print "@array_indexes";
my $sp = $array_indexes[0];
my $ep = $array_indexes[0];
print "## $sp ---- $ep\n";
shift(@array_indexes);
print @array_indexes;
foreach $k (@array_indexes) {
    if($k == -2) { next; }
    if($k == $ep +1) {
        $ep = $k;
        next;
    }
    else {
        print "PATTERN: $sp ---- $ep";
        $e = $sp;
        while ($e <= $ep) {
            print "$result[$e]-->";
            $e++;
        }
        $sp = $k;
        $ep = $k;
    }
}

Let me add some detail to make the issue clearer. The code gets stuck at the following section:

my $u_l = @result;
my $i = 0;
my $max_score = -1;
my @score_board_matrix = ();
while($i < $u_l) {
    my $score_board_column = "";
    my $j = 0;
    while($j < $u_l) {
        $score = 0;
        open(fh, "tim_icc_dec12b");
        $/ = "Startpoint:";
        while(<fh>) {
            my @d = &get_data_path_report ($_);
            if($#d >= 1) {
                my $ss = &array_search("$result[$i]", @d);
                my $ee = &array_search("$result[$j]", @d);
                print " !!!! ### $ss ----- $result[$i] <----> $ee ----- $result[$j] #### !!!!!\n";
                if($ee == $ss+1) {
                    $score++;
                }
            }
        }
        close(fh);
        $score_board_column = $score_board_column." ".$score;
        if($score > $max_score) {
            $max_score = $score;
        }
        print "\n---------------------------$result[$i] $result[$j]----\t$score------> $score_board_column\n";
        $j++;
    }
    $i++;
    push (@score_board_matrix, $score_board_column);
    print "\n";
}
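To clarify what this section is supposed to compute: for every ordered pair of unique cells, it counts in how many paths the second cell is the line immediately after the first. Done in memory it would be roughly equivalent to the untested sketch below. The sketch assumes @paths already holds one array reference of first-column names per "Startpoint:" record, collected in a single pass over the report (see the extraction sketch further down); my current code does not do that, instead it reopens and rescans the whole report file for every (i, j) pair, which I suspect is where all the time goes.

# Untested sketch, in the spirit of the loop above but done in memory.
# Assumes @result holds the unique cell names and @paths holds one array
# reference of first-column names per "Startpoint:" block.
my @score_board_matrix;
my $max_score = -1;
for my $i (0 .. $#result) {
    my @row;
    for my $j (0 .. $#result) {
        my $score = 0;
        for my $path (@paths) {
            # first index of each cell in this path, -2 if absent
            my ($ss, $ee) = (-2, -2);
            for my $p (0 .. $#$path) {
                $ss = $p if $ss == -2 and $path->[$p] eq $result[$i];
                $ee = $p if $ee == -2 and $path->[$p] eq $result[$j];
            }
            $score++ if $ee == $ss + 1;   # $result[$j] directly follows $result[$i]
        }
        push @row, $score;
        $max_score = $score if $score > $max_score;
    }
    push @score_board_matrix, join ' ', @row;
}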

One sample record from the target data file is as follows:

Startpoint: sdram_clk (clock source 'SDRAM_CLK')
Endpoint: sd_DQ_out[6] (output port clocked by SD_DDR_CLK)
Path Group: COMBO
Path Type: max
Point                                              Fanout   Cap        Trans      Incr       Path
----------------------------------------------------------------------------------------------------
clock SDRAM_CLK (fall edge)                                                        3.750000   3.750000
sdram_clk (in)                                                         0.184922   0.065438 &  3.815438 f
sdram_clk (net)                                    17       0.124019              0.000000   3.815438 f
I_SDRAM_TOP/sdram_clk (SDRAM_TOP)                                                  0.000000   3.815438 f
I_SDRAM_TOP/sdram_clk (net)                                 0.124019              0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk (SDRAM_IF)                                        0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk (net)                      0.124019              0.000000   3.815438 f
I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/I (bufbd7)                       0.187810   0.013919 &  3.829357 f
I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/Z (bufbd7)                       0.233113   0.210904 &  4.040261 f
I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (net)     45       0.175550              0.000000   4.040261 f
I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/S (mx02d4)                      0.234098   0.003310 &  4.043571 f
I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/Z (mx02d4)                      0.999121   0.776377   4.819948 f
I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6] (net)          1        0.475020              0.000000   4.819948 f
I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6] (SDRAM_IF)                                     0.000000   4.819948 f
I_SDRAM_TOP/sd_DQ_out[6] (net)                              0.475020              0.000000   4.819948 f
I_SDRAM_TOP/sd_DQ_out[6] (SDRAM_TOP)                                               0.000000   4.819948 f
sd_DQ_out[6] (net)                                          0.475020              0.000000   4.819948 f
sd_DQ_out[6] (out)                                                     0.999121   0.010237 &  4.830185 f
data arrival time                                                                             4.830185
clock SD_DDR_CLK (rise edge)                                                       7.500000   7.500000
clock network delay (ideal)                                                        1.598546   9.098545
clock uncertainty                                                                 -0.100000   8.998545
output external delay                                                             -2.000000   6.998545
data required time                                                                            6.998545
----------------------------------------------------------------------------------------------------
data required time                                                                            6.998545
data arrival time                                                                            -4.830185
----------------------------------------------------------------------------------------------------
slack (MET)                                                                                   2.168359

What I am trying to do is scan each path from "Endpoint:" down to "data arrival time", take the first column of every line in that section as one pattern, expand that pattern over a +/- 2 line range, and then scan each resulting pattern across the whole file (which contains about 1000 such paths like the example above) to find out how many times each pattern is repeated.
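Roughly, the extraction step I have in mind looks like the untested sketch below. It assumes the layout of the sample record above (a line of dashes just before the first path element and "data arrival time" just after the last one); the file name tim_icc_dec12b and the record separator are the same ones my script already uses.

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch: collect, for each "Startpoint:" record, the first column
# of every line between the dashed separator and "data arrival time".
my @paths;                                    # one array ref of names per path
open my $fh, '<', 'tim_icc_dec12b' or die "cannot open report: $!";
local $/ = "Startpoint:";                     # one timing path per record
while (my $record = <$fh>) {
    next unless $record =~ /Endpoint:.*?\n-+\n(.*?)\n\s*data arrival time/s;
    my @names = map  { (split ' ', $_, 2)[0] }   # first column of each line
                grep { /\S/ }                    # skip blank lines
                split /\n/, $1;
    push @paths, \@names if @names;
}
close $fh;
print scalar(@paths), " paths collected\n";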

I have tried the suggested workaround, but it is not working. Can you please help me with a correct set of code to resolve this issue?

Re: How to manage a big file
by Athanasius (Archbishop) on Apr 23, 2014 at 07:56 UTC

    Hello taj_ritesh,

    As Anonymous Monk has pointed out, your get_patterns sub operates on local variables only, and therefore accomplishes nothing.

    Here is another problem: You declare a variable as my @resilt = ();, but then attempt to access it as @result. If you had included:

    use strict;

    at the head of your script, Perl would have flagged this mistake for you (along with at least six other cases in which a variable is used without first being declared).
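    For illustration only, here is a tiny sketch (the variable names just mirror the typo discussed above) of the diagnostic strict produces:

    use strict;
    use warnings;

    my @resilt = ();       # the name that was declared
    push @result, 'x';     # the name actually used later
    # perl refuses to compile, with something like:
    #   Global symbol "@result" requires explicit package name at ... line 6.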

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to manage a big file
by AnomalousMonk (Archbishop) on Apr 23, 2014 at 08:24 UTC

    Some very discursive comments after a very brief inspection...

    my @resilt = ();
    while (<fh>) {
        my @a = ();
        @a = &get_data_path_report ($_);
        my %seen = ();
        push (@result, @a);
        @result = grep { !$seen{$_}++ } @result;
    }
    shift(@result);

    You seem to have used lexicals pretty consistently, but then undermined their use by not enabling strictures — and warnings for good measure. See warnings and strict. Add these two lines at the very start of your program
        use warnings;
        use strict;
    and then fix all the errors and warnings.

    push (@result, @a);
    @result = grep { !$seen{$_}++ } @result;

    The push statement is redundant. The statement
        @result = grep { !$seen{$_}++ } @a;
    would have the same effect, and a further simplification would be to use List::MoreUtils::uniq as in
        @result = uniq get_data_path_report();
    alone, all other statements in the while-loop being needless. The statement
        use List::MoreUtils qw(uniq);
    must be added at the start of the script to import uniq. (The function  get_data_path_report() does not need to have  $_ passed to it because the function takes no arguments — as far as I can see by quick inspection.)

    The function

    sub get_first_elements_of_string {
        my @a = (split (' ' ,"$_[0]"));
        return $a[0];
    }
    could be re-written (untested)
    sub get_first_elements_of_string {
        my ($first) = split ' ', $_[0], 2;
        return $first;
    }
    (see split for the LIMIT parameter) which will not change its effect, but may improve its performance.

    Updates:

    1. Also WRT the  while (<fh>) { ... } loop: As it stands in the OP, this loop reads and processes the entire file, but then only keeps the final record read for further processing. Is this what you intended? Did you perhaps intend something like
          push @result, uniq get_data_path_report();
      instead?
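    If that is what was intended, the whole loop might reduce to something like the following untested sketch. It assumes your get_data_path_report sub is in scope and has been cleaned up so that it compiles under strictures:

        use strict;
        use warnings;
        use List::MoreUtils qw(uniq);

        # Untested sketch of the accumulate-then-dedupe pattern discussed above.
        open my $fh, '<', 'tim_icc_dec12b' or die "cannot open report: $!";
        local $/ = "Startpoint:";

        my @result;
        while (<$fh>) {
            push @result, get_data_path_report($_);  # collect every record's names
        }
        close $fh;

        @result = uniq @result;                      # de-duplicate once, at the end
        print scalar(@result), " unique cells\n";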

Re: How to manage a big file
by Lennotoecom (Pilgrim) on Apr 23, 2014 at 07:47 UTC
    please add this to your question:
    1. A few lines of an original file with the essential data changed, so we won't steal anything.
    2. What exactly do you want to extract out of it.
    3. How exactly do you want to store the results.
    Thank you.
    UPDATE
    So you start with a file containing blocks like this
    [the same Startpoint/Endpoint timing-path block quoted in the original post above]
    from which you want to extract this
    clock SDRAM_CLK
    sdram_clk
    sdram_clk
    I_SDRAM_TOP/sdram_clk
    I_SDRAM_TOP/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk
    I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/I
    I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16/Z
    I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16
    I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/S
    I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6/Z
    I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6]
    I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out[6]
    I_SDRAM_TOP/sd_DQ_out[6]
    I_SDRAM_TOP/sd_DQ_out[6]
    sd_DQ_out[6]
    sd_DQ_out[6]
    which is a sequence that you want to find throughout your entire file and count? Did I get that right?

      Now each line is a potential pattern, and we need to make four more patterns from it by adding the next two and the previous two lines. For example:

      Pattern (1): I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (consider this as the seed pattern)

      Pattern (2):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 ( Seed +2)

      Pattern (3):- I_SDRAM_TOP/I_SDRAM_IF/bufbdf_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 (seed +1)

      Pattern (4):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6 (seed -1)

      Pattern (5):- I_SDRAM_TOP/I_SDRAM_IF/sdram_clk_G1B1I16 I_SDRAM_TOP/I_SDRAM_IF/sd_mux_dq_out_6 I_SDRAM_TOP/I_SDRAM_IF/sd_DQ_out6 (seed -2)

      Now these five patterns need to be scanned for in the master file, and the occurrence count of each pattern printed.
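      To make the intent concrete, here is a rough, untested sketch of how I picture building the five patterns from one seed line and counting them. The names @names (the first-column list of one path), $seed_index (the index of the seed line in it) and @paths (the name lists of all paths in the master file) are placeholders for data collected elsewhere.

      # Untested sketch. @names is one path's first-column name list, $seed_index
      # the index of the seed line, @paths the name lists of every path.
      sub seed_patterns {
          my ($names, $s) = @_;
          my @patterns;
          # seed alone, then seed with previous 2, previous 1, next 1, next 2 lines
          for my $span ([$s, $s], [$s - 2, $s], [$s - 1, $s], [$s, $s + 1], [$s, $s + 2]) {
              my ($lo, $hi) = @$span;
              next if $lo < 0 or $hi > $#$names;      # span falls off the path
              push @patterns, join ' ', @{$names}[$lo .. $hi];
          }
          return @patterns;
      }

      my @patterns = seed_patterns(\@names, $seed_index);
      my %count    = map { $_ => 0 } @patterns;
      for my $path (@paths) {
          my $joined = ' ' . join(' ', @$path) . ' ';
          for my $pattern (@patterns) {
              # whole-token match of the consecutive run; each path counted at most once
              $count{$pattern}++ if index($joined, " $pattern ") >= 0;
          }
      }
      print "$_ : $count{$_}\n" for @patterns;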

        while thinking about your last addition,
        consider this
        open IN,'<1';
        while (<IN>) {
            $a = 1 if /Endpoint/;
            $b = 1 if /-/ and $a;
            if($a and $b){
                if(/^\s+(.+)\s\(/){
                    $h{$1}{'x'} = ++$x if ! exists $h{$1};
                    $r .= $h{$1}{'x'}.".";
                }
            }
            $a = 0, $b = 0, ++$r{$r}, $r = '' if /data arrival time/ and $b;
        }
        close IN;
        foreach (sort keys %r) {
            print "$_: $r{$_}\n";
        }
        algorithm:
        find /Endpoint/
        find /-/
        assign every line pattern a unique code from 1 towards infinity
        build the string of codes for the whole path and count it as one occurrence
        find /data arrival time/
        which will result in something like this:
        1.2.2.3.3.4.4.5.6.7.8.9.10.10.11.11.12.12.: 8
        which means I copied your example file eight times in my '1' file
        update
        regarding your patterns:
        they are quite confusing
        why did you pick the middle line to form your pattern?
        Please consider this snippet above, maybe it will suit your needs as well?
        update 2
        have you tried this solution? Has it worked? Please tell.
Re: How to manage a big file
by Anonymous Monk on Apr 23, 2014 at 06:48 UTC

    Can you please check my code and help me fix it so that it runs quickly

    Not really, the purpose of the program is unclear .... you have included sub get_patterns but it does nothing

    It would help if you could explain what problem the program is supposed to solve, and using what steps ... maybe provide short representative sample input ... things like that

Re: How to manage a pattern matching & counting with big data file
by Anonymous Monk on Apr 23, 2014 at 14:05 UTC
    I believe also that you are passing an entire array around to array_search() instead of passing a reference to it . . .
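    Something along these lines (untested sketch, keeping the original's -2 "not found" convention) would avoid copying the whole list into @_ on every call:

    # Untested sketch: take an array reference instead of a flattened list,
    # so the caller's array is not copied on every call.
    sub array_search {
        my ($elem, $arr) = @_;           # $arr is an array reference
        for my $n (0 .. $#$arr) {
            return $n if $arr->[$n] eq $elem;
        }
        return -2;                       # same "not found" value as the original
    }

    # called as, e.g.:
    # my $ss = array_search($result[$i], \@d);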