perlpal has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to process the output of a command which is in the format of a table. Depending on the specified counter, I need to extract the entries corresponding to that counter.

An output instance is shown below:

Timestamp       10.72.184.159:cpu_busy
--------------------------------------------------
2010-01-05 22:49:03     1.707
2010-01-05 22:50:04     1.753
2010-01-05 22:51:03     1.994
2010-01-05 22:52:03     1.726
2010-01-05 22:53:03     1.783
2010-01-05 22:54:03     1.733
2010-01-05 22:55:03     1.742
2010-01-05 22:56:03     1.902
2010-01-05 22:57:03     1.902

Timestamp       10.72.184.159:disk_data_written
-----------------------------------------------------
2010-01-05 22:49:03     47.467
2010-01-05 22:50:04     43.148
2010-01-05 22:51:03     47.186
2010-01-05 22:52:03     45.867
2010-01-05 22:53:03     47.333
2010-01-05 22:54:03     47.067
2010-01-05 22:55:03     42.400
2010-01-05 22:56:03     46.533

Timestamp       10.72.184.159:disk_data_read
---------------------------------------------------------
2010-01-05 22:49:03     13.467
2010-01-05 22:50:04     10.557
2010-01-05 22:51:03     10.712
2010-01-05 22:52:03     10.733
2010-01-05 22:53:03     10.667
2010-01-05 22:54:03     12.667
2010-01-05 22:55:03     10.133
2010-01-05 22:56:03     10.000
2010-01-05 22:57:03     10.133

To extract the data for the counter "10.72.184.159:disk_data_written", I have written the following code, where $cmd_out contains the command output as a string:

my @out_arr = split (/\n/,$cmd_out);
my @ts_arr;
my $switch = 0;
foreach (@out_arr){
    if (/.*?10.72.184.159:disk_data_written.*/){
        $switch = 1;
    }
    if(($switch == 1) && ($_ !~ /^\s*$/)){
        push @ts_arr,$_;
    }
    if(($switch == 1) && (/^\s*$/)){
        last;
    }
}
print "\n The timestamp array consists of : \n";
print @ts_arr;

Is there a more optimized way to achieve the same? Thanks in advance.

Replies are listed 'Best First'.
Re: Optimum method to perform data extraction in a table
by shmem (Chancellor) on Jan 07, 2010 at 11:46 UTC
    Is there a more optimized way to achieve the same?

    Yes, there is. See Range Operators:

foreach (@out_arr){
    if (/.*?10.72.184.159:disk_data_written.*/ .. /^\s*$/){
        push @ts_arr, $_;
    }
}
Re: Optimum method to perform data extraction in a table
by moritz (Cardinal) on Jan 07, 2010 at 11:38 UTC
    If the data file is much larger than what you have shown us, your program might benefit from reading from the source line by line (instead of reading the whole file at once and splitting it on newlines, as you currently do).

    Apart from that, I see some potential for micro-optimizations, like using index instead of some regexes - but is the trouble really worth it? Is the program actually slow? It doesn't look to me like it should be slow.
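    For illustration, here is a minimal sketch of both ideas combined - reading the command output line by line through a piped open (the command name below is just a placeholder) and locating the counter with index instead of a regex:

use strict;
use warnings;

my $counter = '10.72.184.159:disk_data_written';

# Hypothetical piped open: read the command output line by line
# instead of capturing it all into $cmd_out and splitting on newlines.
open my $fh, '-|', 'some_command' or die "Cannot run command: $!";

my @ts_arr;
my $in_record = 0;
while ( my $line = <$fh> ) {
    $in_record = 1 if index( $line, $counter ) >= 0;  # literal match, no regex
    last if $in_record and $line =~ /^\s*$/;          # a blank line ends the record
    push @ts_arr, $line if $in_record;
}
close $fh;

print "\n The timestamp array consists of : \n";
print @ts_arr;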

    Perl 6 - links to (nearly) everything that is Perl 6.

      Currently the program is not slow, but as you pointed out, the data file entries increase by leaps and bounds over time, and the program does become slow as they grow.

      Thank you for the inputs.

Re: Optimum method to perform data extraction in a table
by marto (Cardinal) on Jan 07, 2010 at 11:50 UTC

    Is this your entire program? Have you profiled it using something like Devel::NYTProf?
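
    For what it's worth, profiling with Devel::NYTProf usually just means running the script under the profiler and then turning the result into an HTML report (the script name below is a placeholder):

perl -d:NYTProf extract_counters.pl   # writes ./nytprof.out
nytprofhtml                           # builds an HTML report under ./nytprof/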

    Martin

Re: Optimum method to perform data extraction in a table
by thundergnat (Deacon) on Jan 07, 2010 at 14:34 UTC

    Since you are looking for paragraphs of information, try reading the input in paragraph mode.

    Make sure to localize $/ to prevent problems with other IO.

use warnings;
use strict;

my $ts = '10.72.184.159:disk_data_written';

{
    local $/ = ''; # paragraph mode
    while ( my $para = <DATA> ) {
        next unless $para =~ /$ts/;
        $para =~ s/^Timestamp.+\n//;
        $para =~ s/^--+//;
        print "\nThe timestamp array for \"$ts\" consists of :\n$para";
    }
}

__DATA__
Timestamp       10.72.184.159:cpu_busy
--------------------------------------------------
2010-01-05 22:49:03     1.707
2010-01-05 22:50:04     1.753
2010-01-05 22:51:03     1.994
2010-01-05 22:52:03     1.726
2010-01-05 22:53:03     1.783
2010-01-05 22:54:03     1.733
2010-01-05 22:55:03     1.742
2010-01-05 22:56:03     1.902
2010-01-05 22:57:03     1.902

Timestamp       10.72.184.159:disk_data_written
-----------------------------------------------------
2010-01-05 22:49:03     47.467
2010-01-05 22:50:04     43.148
2010-01-05 22:51:03     47.186
2010-01-05 22:52:03     45.867
2010-01-05 22:53:03     47.333
2010-01-05 22:54:03     47.067
2010-01-05 22:55:03     42.400
2010-01-05 22:56:03     46.533

Timestamp       10.72.184.159:disk_data_read
---------------------------------------------------------
2010-01-05 22:49:03     13.467
2010-01-05 22:50:04     10.557
2010-01-05 22:51:03     10.712
2010-01-05 22:52:03     10.733
2010-01-05 22:53:03     10.667
2010-01-05 22:54:03     12.667
2010-01-05 22:55:03     10.133
2010-01-05 22:56:03     10.000
2010-01-05 22:57:03     10.133
Re: Optimum method to perform data extraction in a table
by Marshall (Canon) on Jan 08, 2010 at 23:44 UTC
    I found your program logic a bit overly complex. In particular, the $switch stuff is not needed; I show another way below. Your code does work, but you asked for suggestions, so I feel free to offer some. I also incorporated moritz's suggestion to process the data line by line instead of building an array.

    I call a subroutine to print the record instead of using a local variable like $switch to keep track of whether or not we are inside a record. My "while" statement in print_record() may look a bit obtuse, but it just loops until it hits a line containing only whitespace (a blank line).

    You asked: "Is there a more optimized way to achieve the same?" To me the most important optimization you can make is clarity of the code.

    A few performance comments: the code below will run basically as fast as you can read in the input data file. If you want to stop reading the input after a single record is found, there are a number of ways to do that, such as simply putting an exit(0) in the print_record() sub.

    There are a number of ways to optimize and organize a multiple-record search in a single pass through the data file (a sketch of that idea follows the code below), but I'm not sure that is a requirement for you. The single most important performance issue when dealing with a large sequential data set is how many times you have to read it.

#!/usr/bin/perl -w
use strict;

while ( <DATA> ) {
    print_record ($_) if /.*?10.72.184.159:disk_data_written.*/;
}

sub print_record {
    my $time_stamp_header = shift;

    print " The timestamp array consists of :\n".
          "$time_stamp_header";
    while ( (my $line =<DATA>) !~ /^\s*$/) {
        print $line;   #prints lines until a blank line
    }
}

=output of the above

 The timestamp array consists of :
Timestamp       10.72.184.159:disk_data_written
-----------------------------------------------------
2010-01-05 22:49:03     47.467
2010-01-05 22:50:04     43.148
2010-01-05 22:51:03     47.186
2010-01-05 22:52:03     45.867
2010-01-05 22:53:03     47.333
2010-01-05 22:54:03     47.067
2010-01-05 22:55:03     42.400
2010-01-05 22:56:03     46.533

=cut

__DATA__
Timestamp       10.72.184.159:cpu_busy
--------------------------------------------------
2010-01-05 22:49:03     1.707
2010-01-05 22:50:04     1.753
2010-01-05 22:51:03     1.994
2010-01-05 22:52:03     1.726
2010-01-05 22:53:03     1.783
2010-01-05 22:54:03     1.733
2010-01-05 22:55:03     1.742
2010-01-05 22:56:03     1.902
2010-01-05 22:57:03     1.902

Timestamp       10.72.184.159:disk_data_written
-----------------------------------------------------
2010-01-05 22:49:03     47.467
2010-01-05 22:50:04     43.148
2010-01-05 22:51:03     47.186
2010-01-05 22:52:03     45.867
2010-01-05 22:53:03     47.333
2010-01-05 22:54:03     47.067
2010-01-05 22:55:03     42.400
2010-01-05 22:56:03     46.533

Timestamp       10.72.184.159:disk_data_read
---------------------------------------------------------
2010-01-05 22:49:03     13.467
2010-01-05 22:50:04     10.557
2010-01-05 22:51:03     10.712
2010-01-05 22:52:03     10.733
2010-01-05 22:53:03     10.667
2010-01-05 22:54:03     12.667
2010-01-05 22:55:03     10.133
2010-01-05 22:56:03     10.000
2010-01-05 22:57:03     10.133
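
    As a hedged sketch of the single-pass, multiple-record idea mentioned above (the counter list and the input file name are assumptions for illustration, not part of the original code), one could collect every record of interest into a hash during a single read of the data:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of counters to extract in a single pass
my @wanted = (
    '10.72.184.159:disk_data_written',
    '10.72.184.159:disk_data_read',
);

my %records;   # counter name => reference to an array of record lines
my $current;   # counter of the record we are currently inside, if any

# 'counters.txt' is a placeholder for the real data source
open my $fh, '<', 'counters.txt' or die "Cannot open counters.txt: $!";

while ( my $line = <$fh> ) {
    if ( $line =~ /^Timestamp\s+(\S+)/ ) {
        my $counter = $1;
        # start collecting only if this record's counter is wanted
        $current = ( grep { $_ eq $counter } @wanted ) ? $counter : undef;
    }
    elsif ( $line =~ /^\s*$/ ) {
        $current = undef;   # a blank line ends the current record
    }
    push @{ $records{$current} }, $line if defined $current;
}
close $fh;

for my $counter (@wanted) {
    print "\nEntries for $counter:\n";
    print @{ $records{$counter} || [] };
}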