PerlMonks  

Extracting data from a messy file (slow performance)

by acidblood (Novice)
on Aug 06, 2008 at 19:54 UTC ( #702721=perlquestion )

acidblood has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow Monks!

I've searched PerlMonks and Google and couldn't find a solution. I'm still very new to Perl, so a lot of the commands are new to me...

I have a text file that basically contains stats, and I need to extract specific data from it. I have written a shell script which works, but it takes 2 minutes to extract 1 record.

I'm running Cygwin on top of Vista - I know that's not exactly what you hoped for - but this is what I have right now.

I think the performance problem is due to grep'ing and checking each line, as well as Cygwin itself.

The data:

----other lines to be ignored----

Wed Jul 23 17:00:00 GMT 2008 (to extract only the hour & minute: 17:00)

----other lines to be ignored----

----other lines to be ignored----

----other lines to be ignored----

vmstat 2 60: (to ignore, but marks the starting point of the data)

----2 other lines to be ignored----

----20 lines of data----

----2 other lines to be ignored----

----20 lines of data----

----2 other lines to be ignored----

----20 lines of data----

END (the characters END show that collection is complete)

*The above data repeats 96 times at different intervals - i.e. every 15 minutes throughout the day.

Explanation of the data I need:

1. The script needs to scan thru the file until it finds the date line containing either GMT or SAST. The hour and minute need to be stored in a variable, i.e. int=17:00

2. Scan further down the file until the text "vmstat 2 60" is found. This line shows that data will follow.

3. Ignore two heading lines, which contain the text "procs" and "avm" respectively.

4. Hereafter 20 lines of data follow. I need to extract columns 16 and 17 - and add them together.

If this was one of my data lines:

2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95

I would want to add 3 and 2 to give me a value of 5.

There will be 60 lines of actual stats, which need to be added together and divided by 60 to provide an average.

5. I would now like to write this as a record to a file.

Output should look like this:

17:00,5

My file will contain 96 records per day
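To make step 4 concrete, here is a minimal Perl sketch (the helper name is my own; it assumes each data line is 18 whitespace-separated columns, as in the sample line above):

```perl
use strict;
use warnings;

# Hypothetical helper: given one vmstat data line, return the sum of
# columns 16 and 17 (1-based), i.e. the us and sy CPU fields.
sub us_plus_sy {
    my ($line) = @_;
    my @f = split ' ', $line;    # split on any run of whitespace
    return $f[15] + $f[16];      # 0-based indices for columns 16 and 17
}

# For the example data line above this returns 3 + 2 = 5.
print us_plus_sy("2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95"), "\n";
```

Summing that over the 60 data lines of one interval and dividing by 60 gives the value for the record.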

Here follows an example of data from vmstat:

vmstat 2 60:
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95
2 0 0 517725 5675 111 3 292 374 73 0 12900 2946 7497 327 11 10 78
2 0 0 517725 5669 71 1 188 239 46 0 8256 2035 4983 246 10 0 89
1 0 0 544051 5478 58 0 130 152 28 0 5283 1502 3696 202 0 2 98
1 0 0 544051 5477 36 0 84 96 17 0 3380 1116 2453 155 0 0 100
1 0 0 544051 5515 28 0 55 60 10 0 2163 884 1785 132 0 1 99
1 0 0 544051 5477 33 2 36 38 6 0 1384 741 2877 131 2 4 94
1 0 0 544051 5539 22 0 23 24 3 0 885 625 1965 110 0 0 100
1 1 0 522972 5539 13 0 15 15 1 0 566 551 1318 96 0 0 100
1 1 0 522972 5539 8 0 9 9 0 0 361 500 966 89 12 0 88
1 1 0 522972 5535 20 1 11 5 0 0 230 487 1059 103 0 1 98
1 1 0 522972 5535 20 1 8 3 0 0 147 473 2430 99 2 3 95
1 1 0 522972 5514 21 0 14 1 0 0 93 467 2225 99 1 0 99
1 1 0 385532 5023 82 1 28 0 0 0 59 480 1760 147 5 7 88
1 1 0 385532 3745 70 0 54 0 0 0 37 1734 2142 282 21 5 74
1 1 0 385532 5479 112 0 87 0 0 0 23 1503 2859 331 4 8 88
1 1 0 385532 5407 86 1 58 0 0 0 14 1557 3889 302 3 6 91
1 1 0 385532 5407 55 0 37 0 0 0 8 1153 2650 220 0 0 100
1 1 0 434602 5407 35 0 23 0 0 0 4 894 1795 167 0 0 100
1 1 0 434602 5407 22 0 14 0 0 0 2 725 1208 131 0 0 100
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
1 1 0 434602 5390 84 0 74 0 0 0 0 1321 1672 178 7 10 83
1 1 0 434602 5389 63 1 48 0 0 0 0 1245 2951 172 2 4 95
1 1 0 434602 5389 40 0 31 0 0 0 0 951 1982 135 0 0 100
1 1 0 370995 5389 25 0 19 0 0 0 0 766 1361 112 0 0 100
1 1 0 370995 4561 109 0 70 0 0 0 0 1125 1626 138 10 13 76
1 1 0 370995 5381 140 0 84 0 0 0 0 1906 4289 197 5 5 90
1 1 0 370995 5381 99 1 54 0 0 0 0 1468 4622 168 3 2 95
1 1 0 370995 5381 64 0 35 0 0 0 0 1105 3187 142 2 0 98
1 1 0 460130 5377 40 0 23 0 0 0 0 866 2177 127 0 0 100
1 1 0 460130 5378 117 0 65 0 0 0 0 819 2229 139 6 9 85
1 1 0 460130 5377 74 0 42 0 0 0 0 964 1564 145 0 0 100
1 1 0 460130 5377 47 0 26 0 0 0 0 776 1049 120 2 3 95
1 1 0 460130 5377 38 0 17 0 0 0 0 666 2198 111 0 0 100
1 1 0 491926 5377 24 0 11 0 0 0 0 580 1510 97 4 2 95
1 1 0 491926 5377 89 0 48 0 0 0 0 989 1686 150 1 7 91
1 1 0 491926 5377 56 0 31 0 0 0 0 789 1162 122 0 0 100
1 1 0 491926 5377 35 0 20 0 0 0 0 660 842 106 3 2 94
1 1 0 491926 5377 30 0 14 0 0 0 0 579 2037 100 2 1 96
2 0 0 327196 5378 93 0 50 0 0 0 0 973 2086 156 2 8 89
2 0 0 327196 5377 59 0 32 0 0 0 0 776 1426 126 0 0 100
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
2 0 0 327196 5377 37 0 20 0 0 0 0 650 965 106 0 0 100
2 0 0 327196 5377 23 0 13 0 0 0 0 566 693 92 4 4 92
2 0 0 327196 5377 97 1 50 0 0 0 0 978 2673 159 3 10 87
1 1 0 251674 5377 62 0 32 0 0 0 0 783 1801 136 0 0 100
1 1 0 251674 5377 39 0 21 0 0 0 0 655 1259 112 0 0 100
1 1 0 251674 5369 24 0 15 0 0 0 0 580 894 100 1 0 98
1 1 0 251674 5168 186 0 103 0 0 0 0 909 1955 152 9 13 78
1 1 0 251674 5420 130 1 67 0 0 0 0 776 3148 142 2 3 95
1 1 0 370259 5420 83 0 43 0 0 0 0 654 2105 119 0 0 100
1 1 0 370259 5382 57 0 27 0 0 0 0 602 1550 108 0 2 98
1 1 0 370259 5428 39 1 17 0 0 0 0 552 1183 102 1 1 98
1 1 0 370259 5428 29 1 11 0 0 0 0 507 1013 96 0 0 100
1 1 0 370259 5383 33 1 6 0 0 0 0 483 2661 102 3 4 93
1 1 0 466781 5428 24 1 4 0 0 0 0 581 1944 130 0 0 100
1 1 0 466781 5428 16 0 2 0 0 0 0 523 1337 107 0 0 100
1 1 0 466781 5423 9 0 3 0 0 0 0 487 909 93 0 1 99
1 1 0 466781 5397 11 0 1 0 0 0 0 505 823 91 0 0 100
1 1 0 466781 5395 30 3 2 0 0 0 0 515 2958 118 4 5 90
1 1 0 514735 5394 19 1 2 0 0 0 0 482 2044 116 0 0 100
1 1 0 514735 5394 12 0 1 0 0 0 0 466 1406 100 0 0 100
END

6. The text "END" shows the end of data. Hereafter we can search for the next data again.

Here is my shell script:

# This file is used to breakup the original file into useful information
#
# File format:
# int,cpu%

>$1.out
stat=0
cpu=0
scpu=0
rec=0

echo "Starting rebuild of $1 into $1.out"

while read line
do
  # Get date
  if [ `echo $line | egrep 'GMT|SAST' | wc -l` -eq 1 ]
  then
    int=`echo $line|cut -c12-16`
  fi

  # Get vmstat 2 60 data
  if [ `echo $line | grep "vmstat 2 60" | wc -l` -eq 1 ]
  then
    stat=1
  fi

  # If stat=1 - entered into stat data
  if [ $stat -eq 1 ] && [ `echo $line | egrep 'vmstat|procs|avm|END' | wc -l` -eq 0 ]
  then
    scpu=`echo $line | awk '{ print $16 "+" $17 }'|bc`
    cpu=`expr $cpu + $scpu`
  fi

  # END of data string
  if [ `echo $line | grep END | wc -l` -eq 1 ]
  then
    cpu=`expr $cpu / 60`
    # Write data line
    echo "$int,$cpu" >>$1.out
    stat=0
    int=0
    cpu=0
    scpu=0
    rec=`expr $rec + 1`
    echo "`date`:Wrote record: $rec"
  fi
done < $1

echo "Complete!"

Is there anyone that can help?

I have a couple of files to do. According to my calculations, it will take 41.6 hours to do all the files for one host. I have 9 hosts to do, which gives me about 15.6 days??!!

Regards,

Acidblood

Replies are listed 'Best First'.
Re: Extracting data from a messy file (slow performance)
by kyle (Abbot) on Aug 06, 2008 at 20:30 UTC

    Here are some of the things I think you'll want to know as you write a Perl script to do what your shell script does:

    • Start with "use strict;use warnings;". Read Use strict and warnings for details. You'll have to use my because of this.
    • You'll probably want a loop that starts with "while ( my $line = <> )". You may want to use open.
    • chomp
    • split
    • Regular expressions will be very useful, so have a look at perlretut. For example:
      if ( $line =~ /(\d+:\d\d):\d\d (?:GMT|SAST)/ ) { my $the_hour_and_minute = $1; }
    • I'd be tempted to use the range operator to capture things between /^vmstat/ and /^END/, but using a flag as you have already will still work.
    • print, warn, die

    I'll note also that vmstat has a -n option which causes it to output the header only once. You could send that through tail before it ever gets to the log, and your processor would not have to deal with those lines at all.
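    For what it's worth, the range-operator idea could be sketched like this (an untested sketch; the function name is my own, and the column positions are taken from the original shell script):

```perl
use strict;
use warnings;

# Takes log lines, returns "HH:MM,avg" records, one per vmstat block.
sub extract_records {
    my @out;
    my ( $int, $total, $count ) = ( '', 0, 0 );
    for my $line (@_) {

        # Remember the most recent timestamp (hour:minute only).
        $int = $1 if $line =~ /(\d\d:\d\d):\d\d\s+(?:GMT|SAST)/;

        # Flip-flop: true from the "vmstat 2 60" line through "END".
        if ( $line =~ /^vmstat 2 60/ .. $line =~ /^END/ ) {
            if ( $line =~ /^END/ ) {
                push @out, sprintf "%s,%d", $int, $total / $count if $count;
                ( $total, $count ) = ( 0, 0 );
                next;
            }
            next if $line =~ /vmstat|procs|avm/;    # marker and header lines

            my @f = split ' ', $line;
            next unless @f >= 17;                   # skip anything malformed
            $total += $f[15] + $f[16];              # us + sy (1-based cols 16, 17)
            $count++;
        }
    }
    return @out;
}
```

    It divides by the number of data lines seen, which in the OP's files is always 60, so the result matches the "divide by 60" spec.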

Re: Extracting data from a messy file (slow performance)
by Corion (Patriarch) on Aug 06, 2008 at 20:01 UTC

    I suggest you rewrite that program in Perl, or awk. Where exactly are you having problems? This website is not a "script writing site" where we produce Perl scripts.

Re: Extracting data from a messy file (slow performance)
by Anonymous Monk on Aug 07, 2008 at 01:18 UTC
    At the risk of annoying the Perl Monks, and talking about non-Perl things: your script is not very optimised.

    1. Use AWK - it was designed for this sort of thing.
    2. If you really want to use shell, look into cutting down the number of external calls to grep/egrep/wc, etc., and don't post here...
    i.e. instead of
    if [ `echo $line | grep "vmstat 2 60" | wc -l` -eq 1 ]
    remove one external call by letting grep do the counting:
    if [ `echo $line | grep -c "vmstat 2 60"` -eq 1 ]
    The same goes for setting 'stat=1'. Do it in one loop.
Re: Extracting data from a messy file (slow performance)
by ady (Deacon) on Aug 07, 2008 at 06:43 UTC
    A fast code skeleton to get you started.
    You've gotta do the proper file I/O, rounding and error handling to make this robust.
    Best regards / allan
      Hi All!

      Thank you very much for all your assistance! I'll be working on a perl script to get my processing done.

      This will be possible all thanks to all of you!

      I'll post the script when it's complete and working.

      Kind Regards,

      Acidblood

      Thank you ALL! - Especially to Allan!

      Here is my final working code if anyone was interested...

      #!/usr/bin/perl -w
      # Description: Custom script to strip stats from a messy file.
      # HP VERSION
      use strict;
      use Carp;    # needed for carp() under 'use strict' (note: carp only warns; croak would abort)

      my $int;
      my $vmstat;
      my $total;

      my $sfile = pop or carp("Usage: strip.pl [file]");
      my $ofile = "$sfile.out";

      open ODATA, ">$ofile" or carp("Can't open $ofile for writing");
      open DATA,  "<$sfile" or carp("Can't open $sfile");

      while ( <DATA> ) {
          # 1. Scan thru the file until it finds the date line containing
          # either GMT or SAST. The hour and minute are stored in a variable.
          # i.e. int=17:00
          $int = $1 if ( /(\d{2}:\d{2}:\d{2})\s+(GMT|SAST)/ );
          #$int = $1 if ( /(\d{2}:\d{2})\s+(GMT|SAST)/ );

          # 2. Scan further down the file until the text "vmstat 2 60" is found.
          # This line shows that data will follow.
          if ( /^\s*vmstat 2 60/ ) {
              # There will be 60 lines of actual stats, which need to be added
              # together and divided by 60 to provide an average.
              $total = 0;
              for my $i (1..3) {    # 3 x 20
                  # 3. Ignore two heading lines, containing "procs" and "avm" respectively.
                  <DATA> =~ /procs/ or die "Input error $_\n";
                  <DATA> =~ /avm/   or die "Input error $_\n";

                  # 4. Hereafter 20 lines of data follow.
                  # Extract columns 16 and 17 - and add them together.
                  for my $i (1..20) {
                      (<DATA> =~ / (\d+\s+){15}(\d+)\s+(\d+) /);
                      my ($us, $sy) = ($2, $3);
                      my $sum = $us + $sy;
                      $total += $sum;
                      #print "$i:\t$us, $sy, $sum, $total, ", $total/60, "\n";
                  }
              }
              # 5. Write this as a record to a file.
              my ($h, $m, $s) = split /:/, $int;
              $s += $total / 60;
              #print ODATA "\n***$int -> ", join (':', $h,$m,$s);
              print ODATA "$h:$m,$s\n";
          }
      }
      close DATA;
      close ODATA;

      Great to know all of you!

      Kind regards,

      Acidblood

        acidblood,
        I have not read this thread. I just happened to see this node and something caught my eye - <DATA>.

        You probably should avoid using <DATA> for a number of reasons:

        • Same filehandle as __DATA__ - see SelfLoader
        • Should use a lexical file handle - see open
        • Should be using 3 arg open - see open

        Cheers - L~R

Re: Extracting data from a messy file (slow performance)
by Bloodnok (Vicar) on Aug 07, 2008 at 16:48 UTC
    Sub-processes cost time in shell scripts - try using a case instead of the multiple if [...] statements - case is a built-in and is thus far faster than the sub-processes necessary to execute if [...].

    Maybe something (untested) along the lines of...

    >$1.out
    # initialise counters (with the original 'unset', expr would fail on first use)
    stat=0 cpu=0 scpu=0 rec=0
    echo "Starting rebuild of $1 into $1.out"
    while read line ; do
      case $line in
        END)
          cpu=`expr $cpu / 60`
          # Write data line
          echo "$int,$cpu" >>$1.out
          stat=0 cpu=0 scpu=0
          rec=`expr $rec + 1`
          echo "`date`:Wrote record: $rec"
          ;;
        *GMT* | \
        *SAST*)
          # trailing * so the year after GMT/SAST still matches
          int=`echo $line|cut -c12-16`
          ;;
        vmstat\ 2\ 60:*)
          stat=1
          ;;
        procs* | \
        *avm*)
          : # ignore heading lines
          ;;
        *)
          scpu=`echo $line | awk '{ print $16 "+" $17 }'|bc`
          cpu=`expr $cpu + $scpu`
          ;;
      esac
    done < $1
    echo "Complete!"

    HTH ,

    A user level that continues to overstate my experience :-))

Node Type: perlquestion [id://702721]
Approved by Corion