PerlMonks  

Extracting data from a messy file (slow performance)

by acidblood (Novice)
on Aug 06, 2008 at 19:54 UTC ( #702721=perlquestion )

acidblood has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow Monks!

I've searched PerlMonks and Google and couldn't find a solution. I'm still very new to Perl, so a lot of the commands are new to me...

I have a text file that basically contains stats, and I need to extract specific data from it. I have written a shell script which works, but it takes 2 minutes to extract 1 record.

I'm running Cygwin on top of Vista - I know that's not exactly what you hoped for - but this is what I have right now.

I think the performance problem is due to grep'ing and checking each line, as well as Cygwin itself.

The data:

----other lines to be ignored----

Wed Jul 23 17:00:00 GMT 2008 (to extract only the hour & minute: 17:00)

----other lines to be ignored----

----other lines to be ignored----

----other lines to be ignored----

vmstat 2 60: (to ignore, but marks the starting point of the data)

----2 other lines to be ignored----

----20 lines of data----

----2 other lines to be ignored----

----20 lines of data----

----2 other lines to be ignored----

----20 lines of data----

END (the characters END show that collection is complete)

*The above data repeats 96 times at different intervals - i.e. every 15 minutes throughout the day.

Explanation of the data I need:

1. The script needs to scan thru the file until it finds the date line containing either GMT or SAST. The hour and minute need to be stored in a variable, i.e. int=17:00

2. Scan further down the file until the text "vmstat 2 60" is found. This line shows that data will follow.

3. Ignore two heading lines, which contain the text "procs" and "avm" respectively.

4. Hereafter 20 lines of data follow. I need to extract columns 16 and 17 - and add them together.

If this was one of my data lines:

2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95

I would want to add 3 and 2 to give me a value of 5.

There will be 60 lines of actual stats, which need to be added together and divided by 60 to provide an average.

5. I would now like to write this as a record to a file.

Output should look like this:

17:00,5

My file will contain 96 records per day
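To make step 4 concrete, here is a minimal Perl sketch (the helper name is my own; it assumes each data line is 18 whitespace-separated columns, as in the sample line above):

```perl
use strict;
use warnings;

# Hypothetical helper: given one vmstat data line, return the sum of
# columns 16 and 17 (1-based), i.e. the us and sy CPU fields.
sub us_plus_sy {
    my ($line) = @_;
    my @f = split ' ', $line;    # split on any run of whitespace
    return $f[15] + $f[16];      # 0-based indices for columns 16 and 17
}

# For the example data line above this returns 3 + 2 = 5.
print us_plus_sy("2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95"), "\n";
```

Summing that over the 60 data lines of one interval and dividing by 60 gives the value for the record.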

Here follows an example of data from vmstat:

vmstat 2 60:
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
2 0 0 517725 4545 15 1 4 3 1 0 138 2389 1783 213 3 2 95
2 0 0 517725 5675 111 3 292 374 73 0 12900 2946 7497 327 11 10 78
2 0 0 517725 5669 71 1 188 239 46 0 8256 2035 4983 246 10 0 89
1 0 0 544051 5478 58 0 130 152 28 0 5283 1502 3696 202 0 2 98
1 0 0 544051 5477 36 0 84 96 17 0 3380 1116 2453 155 0 0 100
1 0 0 544051 5515 28 0 55 60 10 0 2163 884 1785 132 0 1 99
1 0 0 544051 5477 33 2 36 38 6 0 1384 741 2877 131 2 4 94
1 0 0 544051 5539 22 0 23 24 3 0 885 625 1965 110 0 0 100
1 1 0 522972 5539 13 0 15 15 1 0 566 551 1318 96 0 0 100
1 1 0 522972 5539 8 0 9 9 0 0 361 500 966 89 12 0 88
1 1 0 522972 5535 20 1 11 5 0 0 230 487 1059 103 0 1 98
1 1 0 522972 5535 20 1 8 3 0 0 147 473 2430 99 2 3 95
1 1 0 522972 5514 21 0 14 1 0 0 93 467 2225 99 1 0 99
1 1 0 385532 5023 82 1 28 0 0 0 59 480 1760 147 5 7 88
1 1 0 385532 3745 70 0 54 0 0 0 37 1734 2142 282 21 5 74
1 1 0 385532 5479 112 0 87 0 0 0 23 1503 2859 331 4 8 88
1 1 0 385532 5407 86 1 58 0 0 0 14 1557 3889 302 3 6 91
1 1 0 385532 5407 55 0 37 0 0 0 8 1153 2650 220 0 0 100
1 1 0 434602 5407 35 0 23 0 0 0 4 894 1795 167 0 0 100
1 1 0 434602 5407 22 0 14 0 0 0 2 725 1208 131 0 0 100
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
1 1 0 434602 5390 84 0 74 0 0 0 0 1321 1672 178 7 10 83
1 1 0 434602 5389 63 1 48 0 0 0 0 1245 2951 172 2 4 95
1 1 0 434602 5389 40 0 31 0 0 0 0 951 1982 135 0 0 100
1 1 0 370995 5389 25 0 19 0 0 0 0 766 1361 112 0 0 100
1 1 0 370995 4561 109 0 70 0 0 0 0 1125 1626 138 10 13 76
1 1 0 370995 5381 140 0 84 0 0 0 0 1906 4289 197 5 5 90
1 1 0 370995 5381 99 1 54 0 0 0 0 1468 4622 168 3 2 95
1 1 0 370995 5381 64 0 35 0 0 0 0 1105 3187 142 2 0 98
1 1 0 460130 5377 40 0 23 0 0 0 0 866 2177 127 0 0 100
1 1 0 460130 5378 117 0 65 0 0 0 0 819 2229 139 6 9 85
1 1 0 460130 5377 74 0 42 0 0 0 0 964 1564 145 0 0 100
1 1 0 460130 5377 47 0 26 0 0 0 0 776 1049 120 2 3 95
1 1 0 460130 5377 38 0 17 0 0 0 0 666 2198 111 0 0 100
1 1 0 491926 5377 24 0 11 0 0 0 0 580 1510 97 4 2 95
1 1 0 491926 5377 89 0 48 0 0 0 0 989 1686 150 1 7 91
1 1 0 491926 5377 56 0 31 0 0 0 0 789 1162 122 0 0 100
1 1 0 491926 5377 35 0 20 0 0 0 0 660 842 106 3 2 94
1 1 0 491926 5377 30 0 14 0 0 0 0 579 2037 100 2 1 96
2 0 0 327196 5378 93 0 50 0 0 0 0 973 2086 156 2 8 89
2 0 0 327196 5377 59 0 32 0 0 0 0 776 1426 126 0 0 100
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
2 0 0 327196 5377 37 0 20 0 0 0 0 650 965 106 0 0 100
2 0 0 327196 5377 23 0 13 0 0 0 0 566 693 92 4 4 92
2 0 0 327196 5377 97 1 50 0 0 0 0 978 2673 159 3 10 87
1 1 0 251674 5377 62 0 32 0 0 0 0 783 1801 136 0 0 100
1 1 0 251674 5377 39 0 21 0 0 0 0 655 1259 112 0 0 100
1 1 0 251674 5369 24 0 15 0 0 0 0 580 894 100 1 0 98
1 1 0 251674 5168 186 0 103 0 0 0 0 909 1955 152 9 13 78
1 1 0 251674 5420 130 1 67 0 0 0 0 776 3148 142 2 3 95
1 1 0 370259 5420 83 0 43 0 0 0 0 654 2105 119 0 0 100
1 1 0 370259 5382 57 0 27 0 0 0 0 602 1550 108 0 2 98
1 1 0 370259 5428 39 1 17 0 0 0 0 552 1183 102 1 1 98
1 1 0 370259 5428 29 1 11 0 0 0 0 507 1013 96 0 0 100
1 1 0 370259 5383 33 1 6 0 0 0 0 483 2661 102 3 4 93
1 1 0 466781 5428 24 1 4 0 0 0 0 581 1944 130 0 0 100
1 1 0 466781 5428 16 0 2 0 0 0 0 523 1337 107 0 0 100
1 1 0 466781 5423 9 0 3 0 0 0 0 487 909 93 0 1 99
1 1 0 466781 5397 11 0 1 0 0 0 0 505 823 91 0 0 100
1 1 0 466781 5395 30 3 2 0 0 0 0 515 2958 118 4 5 90
1 1 0 514735 5394 19 1 2 0 0 0 0 482 2044 116 0 0 100
1 1 0 514735 5394 12 0 1 0 0 0 0 466 1406 100 0 0 100
END

6. The text "END" shows the end of data. Hereafter we can search for the next data again.

Here is my shell script:

# This file is used to breakup the original file into useful information
#
# File format:
# int,cpu%

>$1.out
stat=0
cpu=0
scpu=0
rec=0

echo "Starting rebuild of $1 into $1.out"

while read line
do
  # Get date
  if [ `echo $line | egrep 'GMT|SAST' | wc -l` -eq 1 ]
  then
    int=`echo $line|cut -c12-16`
  fi

  # Get vmstat 2 60 data
  if [ `echo $line | grep "vmstat 2 60" | wc -l` -eq 1 ]
  then
    stat=1
  fi

  # If stat=1 - entered into stat data
  if [ $stat -eq 1 ] && [ `echo $line | egrep 'vmstat|procs|avm|END' | wc -l` -eq 0 ]
  then
    scpu=`echo $line | awk '{ print $16 "+" $17 }'|bc`
    cpu=`expr $cpu + $scpu`
  fi

  # END of data string
  if [ `echo $line | grep END | wc -l` -eq 1 ]
  then
    cpu=`expr $cpu / 60`
    # Write data line
    echo "$int,$cpu" >>$1.out
    stat=0
    int=0
    cpu=0
    scpu=0
    rec=`expr $rec + 1`
    echo "`date`:Wrote record: $rec"
  fi
done < $1

echo "Complete!"

Is there anyone that can help?

I have a couple of files to do. According to my calculations, it will take 41.6 hours to do all the files for one host. I have 9 hosts to do, which gives me about 15.6 days??!!

Regards,

Acidblood

Replies are listed 'Best First'.
Re: Extracting data from a messy file (slow performance)
by kyle (Abbot) on Aug 06, 2008 at 20:30 UTC

    Here are some of the things I think you'll want to know as you write a Perl script to do what your shell script does:

    • Start with "use strict;use warnings;". Read Use strict and warnings for details. You'll have to use my because of this.
    • You'll probably want a loop that starts with "while ( my $line = <> )". You may want to use open.
    • chomp
    • split
    • Regular expressions will be very useful, so have a look at perlretut. For example:
      if ( $line =~ /(\d+:\d\d):\d\d (?:GMT|SAST)/ ) { my $the_hour_and_minute = $1; }
    • I'd be tempted to use the range operator to capture things between /^vmstat/ and /^END/, but using a flag as you have already will still work.
    • print, warn, die

    I'll note also that vmstat has a -n option which causes it to output the header only once. You could send that through tail before it ever gets to the log, and your processor would not have to deal with those lines at all.
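    For what it's worth, the range-operator idea could be sketched like this (an untested sketch; the function name is my own, and the column positions are taken from the original shell script):

```perl
use strict;
use warnings;

# Takes log lines, returns "HH:MM,avg" records, one per vmstat block.
sub extract_records {
    my @out;
    my ( $int, $total, $count ) = ( '', 0, 0 );
    for my $line (@_) {

        # Remember the most recent timestamp (hour:minute only).
        $int = $1 if $line =~ /(\d\d:\d\d):\d\d\s+(?:GMT|SAST)/;

        # Flip-flop: true from the "vmstat 2 60" line through "END".
        if ( $line =~ /^vmstat 2 60/ .. $line =~ /^END/ ) {
            if ( $line =~ /^END/ ) {
                push @out, sprintf "%s,%d", $int, $total / $count if $count;
                ( $total, $count ) = ( 0, 0 );
                next;
            }
            next if $line =~ /vmstat|procs|avm/;    # marker and header lines

            my @f = split ' ', $line;
            next unless @f >= 17;                   # skip anything malformed
            $total += $f[15] + $f[16];              # us + sy (1-based cols 16, 17)
            $count++;
        }
    }
    return @out;
}
```

    It divides by the number of data lines seen, which in the OP's files is always 60, so the result matches the "divide by 60" spec.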

Re: Extracting data from a messy file (slow performance)
by Corion (Patriarch) on Aug 06, 2008 at 20:01 UTC

    I suggest you rewrite that program in Perl, or awk. Where exactly are you having problems? This website is not a "script writing site" where we produce Perl scripts.

Re: Extracting data from a messy file (slow performance)
by Anonymous Monk on Aug 07, 2008 at 01:18 UTC
    At the risk of annoying the Perl Monks, and talking about non-Perl things: your script is not very optimised.

    1. Use AWK - it was designed for this sort of thing.
    2. If you really want to use shell, look into cutting down the number of external calls to grep/egrep/wc, etc., and don't post here...
    i.e. instead of
    if [ `echo $line | grep "vmstat 2 60" | wc -l` -eq 1 ]
    remove one external call by letting grep do the counting:
    if [ `echo $line | grep -c "vmstat 2 60"` -eq 1 ]
    The same goes for setting 'stat=1'. Do it in one loop.
Re: Extracting data from a messy file (slow performance)
by ady (Deacon) on Aug 07, 2008 at 06:43 UTC
    A fast code skeleton to get you started.
    You've gotta do the proper file I/O, rounding and error handling to make this robust.
    Best regards / allan
      Hi All!

      Thank you very much for all your assistance! I'll be working on a perl script to get my processing done.

      This will be possible all thanks to all of you!

      I'll post the script when it's complete and working.

      Kind Regards,

      Acidblood

      Thank you ALL! - Especially to Allan!

      Here is my final working code if anyone was interested...

      #!/usr/bin/perl -w
      # Description: Custom script to strip stats from a messy file.
      # HP VERSION
      use strict;
      use Carp;    # needed for carp() under 'use strict' (note: carp only warns; croak would abort)

      my $int;
      my $vmstat;
      my $total;

      my $sfile = pop or carp("Usage: strip.pl [file]");
      my $ofile = "$sfile.out";

      open ODATA, ">$ofile" or carp("Can't open $ofile for writing");
      open DATA,  "<$sfile" or carp("Can't open $sfile");

      while ( <DATA> ) {
          # 1. Scan thru the file until it finds the date line containing
          # either GMT or SAST. The hour and minute are stored in a variable.
          # i.e. int=17:00
          $int = $1 if ( /(\d{2}:\d{2}:\d{2})\s+(GMT|SAST)/ );
          #$int = $1 if ( /(\d{2}:\d{2})\s+(GMT|SAST)/ );

          # 2. Scan further down the file until the text "vmstat 2 60" is found.
          # This line shows that data will follow.
          if ( /^\s*vmstat 2 60/ ) {
              # There will be 60 lines of actual stats, which need to be added
              # together and divided by 60 to provide an average.
              $total = 0;
              for my $i (1..3) {    # 3 x 20
                  # 3. Ignore two heading lines, containing "procs" and "avm" respectively.
                  <DATA> =~ /procs/ or die "Input error $_\n";
                  <DATA> =~ /avm/   or die "Input error $_\n";

                  # 4. Hereafter 20 lines of data follow.
                  # Extract columns 16 and 17 - and add them together.
                  for my $i (1..20) {
                      (<DATA> =~ / (\d+\s+){15}(\d+)\s+(\d+) /);
                      my ($us, $sy) = ($2, $3);
                      my $sum = $us + $sy;
                      $total += $sum;
                      #print "$i:\t$us, $sy, $sum, $total, ", $total/60, "\n";
                  }
              }
              # 5. Write this as a record to a file.
              my ($h, $m, $s) = split /:/, $int;
              $s += $total / 60;
              #print ODATA "\n***$int -> ", join (':', $h,$m,$s);
              print ODATA "$h:$m,$s\n";
          }
      }
      close DATA;
      close ODATA;

      Great to know all of you!

      Kind regards,

      Acidblood

        acidblood,
        I have not read this thread. I just happened to see this node and something caught my eye - <DATA>.

        You probably should avoid using <DATA> for a number of reasons:

        • Same filehandle as __DATA__ - see SelfLoader
        • Should use a lexical file handle - see open
        • Should be using 3 arg open - see open

        Cheers - L~R

Re: Extracting data from a messy file (slow performance)
by Bloodnok (Vicar) on Aug 07, 2008 at 16:48 UTC
    Sub-processes cost time in shell scripts - try using a case instead of the multiple if [...] statements - case is a built-in and is thus far faster than the sub-processes necessary to execute if [...].

    Maybe something (untested) along the lines of...

    >$1.out
    # initialise counters (with the original 'unset', expr would fail on first use)
    stat=0 cpu=0 scpu=0 rec=0
    echo "Starting rebuild of $1 into $1.out"
    while read line ; do
      case $line in
        END)
          cpu=`expr $cpu / 60`
          # Write data line
          echo "$int,$cpu" >>$1.out
          stat=0 cpu=0 scpu=0
          rec=`expr $rec + 1`
          echo "`date`:Wrote record: $rec"
          ;;
        *GMT* | \
        *SAST*)
          # trailing * so the year after GMT/SAST still matches
          int=`echo $line|cut -c12-16`
          ;;
        vmstat\ 2\ 60:*)
          stat=1
          ;;
        procs* | \
        *avm*)
          : # ignore heading lines
          ;;
        *)
          scpu=`echo $line | awk '{ print $16 "+" $17 }'|bc`
          cpu=`expr $cpu + $scpu`
          ;;
      esac
    done < $1
    echo "Complete!"

    HTH ,

    A user level that continues to overstate my experience :-))

Node Type: perlquestion [id://702721]
Approved by Corion