OzzyOsbourne has asked for the wisdom of the Perl Monks concerning the following question:

...And it hurts me terribly

I wrote a script a while back that goes out to all my servers and logs various media files to text files, one file per server. At the same time I wrote this script to sort those server files by media type. I recently added .wma files to the mix.

Most of the file types take a minute or less to complete, with the exception of WMA, which can take 20 minutes. The WMA log is never bigger than the other logs, so I don't understand the time difference unless there is something going on in the regex that I just don't see.

I'll be reading up on this, but...

Anyone have any ideas?

#sifts all the server logs based on media type

@servers=('XXXXX','YYYYYYY','ZZZZZZZ');
$dir1='//workstation/share/directory/';
@types=('mp3','avi','mpg','mpe','wav','mov','rmj','zip','exe','wma');

foreach $type (@types){
        $total=0;
        open OUT, ">$dir1/sifted/$type\.txt" or die "Cannot open $out for write :$!";
        foreach $server (@servers){
                $in="$dir1/$server\.txt";
                open IN,"$in" or next;
                @input=<IN>;
                chomp @input;
                foreach (@input){
                        if (/\.$type/i){
                                $kbytes = (stat)[7]/1000; #GETS THE 9TH ELEMENT OF file STAT - THE SIZE IN BYTES
                                $total=$total+$kbytes;
                                print OUT "$_\t$kbytes KB\n";
                        }
                }
                close $in;
        }
        $mbytes=$total/1000;
        print OUT "\n\nTotal: $mbytes MB\n";
        close $out;
        print "Finished $type...\n";
}

Thanks again, everyone.

-OzzyOsbourne

Replies are listed 'Best First'.
Re (tilly) 1: I have Wma file jammed in my regex
by tilly (Archbishop) on Jan 19, 2001 at 19:27 UTC
    I would guess that the slowness is in statting your wma files. Do you have more of those than others? Are they grouped on the same server? In a particularly big directory? Is stat failing on a lot of them?
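One way to check the stat guess is to time the stat calls on their own and count how many fail. The following is only a minimal sketch, assuming Time::HiRes is installed (it ships with Perl 5.8 and later); `time_stats` is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: stat each path and return the elapsed seconds,
# the number of successful stats, and the number of failures.
sub time_stats {
    my @paths = @_;
    my ($ok, $failed) = (0, 0);
    my $start = Time::HiRes::time();
    for my $path (@paths) {
        my @st = stat $path;          # empty list when stat fails
        if (@st) { $ok++ } else { $failed++ }
    }
    return (Time::HiRes::time() - $start, $ok, $failed);
}

# Example: the current directory exists; the second path presumably does not.
my ($secs, $ok, $failed) = time_stats('.', '/no/such/file.wma');
printf "%d ok, %d failed, %.3f s\n", $ok, $failed, $secs;
```

Feeding it the filenames from one type's sifted log at a time would show whether the wma entries are slow to stat, failing to stat, or neither.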

    What follows is an untested but cleaned up version that works incrementally. Were I maintaining this long-term I might declare multiple passes through the log files (one for each type) to be a mistake and I would do one scan for all types at once. But if it is good enough, this is simple.

    use strict;
    #sifts all the server logs based on media type

    my @servers=('XXXXX','YYYYYYY','ZZZZZZZ');
    my $dir1='//workstation/share/directory/';
    my @types= qw(mp3 avi mpg mpe wav mov rmj zip exe wma);

    foreach my $type (@types){
      my $total=0;
      my $out = "$dir1/sifted/$type.txt";
      open (OUT, "> $out") or die "Cannot write to '$out':$!";
      foreach my $server (@servers){
        my $in = "$dir1/$server\.txt";
        unless(open IN,"< $in") {
          warn " Cannot read from '$in': $!";
          next;
        }
        my $re = qr/\.$type\z/i; # I assume this is what you want?
        while (<IN>) {
          chomp;
          if (/$re/){
            # Get filesize; stat returns an empty list on failure,
            # so test the size before dividing
            my $size = (stat)[7];
            if (defined($size)) {
              my $kbytes = $size/1024;
              $total += $kbytes;
              print OUT "$_\t$kbytes KB\n";
            }
            else {
              print OUT "$_\tNOT FOUND\n";
            }
          }
        }
      }
      my $mbytes = $total/1024;
      print OUT "\n\nTotal: $mbytes MB\n";
      print "Finished $type...\n";
    }
    BTW some points.
    1. You seem to have some misconceptions about what you are supposed to call close on, which have not been biting you because Perl has done a good job of figuring out when to call it itself.
    2. You used an 8-space indent. I recommend less. In studies the most "aesthetically pleasing" indent was 6. However comprehension appears to be best in the 2-4 range. Consistency matters more here than what particular choice you make. I happen to use 2.
    3. You obviously want failing to read a server file to be a graceful error. Even so you probably should be reporting it.
    4. I used qr// for the RE. This avoids compiling multiple times and is faster. Also by saying that it can only match at the end of the string the RE engine knows it can be smart and just jump to the end rather than scanning the whole string.
    5. Working incrementally through log files is much more memory efficient than slurping them into memory.
    6. There are 1024 (ie 2**10) bytes in a K, and 1024 K in a Meg.
    7. Just adding strict on this caught several real mistakes. (Such as your writing to a different filename than would have been reported in your die.)
    8. I prefer having explicit statements of when things were not found. That provides something you can grep for later.
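To illustrate point 4, here is a small sketch of how the anchored pattern differs from the unanchored one. The log entries are made up for the example, and note that `\z` only works as intended once the line has been chomped.

```perl
use strict;
use warnings;

my $type     = 'wma';
my $loose    = qr/\.$type/i;      # matches ".wma" anywhere in the line
my $anchored = qr/\.$type\z/i;    # matches only when the line ends in ".wma"

# Hypothetical log entries, already chomped.
my @lines = (
    '//server/share/song.WMA',
    '//server/share/song.wma.bak',
    '//server/share/notes.txt',
);

my @loose_hits    = grep { $_ =~ $loose }    @lines;
my @anchored_hits = grep { $_ =~ $anchored } @lines;

print scalar(@loose_hits), " loose, ", scalar(@anchored_hits), " anchored\n";
# prints "2 loose, 1 anchored"
```

The unanchored pattern also picks up `song.wma.bak`, which is probably not a media file you want sized and totalled.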
      There are 1024 (ie 2**10) bytes in a K, and 1024 K in a Meg.

      Aren't there only 1000 K in a Megabyte? I thought the reason for making 1024B/K didn't apply in the K2M case..

      --

      (nit) (nit) (nit) (nit) (nit) (nit)
         ^
         |
         +--------------I pick this one!
      --

        No, strictly speaking there are 1024 bytes per kilobyte, 1024 kilobytes per megabyte, 1024 megabytes per gigabyte, etc. The only people who regularly change these rules work in the marketing department of hard drive manufacturers.
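For what it's worth, here is a quick sketch of how much the two divisors change the reported totals. The byte count is an arbitrary example value, not taken from any of the logs.

```perl
use strict;
use warnings;

my $bytes = 5 * 1024 * 1024;           # 5,242,880 bytes, an arbitrary example

# Binary convention argued for above: divide by 1024 at each step.
my $kb_binary = $bytes / 1024;         # 5120
my $mb_binary = $kb_binary / 1024;     # 5

# Decimal convention from the original script: divide by 1000 at each step.
my $kb_decimal = $bytes / 1000;
my $mb_decimal = $kb_decimal / 1000;

print "binary:  $kb_binary KB, $mb_binary MB\n";
print "decimal: $kb_decimal KB, $mb_decimal MB\n";
```

The gap is under 5% at the megabyte level, so it changes the totals in the report but has no bearing on the run time.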

Re: I have Wma file jammed in my regex
by Trimbach (Curate) on Jan 19, 2001 at 18:57 UTC
    I don't know if this will solve your problem, but it seems like you're opening each server log once for each data type. That is a pretty big waste of time, seeing as how your media type list is short and your server logs are (almost certainly) a lot bigger.

    A better way to do it would be to put the loop for your servers on the OUTSIDE, and then loop (or regex, or whatever) through each data type. This means your computer doesn't have to open (and read!) each server log more than once. Kinda like this pseudo-code:

    for my $server (@servers) {
        my (@output, %found_total);     # Initialize for each run
        open IN, $server or next;
        while (<IN>) {                  # Keeps from loading the whole file in
            chomp;                      # stat needs the name without the newline
            for my $type (@mediatypes) {
                if (/\.$type/i) {
                    my $kbytes = (stat)[7]/1000;
                    $found_total{$type} += $kbytes;
                    push @output, "$_ : $kbytes";
                }
            }
        }
        close IN;
        # Output your @output array here if you want,
        # and/or %found_total (which contains the kbyte totals)
    }
    You'll probably be able to speed this up even more if you just put in an all encompassing regex instead of a loop for the media types. But however you do it you'll be better off putting the big things on the outside, and the little things on the inside.
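A rough sketch of the all-encompassing regex idea, under a few assumptions: the type list is the one from the original post, `sift` is a hypothetical helper name, and each path is assumed to be already chomped. The capture group records which extension matched, so one pass and one stat per file cover every type.

```perl
use strict;
use warnings;

my @types = qw(mp3 avi mpg mpe wav mov rmj zip exe wma);

# One alternation for every type, anchored at the end of the line;
# the capture tells us which extension matched.
my $alt = join '|', map { quotemeta } @types;
my $re  = qr/\.($alt)\z/i;

# Hypothetical helper: takes a list of chomped file paths (one log's worth)
# and returns a hash ref of type => total bytes, plus an array ref of
# matching paths that stat could not find.
sub sift {
    my @paths = @_;
    my (%total, @missing);
    for my $path (@paths) {
        next unless $path =~ $re;
        my $type = lc $1;
        my $size = (stat $path)[7];       # undef when stat fails
        if (defined $size) { $total{$type} += $size }
        else               { push @missing, $path }
    }
    return (\%total, \@missing);
}

# Example call; the first path presumably does not exist, the second is skipped.
my ($total, $missing) = sift('/no/such/file.wma', 'notes.txt');
print "missing: @$missing\n";
```

In the real script the paths would come from `while (<IN>)` over each server log, so every log is opened and read exactly once.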

    Gary Blackburn
    Trained Killer

Re: I have Wma file jammed in my regex
by ChOas (Curate) on Jan 19, 2001 at 18:41 UTC
    Hey!!!

    No offence, but is this pseudo code ??
    I mean it doesn't pass -w/strict in a million
    years...

    And I really want to help, but right now I wouldn't
    know where to start hacking into it
    I have to admit I hacked a bit, and now it passes strict
    and -w...

    My advice would be to join up your @types, and then optimize
    your regex with /o... other than that...

    maybe use this for your inner loop (untested, but I'm sure it's close)
    foreach my $server (@servers) {
      my $in="$server\.txt";
      open IN,"<$in" or next;
      chomp(my @CurType=grep /\.$type/i, <IN>);  # chomp so stat sees the bare filename
      close IN;
      for (@CurType) {
        my $kbytes=(stat)[7]/1000;
        $total+=$kbytes;
        print OUT "$_\t$kbytes KB\n";
      }
    }

    or maybe going per server instead of per type
    but I'm having too much difficulty with this code to REALLY
    optimize it (or see where you can gain speed otherwise) ;))

    Hope this helps, though

    GreetZ!,
      ChOas

    print "profeth still\n" if /bird|devil/;
Re: I have Wma file jammed in my regex
by OzzyOsbourne (Chaplain) on Jan 19, 2001 at 21:23 UTC

    Thanks for the pointers. I was embarrassed by all the responses, and I probably shouldn't have posted it. All of the responses were helpful in making changes to the code, but they did not solve the problem.

    I tried to find out what pseudo code was, but I think that that may have been some sort of insult.

    I'll get around to updating this properly when I figure out what's going on.

    Thanks again!

    #sifts all the server logs based on media type

    use strict;
    my ($type, $server,$out,$in,@input,$total,$kbytes,$mbytes);
    my @servers=('a','b','c','d','e','f','g','h','i','j','k','l','m','n',
            'o','p','q','r','s','t','u','v','w','x','y','z','aa','bb','cc','dd',
            'ee','ff','gg','hh','ii','jj','kk','ll','mm','nn','oo','pp');
    my $dir1='//worksatation/share/dir/';
    my @types=('mp3','avi','mpg','mpe','wav','mov','rmj','zip','exe','wma');

    foreach $type (@types){
            $total=0;
            open OUT, ">$dir1/sifted/$type\.txt" or die "Cannot open $out for write :$!";
            foreach $server (@servers){
                    $in="$dir1/$server\.txt";
                    open IN,"$in" or next;
                    @input=<IN>;
                    chomp @input;
                    foreach (@input){
                            if (/\.$type$/i){
                                    $kbytes = (stat)[7]/1024;
                                    $total+=$kbytes;
                                    print OUT "$_\t$kbytes KB\n";
                            }
                    }
                    close IN;
            }
            $mbytes=$total/1024;
            print OUT "\n\nTotal: $mbytes MB\n";
            close OUT;
            print "Finished $type...\n";
    }

    -OzzyOsbourne

      OzzyOsbourne wrote:
      I tried to find out what pseudo code was, but I think that that may have been some sort of insult.
      Pseudo-code is an informal "shorthand" method of designing a program. If you look in this post, you'll see two code samples, neither of which is code. The first is a primitive Warnier-Orr design and the second is a pseudo-code representation of a program of similar functionality. It's not an insult. Pseudo-code may be quite rigorous and strongly resemble the language that you plan to code in, or it may be very similar to natural language.
      while I read next line from file
          if line contains extension listed in my extension list
              write line to output file
              add one to lines copied
          else
              write line to error report
              add one to error count
          end if
      end while
      The above is a snippet of pseudo code that anyone who programs should be able to read. It's an easy way to understand your logic. After the pseudo-code is written, it's just a matter of translating to the finished output (preferably Perl :-)
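As an illustration of that translation step, here is one possible Perl rendering of the pseudo-code. It is only a sketch: the extension list, the helper name `copy_matching`, and the in-memory filehandles are all assumptions made for the example, and "contains extension" is interpreted here as "ends with the extension".

```perl
use strict;
use warnings;

my @extensions = qw(mp3 wma);                  # hypothetical extension list
my $alt = join '|', map { quotemeta } @extensions;
my $re  = qr/\.(?:$alt)\z/i;

# Hypothetical helper: copy matching lines from $in to $out and everything
# else to $err. Returns (lines copied, error count).
sub copy_matching {
    my ($in, $out, $err) = @_;
    my ($copied, $errors) = (0, 0);
    while (my $line = <$in>) {            # while I read next line from file
        chomp $line;
        if ($line =~ $re) {               # if line ends with a listed extension
            print {$out} "$line\n";       #   write line to output file
            $copied++;                    #   add one to lines copied
        }
        else {
            print {$err} "$line\n";       #   write line to error report
            $errors++;                    #   add one to error count
        }
    }
    return ($copied, $errors);
}

# Usage with in-memory filehandles standing in for real files.
my $log = "a.mp3\nb.txt\nc.WMA\n";
open my $in,  '<', \$log         or die "open: $!";
open my $out, '>', \my $out_text or die "open: $!";
open my $err, '>', \my $err_text or die "open: $!";
my ($copied, $errors) = copy_matching($in, $out, $err);
close $_ for $in, $out, $err;
print "copied $copied, errors $errors\n";
```

Each line of the pseudo-code maps onto one statement, which is the point of writing the pseudo-code first.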

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the link and check out our stats.