deprecated has asked for the wisdom of the Perl Monks concerning the following question:

Ah, fellow monks, I am a depraved regex junkie. I just can't get enough and find myself coding them for the heck of it. Is there help for me? But, I digress. Let us examine this code:
my ($ptid, $total, $used, $avail, $pct, $mp) = $element =~
    m!^(/dev/.+[0-9]+)    # which partition         $ptid
      \s+([0-9.]+[MGK])   # total size of partition $total
      \s+([0-9.]+[MGK])   # used space              $used
      \s+([0-9.]+[MGK])   # available space         $avail
      \s+(\d{2})%         # percent usage           $pct
      \s+(.*)$!x;         # mounting point          $mp
This regex is being used to match this data:
---- Sat Feb 3 12:01:01 EST 2001
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda7             904M  261M  598M  30% /
/dev/hda12            852M  378M  474M  44% /devel
/dev/hda10            9.8G  9.6G  256M  97% /home
/dev/hda9             1.8G  1.6G  225M  88% /home/dl
/dev/hda5             768M  751M   17M  98% /mnt/macos
/dev/hda8             3.9G  3.4G  304M  92% /usr
/dev/hda6             387M   93M  275M  25% /var
/dev/hdb5            1008M  591M  365M  62% /home/ftp
/dev/hdb6            1008M  209M  748M  22% /home/httpd
/dev/hdb9             1.5G  1.1G  358M  75% /mnt/build
/dev/hdb8             640M  456M  151M  75% /mnt/mp3
which is being repeated on an hourly cronjob. So this could easily turn into several megs (or even dozens of megs) of text. Therefore, speed will be an issue.

So I'm looking at this and see a pretty specific regex. I thought of substituting \S+ for .*. However, in Unix (NT compatibility, obviously, is not an issue here) mount points can include awful characters like *, \n, \a, and so on. So, basically, I see two flaws in the expression: first, the use of .* (and .+), and second, the \s+([0-9.]+[MGK]) capture is written out three times, which seems repetitive. Has anyone got some regex-tuning hints here?

thanks,
brother dep.

--
i am not cool enough to have a signature.

Replies are listed 'Best First'.
Re (tilly) 1: Getting rid of (.*) from a not-quite-complex regex.
by tilly (Archbishop) on Feb 03, 2001 at 23:01 UTC
    my ($ptid, $total, $used, $avail, $pct, $mp) = split /\s+/, $element, +6;
    Also unpack would be natural for this.
      You don't want to assume "fixed-width" columns; if the numbers are large then the format breaks. The whitespace is normally more consistent. Hey, is there a perl interface to statfs(2)?
Re: Getting rid of (.*) from a not-quite-complex regex.
by lemming (Priest) on Feb 03, 2001 at 23:13 UTC
    What about using split?
    next unless $element =~ m!^/!;
    my ($ptid, $total, $used, $avail, $pct, $mp) = split(' ', $element, 6);
    I assumed there was a while loop, hence the next. Since we shouldn't have to worry about spaces in any field except the sixth, we limit the split to six fields, so a mount point with a space in it stays in one piece. yech.
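To see what the limit buys you, here is a minimal sketch. The sample line and its mount point with an embedded space are made up for illustration; they are not from the original df output.

```perl
use strict;
use warnings;

# Hypothetical sample line; the mount point contains a space on purpose.
my $element = "/dev/hdb8   640M  456M  151M  75% /mnt/my music";

# split ' ' skips leading whitespace and splits on runs of whitespace.
# The limit of 6 means at most six fields come back, so everything after
# the fifth field lands in $mp unchanged.
my ($ptid, $total, $used, $avail, $pct, $mp) = split ' ', $element, 6;

print "$ptid is $pct full, mounted on '$mp'\n";
```

One thing to watch: with a limit, a trailing newline stays attached to the last field, so chomp the line first if you read it from a file.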
Re: Getting rid of (.*) from a not-quite-complex regex.
by dws (Chancellor) on Feb 03, 2001 at 23:45 UTC

    If you want a bit more precision than split, consider using unpack. The psgrep example on page 37 of Perl Cookbook is right on point for what you're doing.

    That example wraps a number of tricks into a small script, and is well worth the time to study.

Re: Getting rid of (.*) from a not-quite-complex regex.
by chipmunk (Parson) on Feb 04, 2001 at 22:43 UTC
    You say you don't want to replace the .* at the end with \S+, because the last column could contain arbitrary characters. However, that's not a problem for the first column, so you can replace the .+:
    my ($ptid, $total, $used, $avail, $pct, $mp) = $element =~
        m!^(/dev/\S+)         # which partition         $ptid
          \s+([\d.]+[MGK])    # total size of partition $total
          \s+([\d.]+[MGK])    # used space              $used
          \s+([\d.]+[MGK])    # available space         $avail
          \s+(\d{1,3})%       # percent usage           $pct
          \s+(.*)$!x;         # mounting point          $mp
    This will save the regex engine from having to match to the end of the line and then backtrack all the way back to the first column. I also changed 0-9 to \d in the character classes, and changed \d{2} to \d{1,3}, assuming you don't really want to ignore devices that are 100% full or less than 10% full.
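The \d{2} versus \d{1,3} point is easy to check. In this rough sketch both patterns are cut down to the percent column only, and the second and third sample lines are invented for the test.

```perl
use strict;
use warnings;

# Both patterns are reduced to the percent column for illustration.
my $two_digits = qr/\s(\d{2})%\s/;
my $any_pct    = qr/\s(\d{1,3})%\s/;

# The first line is from the data above; the other two are hypothetical.
my @lines = (
    "/dev/hda7   904M  261M  598M  30% /",
    "/dev/hda1    50M  2.0M   48M   5% /boot",
    "/dev/hdc1   650M  650M     0 100% /mnt/cdrom",
);

# Record the captured percentage, or note that the line was skipped.
my @old = map { /$two_digits/ ? $1 : 'no match' } @lines;
my @new = map { /$any_pct/    ? $1 : 'no match' } @lines;

print '\d{2}:   ', join(', ', @old), "\n";
print '\d{1,3}: ', join(', ', @new), "\n";
```

With \d{2} only the 30% line matches. The 5% and 100% lines are dropped silently, which is probably the last thing you want from a disk-usage monitor.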

    BTW, your regex isn't compatible across Unix platforms:

    Filesystem          Type   blocks      use    avail  %use  Mounted on
    /dev/root            xfs 16718720 11050536  5668184    67  /
    /dev/dsk/dks1d6s0    xfs  8758624  1968528  6790096    23  /usr/darmok
    AFS                  afs 14400000        0 14400000     0  /afs

    Anyway, I think tilly's and lemming's suggestion of using split with a limit is the best approach.

Re: Getting rid of (.*) from a not-quite-complex regex.
by jeroenes (Priest) on Feb 04, 2001 at 14:55 UTC
    Another way to go would be to throw it in a 3D-array:
    use SuperSplit;
    $array = supersplit('\s+', '\n', '\n----\n', \*DATA);
    You'll have to handle $array->[$n][0] and ->[$n][1] differently, because they contain the date and text line. Considering the fact that you already know that every new ---- means it is another hour, you could discard the descriptive lines.
    my $str = '';
    while ( <DATA> ) {
        $str .= $_ unless m|^[/-]|;
    }
    $array = supersplit('\s+', '\n', '\n----\n', $str);
    You could print the available space from the second filesystem for every hour like this:
    for (@$array) {
        print $_->[1][3]."\n";   # remember, indices start at zero
    }
    The supersplit module can be found here.
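If you would rather not pull in a module, the same hour/row/column layout can be built with the core split. This is only a sketch of the idea, not how SuperSplit does it internally; the two-hour log and the $reports name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-hour log in the format shown in the question.
my $log = <<'END';
---- Sat Feb 3 12:01:01 EST 2001
Filesystem Size Used Avail Use% Mounted on
/dev/hda7 904M 261M 598M 30% /
/dev/hda12 852M 378M 474M 44% /devel
---- Sat Feb 3 13:01:01 EST 2001
Filesystem Size Used Avail Use% Mounted on
/dev/hda7 904M 262M 597M 30% /
/dev/hda12 852M 380M 472M 44% /devel
END

# $reports->[hour][row][column]; only lines starting with "/" are stored,
# so the date and header lines never enter the array.
my $reports = [];
for my $line (split /\n/, $log) {
    if ($line =~ /^----/) {
        push @$reports, [];                      # a new hourly block starts
    }
    elsif (@$reports && $line =~ m{^/}) {
        push @{ $reports->[-1] }, [ split ' ', $line, 6 ];
    }
}

# Available space of the second filesystem, once per hour.
print "$_->[1][3]\n" for @$reports;
```

Because the descriptive lines are skipped while reading, row 0 is already the first filesystem and no special handling of the first two rows is needed.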

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)