Getting rid of (.*) from a not-quite-complex regex.

deprecated has asked for the wisdom of the Perl Monks concerning the following question:

Ah, fellow monks, I am a depraved regex junky. I just can't get enough and find myself coding them for the heck of it. Is there help for me? But I digress. Let us examine this code:

    my ($ptid, $total, $used, $avail, $pct, $mp) = $element =~
      m!^(/dev/.+[0-9]+)      # which partition $ptid
          \s+([0-9.]+[MGK])   # total size of partition $total
          \s+([0-9.]+[MGK])   # used space $used
          \s+([0-9.]+[MGK])   # available space $avail
          \s+(\d{2})%         # percent usage $pct
          \s+(.*)$!x;         # mounting point $mp

This regex is being used to match this data:

----
Sat Feb 3 12:01:01 EST 2001
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda7             904M  261M  598M  30% /
/dev/hda12            852M  378M  474M  44% /devel
/dev/hda10            9.8G  9.6G  256M  97% /home
/dev/hda9             1.8G  1.6G  225M  88% /home/dl
/dev/hda5             768M  751M   17M  98% /mnt/macos
/dev/hda8             3.9G  3.4G  304M  92% /usr
/dev/hda6             387M   93M  275M  25% /var
/dev/hdb5            1008M  591M  365M  62% /home/ftp
/dev/hdb6            1008M  209M  748M  22% /home/httpd
/dev/hdb9             1.5G  1.1G  358M  75% /mnt/build
/dev/hdb8             640M  456M  151M  75% /mnt/mp3

which is repeated by an hourly cron job. So this could easily turn into several megabytes (or even dozens of megabytes) of text. Therefore, speed will be an issue.

So I'm looking at this and see a pretty specific regex. I thought of substituting \S+ for .*. However, in Unix (NT compatibility, obviously, is not an issue here) mount points can include awful characters like *, \n, \a, and so on. So, basically, I see two flaws in the expression: first, the use of .* (and .+), and second, the \s+([0-9.]+[MGK]) capture, which is repeated three times. Has anyone got some regex-tuning hints here?

thanks,
brother dep.

--
i am not cool enough to have a signature.


Replies are listed 'Best First'.
Re (tilly) 1: Getting rid of (.*) from a not-quite-complex regex. by tilly (Archbishop) on Feb 03, 2001 at 23:01 UTC
`my ($ptid, $total, $used, $avail, $pct, $mp) = split /\s+/, $element, 6;` Also, unpack would be natural for this.
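To see what the limit buys you, here is a minimal sketch. The mount point `/mnt/my build` is made up for illustration (none of the mount points in the sample data contain a space); the split itself follows the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical df line whose mount point contains a space.
my $element = '/dev/hdb9             1.5G  1.1G  358M  75% /mnt/my build';

# With a limit of 6, everything after the fifth separator stays in one field.
my @limited = split /\s+/, $element, 6;

# Without a limit, the mount point is broken apart at its space.
my @unlimited = split /\s+/, $element;

printf "limited:   %d fields, mount point '%s'\n", scalar @limited,   $limited[5];
printf "unlimited: %d fields, last field '%s'\n",  scalar @unlimited, $unlimited[-1];
```

The limit only protects whitespace in the last column, which is the only column where the original question expects odd characters.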
Re: Re (tilly) 1: Getting rid of (.*) from a not-quite-complex regex. by Anonymous Monk on Feb 05, 2001 at 08:16 UTC
You don't want to assume "fixed-width" columns; if the numbers are large then the format breaks. The whitespace is normally more consistent. Hey, is there a Perl interface to statfs(2)?
Re: Re: Re (tilly) 1: Getting rid of (.*) from a not-quite-complex regex. by eg (Friar) on Feb 05, 2001 at 08:21 UTC
File::Df and/or Filesys::Df.
Re: Getting rid of (.*) from a not-quite-complex regex. by lemming (Priest) on Feb 03, 2001 at 23:13 UTC
What about using split?

    next unless $element =~ m!^/!;
    my ($ptid, $total, $used, $avail, $pct, $mp) = split(' ', $element, 6);

I assumed there was a while loop, hence the next. Since none of the fields except the sixth should contain spaces, we limit the split to six fields, so a space in your mount point stays inside the last field. Yech.
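For context, a rough sketch of the loop this snippet assumes might look like the following. The `%avail_for` hash and the in-memory `@lines` array are illustrative names, not part of the original code; the two device lines are taken from the sample data in the question.

```perl
use strict;
use warnings;

# A slice of one hourly record, split into lines for the example.
my @lines = split /\n/, <<'END';
Sat Feb 3 12:01:01 EST 2001
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda7             904M  261M  598M  30% /
/dev/hdb6            1008M  209M  748M  22% /home/httpd
END

my %avail_for;    # hypothetical: available space keyed by mount point
for my $element (@lines) {
    next unless $element =~ m!^/!;    # skip the date and header lines
    my ($ptid, $total, $used, $avail, $pct, $mp) = split ' ', $element, 6;
    $avail_for{$mp} = $avail;
}

print "$_ => $avail_for{$_}\n" for sort keys %avail_for;
```

One small difference from `split /\s+/`: the special `' '` pattern also skips leading whitespace, so it never produces an empty first field. That does not matter for lines that begin with `/`, but it is a safer default.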
Re: Getting rid of (.*) from a not-quite-complex regex. by dws (Chancellor) on Feb 03, 2001 at 23:45 UTC
If you want a bit more precision than `split`, consider using `unpack`. The psgrep example on page 37 of Perl Cookbook is right on point for what you're doing. That example wraps a number of tricks into a small script, and is well worth the time to study.
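A minimal `unpack` sketch, under a loudly stated assumption: the column widths below (a 20-column device name, then right-justified columns separated by one space) are inferred from the sample data in the question, not from any df specification. The lines are built with `sprintf` so the layout is explicit, and the long device name is hypothetical. It also illustrates the Anonymous Monk's caution above about fixed-width columns.

```perl
use strict;
use warnings;

# Assumed layout, inferred from the sample data only.
my $layout   = '%-20s %5s %5s %5s %4s %s';
my $template = 'A20 x A5 x A5 x A5 x A4 x A*';

sub parse_fixed {
    my ($line) = @_;
    my @fields = unpack $template, $line;
    s/^\s+// for @fields;    # "A" strips trailing spaces only, not leading ones
    return @fields;
}

my $normal = sprintf $layout, '/dev/hda7', '904M', '261M', '598M', '30%', '/';
my @ok     = parse_fixed($normal);

# A hypothetical long device name pushes every later column to the right.
my $long_dev = '/dev/mapper/vg00-lvol_home';
my $shifted  = sprintf $layout, $long_dev, '9.8G', '9.6G', '256M', '97%', '/home';
my @broken   = parse_fixed($shifted);
my @by_split = split ' ', $shifted, 6;

print join('|', @ok), "\n";
print "unpack saw device: $broken[0]\n";
print "split saw device:  $by_split[0]\n";
```

Note that this template keeps the `%` sign in the percentage field, unlike the regex in the question, so it would need stripping before any numeric comparison.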
Re: Getting rid of (.*) from a not-quite-complex regex. by chipmunk (Parson) on Feb 04, 2001 at 22:43 UTC
You say you don't want to replace the .* at the end with \S+, because the last column could contain arbitrary characters. However, that's not a problem for the first column, so you can replace the .+:

    my ($ptid, $total, $used, $avail, $pct, $mp) = $element =~
      m!^(/dev/\S+)           # which partition $ptid
          \s+([\d.]+[MGK])    # total size of partition $total
          \s+([\d.]+[MGK])    # used space $used
          \s+([\d.]+[MGK])    # available space $avail
          \s+(\d{1,3})%       # percent usage $pct
          \s+(.*)$!x;         # mounting point $mp

This will save the regex engine from having to match to the end of the line and then backtrack all the way back to the first column. I also changed 0-9 to \d in the character classes, and changed \d{2} to \d{1,3}, assuming you don't really want to ignore devices that are 100% full or less than 10% full.

BTW, your regex isn't compatible across Unix platforms:

    Filesystem         Type    blocks       use     avail  %use  Mounted on
    /dev/root           xfs  16718720  11050536   5668184    67  /
    /dev/dsk/dks1d6s0   xfs   8758624   1968528   6790096    23  /usr/darmok
    AFS                 afs  14400000         0  14400000     0  /afs

Anyway, I think tilly's and lemming's suggestion of using split with a limit is the best approach.
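A quick way to check the \d{2} versus \d{1,3} point is to run both patterns over a few lines. This is only a sketch: the "full" and "nearly empty" rows are invented for the test, and the `captures` helper is an illustrative name.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $original = qr!^(/dev/.+[0-9]+)
    \s+([0-9.]+[MGK])
    \s+([0-9.]+[MGK])
    \s+([0-9.]+[MGK])
    \s+(\d{2})%
    \s+(.*)$!x;

# The revised pattern from this reply.
my $revised = qr!^(/dev/\S+)
    \s+([\d.]+[MGK])
    \s+([\d.]+[MGK])
    \s+([\d.]+[MGK])
    \s+(\d{1,3})%
    \s+(.*)$!x;

# Returns the captured fields, or an empty list when the line does not match.
sub captures {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields;
}

my %line = (
    typical => '/dev/hda7             904M  261M  598M  30% /',            # from the sample data
    full    => '/dev/hdc1             100M  100M    0K 100% /full',        # made up
    empty   => '/dev/hdc2             100M    5M   95M   5% /mnt/my disk', # made up
);

for my $name (sort keys %line) {
    my @old = captures($original, $line{$name});
    my @new = captures($revised,  $line{$name});
    printf "%-8s original: %d fields, revised: %d fields\n",
        $name, scalar @old, scalar @new;
}
```

On the typical line both patterns capture the same six fields; only the revised one matches the rows with a one-digit or three-digit percentage.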
Re: Getting rid of (.*) from a not-quite-complex regex. by jeroenes (Priest) on Feb 04, 2001 at 14:55 UTC
Another way to go would be to throw it in a 3D array:

    use SuperSplit;
    $array = supersplit('\s+', '\n', '\n----\n', \*DATA);

You'll have to handle `$array->[$n][0]` and `$array->[$n][1]` differently, because they contain the date and the header line. Since you already know that every new ---- means another hour, you could discard the descriptive lines:

    my $str = '';
    while ( <DATA> ) {
        $str .= $_ if m|^[/-]|;    # keep only device lines and ---- separators
    }
    $array = supersplit('\s+', '\n', '\n----\n', $str);

You could print the available space from the second filesystem for every hour like this:

    for (@$array) {
        print $_->[1][3] . "\n";   # remember, indices start at zero
    }

The supersplit module can be found here. Hope this helps,
Jeroen
"We are not alone"(FZ)*
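If SuperSplit is not at hand, the same three-level structure can be sketched with nested core `split` calls. This is not how SuperSplit is implemented, just a plain-Perl approximation of the idea; the second hourly record (timestamp and sizes) is made up so that there are two records to iterate over.

```perl
use strict;
use warnings;

my $log = <<'END';
----
Sat Feb 3 12:01:01 EST 2001
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda7             904M  261M  598M  30% /
/dev/hda12            852M  378M  474M  44% /devel
----
Sat Feb 3 13:01:01 EST 2001
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda7             904M  262M  597M  30% /
/dev/hda12            852M  380M  472M  45% /devel
END

# $records[$hour][$row][$column], with date and header lines discarded.
my @records;
for my $chunk (split /^----\n/m, $log) {
    my @rows = map  { [ split ' ', $_, 6 ] }
               grep { m!^/! }
               split /\n/, $chunk;
    push @records, \@rows if @rows;    # the text before the first ---- is empty
}

# Available space (column 3) of the second filesystem (row 1), per hour.
my @avail_second = map { $_->[1][3] } @records;
print "$_\n" for @avail_second;
```

Holding every record in memory is fine for a sketch, but for dozens of megabytes of log a line-by-line loop with `split` is likely the lighter option.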