cyberconte has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am requesting the guidance of the regexp specialists with this one. It's a fairly specific problem, so the solution may be easy. I am writing an indexer and need to parse the output of a "recurse; dir" from smbclient. (I know there's a module, but unfortunately it doesn't suit my needs, so I'm doing it manually.)
What I'm trying to do is break it up into filename (with extension), extension, attributes, size, and date (for later insertion into a database). The function I've been using to benchmark looks like this:
sub parse{
    my $path='\\';
    my $computer = "LAIN";
    foreach(@_) {
        # skip if it doesn't start with either a space or backslash
        next if (!/^[ \\]/);
        # check if path listing (starts with \) and if so, get the path
        if ($_=~/^(\\[\w|\s|\\]*\w+)/) {
            $path = $1.'\\';
        } else {
            if (/^\s*(.*\S)\s{5,}([HDRSA]+)\s*(\d+)\s*(.*)/) {
                # dont include . or .. directories
                unless (($1 eq '.') or ($1 eq '..')) {
                    my ($file,$att,$size,$date,$ext)=($1,$2,$3,$4);
                    if ($file =~ /\.(.*)/) { $ext=$1; } else { $ext="DIR";}
                    print "{$computer\:$path$file, $ext, $att, $size, $date }\n";
                }
            }
        }
    }
}
Where @_ is just a line-by-line list of what smbclient returns.
I'm trying to reduce everything in the else to a single regexp and a print statement. The problem is, it needs to *not* match "." or ".." entries, while still matching directories and files without extensions.
I'm not doing it for any good reason other than trying to learn regexps a bit more (I'm not too good at them just yet).
Any suggestions?

Edit kudra, 2002-04-19 Changed title per ntc request

Meta aside
by Fletch (Bishop) on Apr 07, 2002 at 04:05 UTC
    I'm trying to reduce everything in the else to a single regexp ...

    This probably isn't the best idea, especially since you yourself say that you're not great with regexen. It may be easier to break things into separate chunks and parse the chunks with separate regexen. It'll probably also lead to a somewhat clearer program, since you're not cramming everything into one magic expression.

    If you do do one humongous regex, make sure to use the /x modifier and comment it so you'll understand it when you look at it 4.73 months from now.
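A minimal sketch of what a commented /x version could look like, using the file-line pattern from the original post unchanged. The sample line is made up for illustration; real smbclient output may be laid out differently.

```perl
use strict;
use warnings;

# Made-up sample line (an assumption, not captured smbclient output).
my $line = "  readme.txt                         A     1234  Sat Apr  6 12:00:00 2002";

my ($file, $att, $size, $date) = $line =~ m{
    ^ \s*
    (.*\S)        # file name, ending at the last non-space before the gap
    \s{5,}        # at least five spaces separate the name from the attributes
    ([HDRSA]+)    # attribute letters
    \s*
    (\d+)         # size
    \s*
    (.*)          # the rest of the line is taken as the date
}x;

print "$file | $att | $size | $date\n" if defined $file;
```

With /x, unescaped whitespace and `#` comments inside the pattern are ignored, so each piece can be laid out and annotated on its own line.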

Re: Happy fun regexping
by graff (Chancellor) on Apr 07, 2002 at 06:38 UTC
    it needs to *not* match "." or ".." entries

    How about getting rid of these just like you do with the other useless lines:

    ...
    foreach(@_) {
        # skip if it doesn't start with either a space or backslash
        next if (!/^[ \\]/);
        # skip if it's just "." or ".."
        next if (/^\s*\.{1,2}\s/);
        if (/^\\/) {
            $path = "$_\\"
        }
        elsif (/^\s*(.*\S)\s{5,}([HDRSA]+)\s*(\d+)\s*(.*)/) {
            my ($file,$att,$size,$date) = ($1,$2,$3,$4);
            my $ext = ( $file =~ /\.([^.]+)$/ ) ? $1 : "DIR";
            print ...;
        }
    }

    That last bit about setting $ext follows your assumption that if there's no dot in the name, it must be a directory (but I think this is not a reliable assumption). Note that file names may contain multiple periods, and I think you want $ext to hold just the characters after the last one (in your original version, a file name like "rel_3.1.tar.gz" would set $ext to "1.tar.gz").
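To make the difference between the two extension patterns concrete, here is a small sketch that runs both against the file name mentioned above. The variable names and the second, dot-free name are only for illustration.

```perl
use strict;
use warnings;

my $name = "rel_3.1.tar.gz";

# Original pattern: captures everything after the FIRST dot.
my ($after_first_dot) = $name =~ /\.(.*)/;

# Revised pattern: captures only what follows the LAST dot.
my ($after_last_dot) = $name =~ /\.([^.]+)$/;

print "first-dot version: $after_first_dot\n";   # 1.tar.gz
print "last-dot version:  $after_last_dot\n";    # gz

# A name without any dot falls through to the "DIR" default.
my $plain = "Documents";   # hypothetical name
my $ext = ( $plain =~ /\.([^.]+)$/ ) ? $1 : "DIR";
print "no dot: $ext\n";                          # DIR
```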

    Your expression for getting/setting $path was also a bit odd. The perlre man page says:

    Also remember that "|" is interpreted as a literal within square brackets, so if you write "[fee|fie|foe]" you're really only matching "[feio|]".
    And of course, directory names might include dash, period or other punctuation that wouldn't match \w. Your code made it seem like lines with initial slash would contain only a path name and nothing else, so I simplified on that basis (but I don't know if this assumption is correct).
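The two points about the path pattern can be checked with a short sketch. The directory name below is made up; it only serves to show where the original pattern stops.

```perl
use strict;
use warnings;

# Inside a character class, "|" is just another literal character,
# so [\w|\s|\\] also accepts a pipe.
my $pipe_ok = ( "a|b" =~ /^[\w|\s|\\]+$/ ) ? 1 : 0;
print "class matches a literal pipe: $pipe_ok\n";

# Hypothetical path line containing a dash, which is neither \w nor \s.
my $path_line = '\\My-Files';   # the string is \My-Files
my ($captured) = $path_line =~ /^(\\[\w|\s|\\]*\w+)/;
print "captured: $captured\n";  # stops before the dash
```

Here the original pattern captures only `\My`, so everything from the dash onward would be lost from `$path`.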
      Well, taking into account the many suggestions I've received from both here and from fellow programmers (thank you to everyone who's responded), I've done it! This is what I've come up with:
      sub adj3_gi {
          my $path='\\';
          my $computer = "LAIN";
          foreach(@_) {
              # skip if it doesn't start with either a space or backslash, or if it starts with " ." or " .."
              next if ((/^[^ \\]/) || (/^ {2}\.{1,2}\s/));
              # process path if first char is '\'
              if (/^\\/) {
                  chop;
                  $path = "$_\\";
              }
              # break apart returned directory and file info
              elsif (/^ {2}(.*?(\.([^\.]+?))?) {5} *([HDRSA]*) +(\d+) {2}(.*)/gi) {
                  #print "{$computer\:$path$1, ". (defined $3 ? $3 : "").", $4, $5, $6 }\n";
              }
          }
      }
      Putting everything in that one regexp made everything much faster. However, there's one anomaly that I don't quite understand. I played a little with the options at the end of the regexp, mainly "g" and "i".
      with g:    4 wallclock secs ( 3.91 usr + 0.00 sys = 3.91 CPU)
      with gi:   4 wallclock secs ( 3.64 usr + 0.00 sys = 3.64 CPU)
      with i:    7 wallclock secs ( 6.40 usr + 0.01 sys = 6.41 CPU)
      with none: 6 wallclock secs ( 6.77 usr + 0.00 sys = 6.77 CPU)
      I ran this several times, and the results were all similar. Now I could *possibly* understand the g making things faster. But the i? I was always under the impression (from both professors and fellow coders) that the /i would make things slower. No one I asked can explain it. Or is this more regexp voodoo? ^_^
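One behaviour of /g that may be worth ruling out when comparing timings: in scalar context, a successful `m//g` match leaves a position (`pos`) on the matched string, and the next /g match against that same string resumes from there. Whether this plays any part in the numbers above cannot be told from the post; the sketch below only demonstrates the mechanism, with a made-up line and a simplified pattern.

```perl
use strict;
use warnings;

my @lines = ("  notes.txt");   # made-up listing line
my @results;

# Run the same /g match over the same array several times,
# the way a benchmark loop might.
for my $pass (1 .. 4) {
    for (@lines) {
        # $_ is aliased to the array element, so pos() persists between passes.
        # After a success, the next attempt starts past the match, where ^ cannot match.
        push @results, ( /^ {2}(\S+)/g ? "match" : "no match" );
    }
}

print join(", ", @results), "\n";
```

A failed /g match resets the position, which is why the results alternate. Without /g, every pass matches.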

        Do not attempt to remove the . and .. entries with a regexp, because you will no doubt get it wrong, and cheaper alternatives exist. A reasonable way to get rid of them is:

          next if $_ eq '.' or $_ eq '..'

        An even more reasonable way is to isolate yourself from cross-platform differences by using File::Spec.

          next if $_ eq File::Spec->curdir() or $_ eq File::Spec->updir()

        See Re: Is readdir ever deterministic? for the canonical question and answer on the subject.
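A minimal, self-contained sketch of the File::Spec approach, applied to a list of bare entry names. The names are made up, and this assumes each element is just the entry name (as returned by readdir), not a whole listing line with attributes and dates.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical entry names, for illustration only.
my @names = ('.', '..', 'docs', 'a.txt', '.profile');

# Drop the current-directory and parent-directory entries by exact comparison.
my @kept = grep {
    $_ ne File::Spec->curdir() && $_ ne File::Spec->updir()
} @names;

print "$_\n" for @kept;
```

Because the comparison is exact, a dotfile such as `.profile` is kept, which a loosely written pattern could easily throw away.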


        print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'