bowei_99 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse input like this:
hostname: / level=incr, 1015 MB 00:00:54 26137 files
hostname: /boot level=incr, 3731 KB 00:00:13 8 files
hostname: /directory1 level=incr, 2796 MB 00:01:28 71 files
hostname: /directory2 level=incr, 1369 MB 00:53:51 36 files
--->truncated here<--
With this code:
use constant HOST_LINE => qr{
    \s+ \w+: \s [\/\w+] \s+     # host, partition
    level=\w+, \s+              # backup level
    (\d+) \s+ (\w+) \s+         # amt backed up
    \d+:\d+:\d+ \s+             # Time
    (\d+) \s+ files             # num. files
}xms ;

my @lines = <STDIN>;
for my $line (@lines) {
    if ( $line =~ HOST_LINE ) {
        my ($files, $units, $amt_backed_up) = ($3, $2, $1);
        print "line - $line, units - $units, amt - $amt_backed_up, files $files\n";
    }
    else {
        print "here -> $line"
    }
}
It catches the first line OK and prints the captured values. However, every line after that fails to match, i.e. it hits the 'else' and prints "here ->...". I've checked the input file for special characters and found none. In fact, I don't see anything that differentiates the format of the lines, so I don't know why only the first line is parsed correctly.

Thoughts, anyone?

-- Burvil

Re: Regexp parses only first line correctly
by NetWallah (Canon) on Aug 16, 2008 at 19:03 UTC
    The first line of your regex uses square brackets, which indicate a character class.

    Keep the class, move the "+" outside:

    \s+ \w+: \s [\/\w]+ \s+
    In your original regex, the first line matched because it had a one-character directory name that satisfied [\/\w+]. There the "+" sits inside the character class, where it is a literal plus sign rather than a repetition quantifier, so the class matches exactly one character.
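To see the difference in isolation, here is a minimal sketch that tests only the partition part of the pattern against a few strings. The sub names and the sample strings are made up for illustration; the two patterns are the ones quoted above, anchored with \A and \z so that the whole string has to match.

```perl
use strict;
use warnings;

# OP's partition pattern: '+' sits inside the class, so it is a literal
# plus sign and the class matches exactly one character.
sub matches_original  { my ($s) = @_; return $s =~ m{ \A [\/\w+] \z }x ? 1 : 0 }

# Corrected pattern: '+' follows the class and means "one or more".
sub matches_corrected { my ($s) = @_; return $s =~ m{ \A [\/\w]+ \z }x ? 1 : 0 }

for my $partition ( '/', '/boot', '+' ) {
    printf "%-6s original=%d corrected=%d\n",
        $partition, matches_original($partition), matches_corrected($partition);
}
```

Only the single-character strings satisfy the original class, which is why "/" parsed and "/boot" did not. The corrected pattern accepts "/" and "/boot" but no longer accepts a bare "+".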

         Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax

Re: Regexp parses only first line correctly
by broomduster (Priest) on Aug 16, 2008 at 19:10 UTC
    The 'partition' part of a line should be matched by
    \/\w*
    (no character class needed, and note '*' quantifier). Change the first line of your regex to
    \s+ \w+: \s \/\w* \s+ # host, partition
    and it will work (at least it does here).

    Updated: added bit to note change in quantifier relative to OP.

      I had that initial suggestion as well, but withdrew it because it will fail if the line contains a path with more than one slash, such as:
      hostname: /dir3/dir4 level=incr, 1449 MB 00:56:31 46 files
      Keeping the character class but correcting the repetition will work in this case as well.
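The two suggestions can be compared side by side with a small sketch like the one below. The helper names host_line_re and parse_line are hypothetical, and the sample lines are given one leading space on the assumption that the real input is indented (the OP's pattern starts with \s+, so it needs whitespace before the hostname).

```perl
use strict;
use warnings;

# Build the OP's line pattern around a given partition sub-pattern.
sub host_line_re {
    my ($partition) = @_;
    return qr{
        \s+ \w+: \s $partition \s+   # host, partition
        level=\w+, \s+               # backup level
        (\d+) \s+ (\w+) \s+          # amount backed up
        \d+:\d+:\d+ \s+              # time
        (\d+) \s+ files              # number of files
    }xms;
}

# Return (amount, units, files) on a match, or an empty list.
sub parse_line {
    my ($re, $line) = @_;
    return $line =~ $re ? ($1, $2, $3) : ();
}

my @candidates = (
    [ 'slash + word chars' => qr{ \/\w*   }x ],
    [ 'class, repeated'    => qr{ [\/\w]+ }x ],
);

my @lines = (
    ' hostname: /boot level=incr, 3731 KB 00:00:13 8 files',
    ' hostname: /dir3/dir4 level=incr, 1449 MB 00:56:31 46 files',
);

for my $line (@lines) {
    for my $candidate (@candidates) {
        my ($name, $partition) = @$candidate;
        my @fields = parse_line( host_line_re($partition), $line );
        printf "%-20s %s =>%s\n", $name,
            ( @fields ? "parsed (@fields)" : 'no match' ), $line;
    }
}
```

With these sample lines, the \/\w* version parses the /boot line but not the /dir3/dir4 line, while the repeated character class parses both.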

           Have you been high today? I see the nuns are gay! My brother yelled to me...I love you inside Ed - Benny Lava, by Buffalax

        Yes, I see that you (almost completely) changed your original reply, but with no indication of the changes.

        Even better might be

        \s+ \w+: \s /\w*(?:/\w+)* \s+ # host, partition
        Update: Getting a regex in there that matches all possible (legal) path specifications is more complicated than what I have here. Usual advisories about "know your data" apply.
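As a rough illustration of the "know your data" point, the sketch below checks the repeated character class and the stricter pattern above against a handful of made-up path strings. The sub names and the sample paths are assumptions for illustration only; neither pattern is meant as a complete path matcher.

```perl
use strict;
use warnings;

# Repeated character class: any run of slashes and word characters.
sub is_class_path      { my ($s) = @_; return $s =~ m{ \A [\/\w]+ \z }x       ? 1 : 0 }

# Stricter form: a leading slash, then slash-separated, non-empty components.
sub is_structured_path { my ($s) = @_; return $s =~ m{ \A /\w*(?:/\w+)* \z }x ? 1 : 0 }

for my $path ( '/', '/boot', '/dir3/dir4', 'boot', '/a//b', '/var/log-old' ) {
    printf "%-14s class=%d structured=%d\n",
        $path, is_class_path($path), is_structured_path($path);
}
```

Both patterns accept "/", "/boot" and "/dir3/dir4". The character class also accepts "boot" and "/a//b", which the stricter form rejects, and neither accepts "/var/log-old" because "-" is not a word character. Which behaviour is right depends on what the backup report can actually contain.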