parsing variable input (perlre problem)

jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks
Below I try to parse the output of the ps command. Because Unix and Linux use different time formats, I need to build some flexibility into my parsing expression. Here is the test code I use:

#! /usr/bin/perl -lw

use strict ;
use warnings ;

my @i ;
#        usern      pid     ?  ? startt   ?          ?     command
$i[0] = "wwwrun   17275 10449  0 2006     ?      00:00:00 /usr/sb...";
$i[1] = "root      3826     1  0 Jan08    ?      00:00:00 su -" ;
$i[2] = "root      3826     1  0 Jan 08   ?      00:00:00 su -" ;
$i[3] = "root      3547     1  2 06:49    ?      00:11:56 zmd /us...";
$i[4] = "root      3547     1  2 06:49:12 pts/1  00:11:56 zmd /us...";

my $usern ;
my $pid ;
my $time ;
my $command ;
foreach ( @i ) { 
  ($usern, $pid, $time, $command) = ( $_ =~ /^
            (\w+)            # capture username
            \s+
            (\d+)            # capture PID
            \s+\d+\s+\d+\s+
            (?:              # cluster (not capturing)
              (\d{4})              # capture %Y
              |                    # or
              (\d{2}:\d{2})        # capture %H:%M
              |                    # or
              (\d{2}:\d{2}:\d{2})  # capture %H:%M:%S
              |                    # or
              (\w{3}\d{2})         # capture %b%d
              |                    # or
              (\w{3}\s+\d{2})      # capture %b %d
            )
            \s+\S+\s+\S+\s+  # skip 2 columns after the 5th column
            (.*)             # capture the command
    $/gx ) ;
    printf "usern=%s pid=%s, time=%s command=%s\n",
        ($usern || ""), ($pid || ""), 
        ($time || ""), ($command || "")  ;
}

The output is:

usern=wwwrun pid=17275, time=2006 command=
usern=root pid=3826, time= command=
usern=root pid=3826, time= command=
usern=root pid=3547, time= command=06:49
usern=root pid=3547, time= command=

I think I am doing something fundamentally wrong when parsing the date/time column.
Any suggestions?

Thnx
LuCa

This question is the continuation of "how many instances are running ?".


Re: parsing variable input (perlre problem) by davorg (Chancellor) on Mar 19, 2007 at 15:24 UTC
Have you considered using Proc::ProcessTable instead of doing it yourself?

Update: but for a clue as to what you are doing wrong, try capturing _all_ of the matches in an array and printing that.

foreach ( @i ) {
  my @proc = /^
            (\w+)            # capture username
            \s+
            (\d+)            # capture PID
            \s+\d+\s+\d+\s+
            (?:              # cluster (not capturing)
              (\d{4})              # capture %Y
              |                    # or
              (\d{2}:\d{2})        # capture %H:%M
              |                    # or
              (\d{2}:\d{2}:\d{2})  # capture %H:%M:%S
              |                    # or
              (\w{3}\d{2})         # capture %b%d
              |                    # or
              (\w{3}\s+\d{2})      # capture %b %d
            )
            \s+\S+\s+\S+\s+  # skip 2 columns after the 5th column
            (.*)             # capture the command
    $/gx;

  print join ' / ', @proc;
  print "\n";
}

--
<http://dave.org.uk>

"The first rule of Perl club is you do not talk about Perl club."
-- Chip Salzenberg
Re: parsing variable input (perlre problem) by johngg (Canon) on Mar 19, 2007 at 15:33 UTC
Two things about your regex spring to mind. Firstly, you want to look for `%H:%M:%S` before `%H:%M` because otherwise the `%H:%M` test will grab all of them. Secondly, you could capture `%b%d` and `%b %d` at the same time by doing `(\w{3}\s*\d\d)`.

Cheers,
JohnGG

Update: The tests for `%H:%M:%S` and `%H:%M` could be combined, `(\d\d:\d\d(?::\d\d)?)`.
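To make the two suggestions above concrete, here is a minimal sketch that folds the time and date alternatives together inside one capturing group. The helper name `parse_ps_line` is made up for illustration, and the `\s*` between month and day assumes that optional whitespace was intended, so that both `Jan08` and `Jan 08` match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: returns
# (user, pid, start time, command) or an empty list if the line
# does not match.
sub parse_ps_line {
    my ($line) = @_;
    my ( $user, $pid, $stime, $command ) = $line =~ /^
        (\w+)                      # username
        \s+ (\d+)                  # PID
        \s+ \d+ \s+ \d+ \s+        # skip the next two numeric columns
        (                          # start time, one of:
            \d{4}                  #   %Y
          | \d\d:\d\d(?::\d\d)?    #   %H:%M or %H:%M:%S
          | \w{3}\s*\d\d           #   %b%d or %b %d
        )
        \s+ \S+ \s+ \S+ \s+        # skip the two columns after the start time
        (.*)                       # command
    $/x;
    return defined $user ? ( $user, $pid, $stime, $command ) : ();
}

# Sample lines copied from the original question.
my @samples = (
    'wwwrun   17275 10449  0 2006     ?      00:00:00 /usr/sb...',
    'root      3826     1  0 Jan08    ?      00:00:00 su -',
    'root      3826     1  0 Jan 08   ?      00:00:00 su -',
    'root      3547     1  2 06:49    ?      00:11:56 zmd /us...',
    'root      3547     1  2 06:49:12 pts/1  00:11:56 zmd /us...',
);

for my $line (@samples) {
    my @fields = parse_ps_line($line);
    print join( ' | ', @fields ), "\n";
}
```

Because the whole start time sits in a single group, it always comes back as the third element of the list, whichever alternative matched.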
Re: parsing variable input (perlre problem) by Moron (Curate) on Mar 19, 2007 at 16:35 UTC
Assuming there is good reason not to use the module suggested earlier...

The thing that jumps out at me is the fact that ps output is fixed field format whereas the regexps are matching anywhere on the line of ps output. Therefore it is going to be easier and more reliable to extract fields using substr and only then match their contents against your expressions. For example:

my $flds = [
    { name => 'UID', start => 0, length => 8 },
    { name => 'PID', start => 8, length => 6 },
    # etc.
    # matches the ps header line in this e.g.
];
my $pid = open my $ph, "ps -ef |" or die $!;
my @hdr = split /\s+/, <$ph>;    # consume the header line
while ( <$ph> ) {
    my %line;
    for my $fld ( @$flds ) {
        $line{ $fld->{name} } = substr( $_, $fld->{start}, $fld->{length} );
    }
    # and then match $line{STIME} against regexps.
}
waitpid $pid, 0;
close $ph;

-M

Free your mind
Re^2: parsing variable input (perlre problem) by davorg (Chancellor) on Mar 19, 2007 at 16:50 UTC
"ps output is fixed field format"

Well yes... until something unforeseen happens and one of the fields gets too wide. At which point the substr (or unpack - which might be more efficient) method breaks horribly :-/

--
<http://dave.org.uk>

"The first rule of Perl club is you do not talk about Perl club."
-- Chip Salzenberg
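The failure mode described above can be sketched with unpack. The template `A8 A6` (8 characters of user name, then 6 of PID) and the helper name `user_and_pid` are assumptions for illustration only, and the second sample line is invented to show a user name that overflows its column.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fixed-width template covering only the first two columns.
sub user_and_pid {
    my ($line) = @_;
    my ( $user, $pid ) = unpack 'A8 A6', $line;
    $pid =~ s/^\s+//;    # 'A' trims trailing spaces, not leading ones
    return ( $user, $pid );
}

my @lines = (
    'root      3826     1  0 Jan08    ?      00:00:00 su -',        # from the question
    'verylonguser 3826     1  0 Jan08    ?      00:00:00 su -',     # made-up overflow
);

for my $line (@lines) {
    my ( $user, $pid ) = user_and_pid($line);
    print "user=$user pid=$pid\n";
}
```

The first line yields user `root` and pid `3826`. On the second line the fixed offsets cut through the middle of the user name, so the "pid" comes back as `user 3`.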
Re^3: parsing variable input (perlre problem) by Moron (Curate) on Mar 19, 2007 at 16:59 UTC
In that case just read the field positions and widths off the header: if a field "goes wide", the header changes to match, and the header is otherwise predictable. Some columns are aligned left and some right, but still in a predictable way.

-M

Free your mind
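A rough sketch of reading positions off the header follows. It only handles the simple case of a left-aligned column followed by another left-aligned column (assumed here for STIME and TTY). The `$layout` format string, the helper names and the sample rows are all hypothetical; the rows are built with sprintf so the columns line up, and they stand in for real ps output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout, used only to build aligned sample lines.
my $layout = '%-8s %5s %5s %2s %-8s %-8s %8s %s';

# Map each header word to the offset at which it starts.
sub header_offsets {
    my ($header) = @_;
    my %start;
    while ( $header =~ /(\S+)/g ) {
        $start{$1} = $-[1];    # start offset of the word just matched
    }
    return %start;
}

# Cut a left-aligned column that runs from its own header offset up to
# the offset of the next header, then trim the padding.
sub left_aligned_field {
    my ( $line, $from, $to ) = @_;
    my $field = substr $line, $from, $to - $from;
    $field =~ s/\s+$//;
    return $field;
}

my $header = sprintf( $layout, qw(UID PID PPID C STIME TTY TIME CMD) );
my @rows   = (
    sprintf( $layout, 'root', 3826, 1, 0, 'Jan 08',   '?',     '00:00:00', 'su -' ),
    sprintf( $layout, 'root', 3547, 1, 2, '06:49:12', 'pts/1', '00:11:56', 'zmd' ),
);

my %start = header_offsets($header);
for my $row (@rows) {
    my $stime = left_aligned_field( $row, $start{STIME}, $start{TTY} );
    print "STIME=$stime\n";
}
```

Right-aligned columns such as PID would need the opposite rule (the field ends where the header word ends), which this sketch does not attempt.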
Re: parsing variable input (perlre problem) by jeanluca (Deacon) on Mar 20, 2007 at 12:33 UTC
Thnx for the help!

I think that Proc::ProcessTable, suggested by davorg, is the right choice here; it seems to do all the complex work for me! But anyway, I'm still very curious about what I am doing wrong with the regular expression. So I changed it a little according to johngg's suggestions and then used an array to capture all the output. This is the result (my comments included):

INPUT: root 3547 1 2 06:49:12 pts/1 00:11:56 zmd /us......
array0: root
array1: 3547
array2:            # no match on (\d{4})
array3: 06:49:12
array4:            # no match on (\d{2}:\d{2})
array5:            # no match on (\w{3}\s*\d{2})
array6: zmd /us......

OK, this result shows that a capture group returns 'undef' when it does not take part in the match. Is there something that can be done here (so I can use my previous example), or should I do it this way?

Thx a lot
LuCa
Re^2: parsing variable input (perlre problem) by davorg (Chancellor) on Mar 20, 2007 at 13:13 UTC
Sorry, I thought my clues would be enough for you to work it out. I'll be clearer.

You have a regex that contains a number of capturing brackets. Each of those sets of brackets will set an element in the list that is returned. As you've seen, any capturing brackets that don't match return undef. So this code:

my @i ;
#        usern      pid     ?  ? startt   ?          ?     command
$i[0] = "wwwrun   17275 10449  0 2006     ?      00:00:00 /usr/sb...";
$i[1] = "root      3826     1  0 Jan08    ?      00:00:00 su -" ;
$i[2] = "root      3826     1  0 Jan 08   ?      00:00:00 su -" ;
$i[3] = "root      3547     1  2 06:49    ?      00:11:56 zmd /us...";
$i[4] = "root      3547     1  2 06:49:12 pts/1  00:11:56 zmd /us...";

foreach ( @i ) {
  my @proc = /^
            (\w+)            # capture username
            \s+
            (\d+)            # capture PID
            \s+\d+\s+\d+\s+
            (?:              # cluster (not capturing)
              (\d{4})              # capture %Y
              |                    # or
              (\d{2}:\d{2})        # capture %H:%M
              |                    # or
              (\d{2}:\d{2}:\d{2})  # capture %H:%M:%S
              |                    # or
              (\w{3}\d{2})         # capture %b%d
              |                    # or
              (\w{3}\s+\d{2})      # capture %b %d
            )
            \s+\S+\s+\S+\s+  # skip 2 columns after the 5th column
            (.*)             # capture the command
    $/gx;

  print join ' | ', map { defined() ? $_ : 'undef' } @proc;
  print "\n";
}

Gives the following output:

wwwrun | 17275 | 2006 | undef | undef | undef | undef | /usr/sb...
root | 3826 | undef | undef | undef | Jan08 | undef | su -
root | 3826 | undef | undef | undef | undef | Jan 08 | su -
root | 3547 | undef | 06:49 | undef | undef | undef | zmd /us...
root | 3547 | undef | undef | 06:49:12 | undef | undef | zmd /us...

So your problem is that the datetime value can appear in any one of several columns in your output, depending on which part of the regex it matches.

Putting it even more simply, you have too many capturing brackets. Why not remove all of the nested brackets that match the different types of datetime and replace your outer (non-capturing) brackets with one set of capturing brackets? That way, whichever alternative is matched, it will always populate the same column in the output.

foreach ( @i ) {
  my @proc = /^
            (\w+)            # capture username
            \s+
            (\d+)            # capture PID
            \s+\d+\s+\d+\s+
            (                # one capturing group for the datetime
              \d{4}                # %Y
              |                    # or
              \d{2}:\d{2}          # %H:%M
              |                    # or
              \d{2}:\d{2}:\d{2}    # %H:%M:%S
              |                    # or
              \w{3}\d{2}           # %b%d
              |                    # or
              \w{3}\s+\d{2}        # %b %d
            )
            \s+\S+\s+\S+\s+  # skip 2 columns after the 5th column
            (.*)             # capture the command
    $/gx;

  print join ' | ', @proc;
  print "\n";
}

Which produces the following output:

wwwrun | 17275 | 2006 | /usr/sb...
root | 3826 | Jan08 | su -
root | 3826 | Jan 08 | su -
root | 3547 | 06:49 | zmd /us...
root | 3547 | 06:49:12 | zmd /us...

With the datetime column always appearing in the same place.

--
See the Copyright notice on my home node.

"The first rule of Perl club is you do not talk about Perl club."
-- Chip Salzenberg
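As an alternative to collapsing the nested brackets into one group, Perl 5.10 and later offer a branch reset group, `(?|...)`, in which every alternative numbers its captures from the same starting point. The sketch below is not part of the replies above; the helper name `parse_with_branch_reset` is hypothetical, and the alternatives are ordered with `%H:%M:%S` first.

```perl
#!/usr/bin/perl
use 5.010;    # branch reset groups need Perl 5.10 or later
use strict;
use warnings;

# Hypothetical helper: the nested capture groups are kept, but inside
# the branch reset each alternative fills the same group (number 3).
sub parse_with_branch_reset {
    my ($line) = @_;
    my @fields = $line =~ /^
        (\w+)                      # group 1: username
        \s+ (\d+)                  # group 2: PID
        \s+ \d+ \s+ \d+ \s+
        (?|                        # branch reset
            (\d{4})                #   %Y
          | (\d{2}:\d{2}:\d{2})    #   %H:%M:%S
          | (\d{2}:\d{2})          #   %H:%M
          | (\w{3}\s*\d{2})        #   %b%d or %b %d
        )
        \s+ \S+ \s+ \S+ \s+
        (.*)                       # group 4: command
    $/x;
    return @fields;
}

# Sample lines copied from the original question.
my @samples = (
    'wwwrun   17275 10449  0 2006     ?      00:00:00 /usr/sb...',
    'root      3826     1  0 Jan 08   ?      00:00:00 su -',
    'root      3547     1  2 06:49:12 pts/1  00:11:56 zmd /us...',
);

for my $line (@samples) {
    print join( ' | ', parse_with_branch_reset($line) ), "\n";
}
```

With this form the match returns four elements per line and no undef placeholders, so the original four-variable assignment from the question would line up with the captures.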
Re^3: parsing variable input (perlre problem) by jeanluca (Deacon) on Mar 20, 2007 at 13:40 UTC
It all makes sense now!

Thanks a lot
LuCa