Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear brethren,
I am fairly new to Perl and am having difficulty parsing a logfile that does NOT have a common delimiter. A sample from the logfile looks like this:

10/1/2003 2:06:32 AM|1|Checkout Started for US-02-14@comany.com (requested by aad6870)
10/2/2003 2:07:17 AM|1|Checkout Processed for US-02-14@company.com (requested by aad6870)
10/3/2003 2:09:37 AM|1|Checkin Processed for DN-US-02-14@company.com (requested by aad6870)
10/4/2003 9:37:53 AM|1|Checkout Started for DN-US-02-14@company.com (requested by heavis6608)
10/5/2003 9:38:29 PM|1|Checkout Processed for US-02-14@company.com (requested by heavis6608)
10/6/2003 10:10:21 AM|1|Checkout Started for US-02-17@company.com (requested by vm_karthik3521)

I need to parse out the date stamp, the activity 'Checkout Started', the machine 'US-02-14@company.com', and the username 'aad6870'. What I've started on is the following:
while (<LOGFILE1>) {
    # parse the date field with the '|' delimiter
    @dateField = split(/\|/);

    # parse the activity field by matching '|1|'word space word
    @activityField = split(/\|1\|\w\s\w/);

    # Start populating the fields as appropriate
    $date = $dateField[0];
    $activity = $activityField[0];

    ## Write the cleaned up data to the data file
    print DATAFILE "$date , ";
    print DATAFILE "$activity ,";
    print DATAFILE "\n";
}

My problem is that parsing the activity field returns the entire row. Can anyone tell me what is wrong with my regular expression for the activity field?
Am I heading in the right direction or have I been possessed by Perl gremlins...
Any help or suggestions greatly appreciated....

Replies are listed 'Best First'.
Re: Regular Expression help
by Roger (Parson) on Nov 11, 2003 at 23:39 UTC
    Perhaps you are looking for something like this instead? :)
    use strict;
    use Data::Dumper;

    while (<DATA>) {
        chomp;                    # remove trailing \n, optional
        my @rec = split /\|/;     # split records
        my $date = $rec[0];
        my ($activity, $machine, $requester) =
            $rec[2] =~ /(.*)\sfor\s(.*)\s\(requested by (.*)\)/;
        print "$date, $activity, $machine, $requester\n";
    }

    __DATA__
    10/1/2003 2:06:32 AM|1|Checkout Started for US-02-14@comany.com (requested by aad6870)
    10/2/2003 2:07:17 AM|1|Checkout Processed for US-02-14@company.com (requested by aad6870)
    10/3/2003 2:09:37 AM|1|Checkin Processed for DN-US-02-14@company.com (requested by aad6870)
    10/4/2003 9:37:53 AM|1|Checkout Started for DN-US-02-14@company.com (requested by heavis6608)
    10/5/2003 9:38:29 PM|1|Checkout Processed for US-02-14@company.com (requested by heavis6608)
    10/6/2003 10:10:21 AM|1|Checkout Started for US-02-17@company.com (requested by vm_karthik3521)
    And the output is -
    10/1/2003 2:06:32 AM, Checkout Started, US-02-14@comany.com, aad6870
    10/2/2003 2:07:17 AM, Checkout Processed, US-02-14@company.com, aad6870
    10/3/2003 2:09:37 AM, Checkin Processed, DN-US-02-14@company.com, aad6870
    10/4/2003 9:37:53 AM, Checkout Started, DN-US-02-14@company.com, heavis6608
    10/5/2003 9:38:29 PM, Checkout Processed, US-02-14@company.com, heavis6608
    10/6/2003 10:10:21 AM, Checkout Started, US-02-17@company.com, vm_karthik3521
    I think the only trick here is the my ($var) = $str =~ /(.*)/; idiom, which is a handy one to master. tachyon had a Perl Meditation on this topic not long ago (node 291543).
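To make the idiom concrete, here is a minimal sketch of a match used in list context versus scalar context. The helper name parse_message and the sample strings are assumptions for illustration, not part of Roger's code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three captured fields, or an empty
# list when the message does not have the expected shape.
sub parse_message {
    my ($msg) = @_;

    # In list context a successful match returns its captures ($1, $2, $3);
    # a failed match returns the empty list, leaving the variables undefined.
    my ($activity, $machine, $requester) =
        $msg =~ /^(.*)\sfor\s(.*)\s\(requested by (.*)\)$/;

    return defined $activity ? ($activity, $machine, $requester) : ();
}

my @fields = parse_message(
    'Checkout Started for US-02-14@company.com (requested by aad6870)');
print join(', ', @fields), "\n";

# In scalar context the same kind of match only reports success or failure.
my $matched = 'no separator here' =~ /(.*)\sfor\s(.*)/;
print $matched ? "matched\n" : "no match\n";
```

The parentheses around the variables on the left-hand side are what put the match in list context; without them the assignment would store only a true or false value.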

    Also you don't need to print the elements one line at a time, you can print them all at once.
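One way to print all the fields at once is to join them and issue a single print. This is only a sketch: format_record is a hypothetical helper, and an in-memory filehandle stands in for the DATAFILE handle from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one output line from any number of fields.
sub format_record {
    my @fields = @_;
    return join(', ', @fields) . "\n";
}

# Assumption: an in-memory filehandle replaces DATAFILE so the sketch
# runs without touching the disk.
open my $out, '>', \my $buffer
    or die "cannot open in-memory handle: $!";

# One print call writes the whole record.
print {$out} format_record('10/1/2003 2:06:32 AM', 'Checkout Started');
close $out;

print $buffer;
```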

      Hi Roger, Thanks for the help, this really deepens my understanding of regular expressions. I wonder could you tell me how I could parse it without stripping the date out first with the /\|/ split. In other words is there a way to get a record to parse without splitting?? Once again thanks for the help.
        Ok, you can change my previous code to -
        my ($activity, $machine, $requester) = /\|1\|(.*)\sfor\s(.*)\s\(requested by (.*)\)/;
        I have omitted the implicit $_ =~ part in the idiom. What the new code does is to look for the |1| pattern followed by the stuff you are looking for. Note that at this point, $_ holds the entire line.
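If the date is wanted from the same match, one more capture group at the front should do it. The sketch below is an assumption layered on Roger's pattern: parse_line is a hypothetical name, and it assumes the date never contains a '|' and that the middle field is numeric. It also shows how to notice lines that do not match.

```perl
use strict;
use warnings;

# Hypothetical helper; the pattern assumes the layout of the sample log.
sub parse_line {
    my ($line) = @_;

    my ($date, $activity, $machine, $requester) = $line =~
        /^([^|]+)\|\d+\|(.*)\sfor\s(.*)\s\(requested by (.*)\)/;

    return unless defined $date;    # empty list: the line did not match
    return ($date, $activity, $machine, $requester);
}

for my $line (
    '10/4/2003 9:37:53 AM|1|Checkout Started for DN-US-02-14@company.com (requested by heavis6608)',
    'a line in some other format',
) {
    my @fields = parse_line($line);
    if (@fields) {
        print join(', ', @fields), "\n";
    }
    else {
        print "skipped: $line\n";
    }
}
```

Checking for the empty list matters because a failed match would otherwise leave all four variables undefined and print warnings or blank fields.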

Re: Regular Expression help
by davido (Cardinal) on Nov 12, 2003 at 05:44 UTC
    This may seem a little funky, and it is somewhat dependent on what you want to allow as machine and user names. But this should give you something to work with.

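    my ( $date, $activity, $machine, $user ) = $_ =~ m/
        ^(.+(?:AM|PM))          # date and time stamp
        \|\d\|                  # numeric field between the pipes
        (.+)                    # activity
        \sfor\s
        ([\w\d-]+@[\w\d.-]+)    # machine
        \s.+\bby\s
        ([\w\d]+)               # user name
        [^\w\d]+$               # closing parenthesis
    /x;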

    I used the /x modifier to allow whitespace within the RE, so that I could group it in segments that each accomplish a different portion of the match. You probably ought to also have a look at perlre and perlretut, as well as perlfaq6. They will go a long way toward giving you a good comfort level with RE's.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Regular Expression help
by ysth (Canon) on Nov 11, 2003 at 23:34 UTC
    \w matches a word character, not a word. Repeat it like \w+
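A short sketch of what this means for the split in the question. The sample line is taken from the log above; the capturing parentheses in the second split are an addition for illustration, not something ysth proposed.

```perl
use strict;
use warnings;

my $line = '10/1/2003 2:06:32 AM|1|Checkout Started for US-02-14@company.com (requested by aad6870)';

# \w\s\w needs one word character, one whitespace character and one word
# character directly after |1|. That never occurs here, so split finds
# no separator and returns the whole line as a single element.
my @one = split /\|1\|\w\s\w/, $line;
print scalar(@one), " element(s)\n";

# With \w+ the pattern matches "|1|Checkout Started". split normally
# discards whatever the separator matched; capturing parentheses make
# it return the captured text as an extra element.
my @parts = split /\|1\|(\w+\s\w+)/, $line;
print "date: $parts[0]\n";
print "activity: $parts[1]\n";
```

This prints "1 element(s)", then the date and "Checkout Started". Note that even with \w+, element 0 of the split is the date rather than the activity, which is why the other replies capture the fields with a match instead.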