PyrexKidd has asked for the wisdom of the Perl Monks concerning the following question:

My vsftpd.log file has entries that look like this:

Sun Oct 24 10:10:29 2010 [pid 2] CONNECT: Client "192.168.0.0"
Sun Oct 24 10:10:29 2010 [pid 1] [user] OK LOGIN: Client "192.168.0.0"
Sun Oct 24 10:10:30 2010 [pid 3] [user] FAIL UPLOAD: Client "192.168.0.0", "/home/path/to/file", 0.00Kbyte/sec

I want to split it into the following sections for use later in the code:

{{Sun} {Oct} {24} {10:10:29} {2010}} [pid 2] {CONNECT}: Client "{192.168.0.0}"
{{Sun} {Oct} {24} {10:10:29} {2010}} [pid 1] {[user]} {OK LOGIN}: Client "{192.168.0.0}"
{{Sun} {Oct} {24} {10:10:29} {2010}} [pid 3] {[user]} {FAIL UPLOAD}: Client "{192.168.0.0}", "{/path/to/file}", 0.00Kbyte/sec

Currently I am using the split function to parse the logs, but this gives irregular results because the lines do not all have the same fields, and there are several different kinds of line.
foreach (<$FHIN>){
    my ($dow, $month, $dom, $time, $pid, $user, $status, $client, $ip, $file_path, $dl_speed) = split / /,$_;
}

Can someone please suggest a better method for parsing and splitting the log?
Thanks in advance.

Replies are listed 'Best First'.
Re: Parseing FTP Logs
by aquarium (Curate) on Oct 29, 2010 at 03:57 UTC
    Unless you will only ever be interested in a few particular lines of the FTP server log, I'd shy away from processing it without exhaustively covering the full log format. The format is covered in the documentation of the FTP server software, and because it is configurable, the log output could suddenly change to another format. I also think this log may behave like syslogd: if the same message occurs many times in quick succession, it will log something like "last message repeated 500 times" rather than logging each one.
    Therefore it might be best to use a tool already written to parse vsftpd logs, rather than writing one from scratch.
    the hardest line to type correctly is: stty erase ^H
Re: Parseing FTP Logs
by kcott (Archbishop) on Oct 29, 2010 at 00:42 UTC

    A better way would be to use a regular expression instead of split. By grouping the fields you want, you won't need throwaway variables like $pid and $client.

    If your use of braces (above) indicates your required fields, just change them to parentheses in the regex and you'll retrieve what you want. I note you seem to be nesting individual date-time components within a larger grouping: this works too, and you'd capture that with an additional variable at the front of the current list: my ($full_date_time, $dow, $month, $dom, ....
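
    As a rough sketch of that idea (not a complete vsftpd parser), the regex below covers only the three line shapes shown in the question. The capture names (ts, dow, status, ip, path and so on), the $LINE_RE variable and the parse_line helper are all illustrative assumptions, and named captures need Perl 5.10 or later.

```perl
use 5.010;
use strict;
use warnings;

# Assumption: only the three line shapes from the question are handled.
my $LINE_RE = qr{
    ^
    (?<ts>                              # whole timestamp, nested groups inside
        (?<dow>\w{3}) \s+
        (?<month>\w{3}) \s+
        (?<dom>\d{1,2}) \s+
        (?<time>\d{2}:\d{2}:\d{2}) \s+
        (?<year>\d{4})
    )
    \s+ \[pid \s+ \d+\]                 # pid is matched but not captured
    (?: \s+ \[ (?<user>[^\]]*) \] )?    # optional user field in brackets
    \s+ (?<status>[A-Z][A-Z ]*) :       # CONNECT, OK LOGIN, FAIL UPLOAD
    \s+ Client \s+ "(?<ip>[^"]*)"
    (?: , \s+ "(?<path>[^"]*)" )?       # optional quoted file path
}x;

# Hypothetical helper: returns a hash ref of the fields that matched,
# or undef when the line has a shape this sketch does not know about.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ $LINE_RE;
    my %rec = %+;                       # copy the named captures
    return \%rec;
}

my @samples = (
    'Sun Oct 24 10:10:29 2010 [pid 2] CONNECT: Client "192.168.0.0"',
    'Sun Oct 24 10:10:29 2010 [pid 1] [user] OK LOGIN: Client "192.168.0.0"',
    'Sun Oct 24 10:10:30 2010 [pid 3] [user] FAIL UPLOAD: Client "192.168.0.0", "/home/path/to/file", 0.00Kbyte/sec',
);

for my $line (@samples) {
    my $rec = parse_line($line) or do { warn "unparsed: $line\n"; next };
    printf "%s | %s | %s | %s | %s\n",
        $rec->{ts}, $rec->{user} // '-', $rec->{status},
        $rec->{ip}, $rec->{path} // '-';
}
```

    Lines that do not match are reported rather than silently skipped, so new line formats show up as soon as they appear in the log.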

    You say there is a variety of line formats but only show three: is that the extent of the variation?

    Have a go at the regex. If you encounter further difficulties, post what you've tried and we can look at it further.

    There's a fair amount of documentation on regular expressions at perldoc.perl.org - check under the Tutorials and Reference Manual sections.

    -- Ken

Re: Parseing FTP Logs
by jethro (Monsignor) on Oct 29, 2010 at 00:52 UTC

    Well, there is Parse::RecDescent, but that seems to be overkill for your problem. If this is all the variation you get you are just a few 'if's away from a solution:

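    foreach (<$FHIN>) {
        # splitting on whitespace also yields the year and the tokens "[pid" and "N]"
        my ($dow, $month, $dom, $time, $year, $pid_tag, $pid, @next) = split;
        my $user = shift @next;
        if ($user !~ /^\[/) {    # no "[user]" field on this line
            unshift @next, $user;
            $user = '';
        }
        my $status = shift @next;
        $status .= ' ' . shift @next until ($status =~ /:$/);
        my $client = shift @next;
        ...
    }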

    As you can see it is not very difficult if the variations are few and easy to differentiate.

    If you have more variations, I would recommend using a finite state machine. If you search this site for that name or "FSM", you should find a few nodes that explain what that is.
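
    To make the state machine idea concrete, here is a minimal sketch that walks the whitespace-separated tokens of one line and changes state as it goes. The state names (DATE, PID, USER, STATUS, CLIENT, ARGS) and the parse_fsm helper are made up for illustration, and the transitions are inferred only from the three sample lines in the question, so a real log will probably need more states.

```perl
use strict;
use warnings;

# Hypothetical token-driven state machine for one log line.
# Returns a hash ref, or undef if the line ends before the ARGS state.
sub parse_fsm {
    my ($line) = @_;
    my %rec   = (user => '', status => '', args => []);
    my $state = 'DATE';
    my @date;

    for my $tok (split ' ', $line) {
        if ($state eq 'DATE') {
            push @date, $tok;
            $state = 'PID' if @date == 5;         # dow month dom time year
        }
        elsif ($state eq 'PID') {
            $state = 'USER' if $tok =~ /\]$/;     # skip "[pid" and "N]"
        }
        elsif ($state eq 'USER' && $tok =~ /^\[(.*)\]$/) {
            $rec{user} = $1;
            $state = 'STATUS';
        }
        elsif ($state eq 'USER' || $state eq 'STATUS') {
            # No bracketed user token means the status words start here.
            (my $word = $tok) =~ s/:$//;
            $rec{status} = length($rec{status}) ? "$rec{status} $word" : $word;
            $state = $tok =~ /:$/ ? 'CLIENT' : 'STATUS';
        }
        elsif ($state eq 'CLIENT') {
            $state = 'ARGS';                      # skip the literal word "Client"
        }
        else {                                    # ARGS: ip, path, rate, ...
            (my $arg = $tok) =~ s/^"|",?$//g;     # strip quotes and trailing comma
            push @{ $rec{args} }, $arg;
        }
    }

    return unless $state eq 'ARGS';
    $rec{date} = join ' ', @date;
    return \%rec;
}

my $rec = parse_fsm('Sun Oct 24 10:10:29 2010 [pid 1] [user] OK LOGIN: Client "192.168.0.0"');
print "$rec->{date}: $rec->{user} $rec->{status} from $rec->{args}[0]\n" if $rec;
```

    Because this sketch splits on whitespace, a quoted file path that contains spaces would be broken into several arguments. Handling that would mean adding a state for "inside a quoted string".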