jalewis2 has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to improve my coding skills by reading files by records instead of just by lines. I'm having trouble with lookaheads and was hoping someone could point me in the right direction.

I have data in the format below. It consists of records on multiple lines with multiple fields. I've attempted to just keep splitting the data by space or colon or whatever the separator is, but I think a lookahead would be faster with less code.
    messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:58" duration=5 policy_id=0 service=tcp/port:8000 proto
    =6 src zone=Trust dst zone=Untrust action=Permit sent=1034 rcvd=19829 src=10.14.94.221 dst=10.14.90.217 src_port=1059 dst_port=8000 translated ip=10.14.93.7 port=1223

    messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:59" duration=4 policy_id=0 service=tcp/port:8000 proto
    =6 src zone=Trust dst zone=Untrust action=Permit sent=722 rcvd=520 src=10.14.94.221 dst=10.14.90.217 src_port=1060 dst_port=8000 translated ip=10.14.93.7 port=1224

In the above, I'd like to read each record and put its fields in a hash, with the left side of each = as the label and the right side as the data, like so: $hash{$label} = $data; Then, after each record, I can print the labels I'm interested in. This is what I have so far:

    while ( /^(.+)\=(.+?)(?=^.+?:)/mgso ) {
        $label = $1;
        $data  = $2;
    }

I'm just not getting lookaheads and have been having trouble finding an example like my data. Any pointers?

Replies are listed 'Best First'.
Re: Parsing text files with a regex lookahead
by graff (Chancellor) on Sep 17, 2007 at 22:03 UTC
    In your sample data, it looks like the parameter value (right hand side of each "=") is either a single word token or else it's a quoted string containing spaces. There is also a layer of structure defined by colons with following whitespace.

    These features can be used to good advantage in using "split" to do a "first-level" parse of the record, and thereby get around some of the difficulties mentioned in the earlier reply.

    I'm not sure exactly what sort of structure you want as output, but here's one approach, which you can probably tweak to suit your taste:

        #!/usr/bin/perl
        use strict;
        use Data::Dumper;

        $/ = 'messages:';
        while ( <DATA> ) {
            my %struct = ();
            my ( $timestamp, @chunks ) = split( /(\S+:)\s+/ );
            while ( @chunks ) {
                my $topkey = shift @chunks;
                my $data   = shift @chunks;
                while ( $data =~ s/^(.*?)=//s ) {
                    ( my $subkey = $1 ) =~ s/\s+$//;
                    if ( $data =~ s/^"([^"]+)"\s+// ) {
                        $struct{$topkey}{$subkey} = $1;
                    }
                    else {
                        $data =~ s/^(\S+)\s+//;
                        $struct{$topkey}{$subkey} = $1;
                    }
                }
            }
            print "\nRecord $.: $timestamp\n", Dumper( \%struct );
        }
        __DATA__
        messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:58" duration=5 policy_id=0 service=tcp/port:8000 proto
        =6 src zone=Trust dst zone=Untrust action=Permit sent=1034 rcvd=19829 src=10.14.94.221 dst=10.14.90.217 src_port=1059 dst_port=8000 translated ip=10.14.93.7 port=1223

        messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:59" duration=4 policy_id=0 service=tcp/port:8000 proto
        =6 src zone=Trust dst zone=Untrust action=Permit sent=722 rcvd=520 src=10.14.94.221 dst=10.14.90.217 src_port=1060 dst_port=8000 translated ip=10.14.93.7 port=1224

    (update: added $timestamp in the print statement, which shows that it's not just a timestamp, but also an IP address.)

    That gives you a hash structure (HoH) on each record / iteration. Maybe you want to push those onto an array? And/or maybe you don't need all the information?
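
    A minimal sketch of the push-onto-an-array idea, keeping only a few fields per record. The two hashes in @parsed are hand-written stand-ins shaped like the per-record %struct (top-level key, then label => data), and @wanted is a hypothetical choice of labels; neither comes from the original post.

```perl
use strict;
use warnings;

# Stand-ins for the per-record %struct built by the parsing loop.
my @parsed = (
    {
        'system-notification-00257(traffic):' => {
            'start_time' => '2002-12-17 09:45:58',
            'src'        => '10.14.94.221',
            'dst_port'   => '8000',
            'sent'       => '1034',
        },
    },
    {
        'system-notification-00257(traffic):' => {
            'start_time' => '2002-12-17 09:45:59',
            'src'        => '10.14.94.221',
            'dst_port'   => '8000',
            'sent'       => '722',
        },
    },
);

my @wanted = ( 'start_time', 'src', 'dst_port' );    # hypothetical selection
my @records;

for my $struct (@parsed) {
    # Flatten the two-level hash; the top-level keys are dropped here.
    my %flat;
    %flat = ( %flat, %$_ ) for values %$struct;

    # Keep only the labels of interest (a hash slice on both sides).
    my %keep;
    @keep{@wanted} = @flat{@wanted};
    push @records, \%keep;
}

# One tab-separated line per record; '-' marks a label that was missing.
for my $r (@records) {
    print join( "\t", map { defined $r->{$_} ? $r->{$_} : '-' } @wanted ), "\n";
}
```

    Flattening assumes the same label never appears under two different top-level keys in one record; if it can, keep the two-level structure instead.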

    In any case, I don't think look-ahead regexes are needed here (though I'm sure there are ways to do so, and these might even make for more legible logic).
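
    For comparison, here is a rough sketch of one place where a lookahead does fit: splitting the input into records without consuming the "messages:" marker. It assumes every value is either a double-quoted string or a single token without whitespace, and that a label is one word optionally preceded by "src", "dst" or "translated"; those assumptions come from the sample only. The shortened $log string and the parse_record name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample in the same shape as the log above.
my $log = join '',
    qq{messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp\n},
    qq{start_time="2002-12-17 09:45:58" duration=5 proto\n},
    qq{=6 src zone=Trust sent=1034\n},
    qq{\n},
    qq{messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp\n},
    qq{start_time="2002-12-17 09:45:59" duration=4 proto\n},
    qq{=6 src zone=Trust sent=722\n};

# Zero-width split: break just before each "messages:" at the start of a
# line. The lookahead matches a position, so the marker stays in the record.
my @records = grep { /\S/ } split /^(?=messages:)/m, $log;

# Collect label => data pairs from one record.
sub parse_record {
    my ($text) = @_;
    my %field;
    while (
        $text =~ /
            ( (?: (?:src|dst|translated) \s )? [\w-]+ )   # label, e.g. "src zone"
            \s* = \s*                                     # tolerates "proto\n=6"
            (?: "([^"]*)" | (\S+) )                       # quoted or bare value
        /xg
      )
    {
        $field{$1} = defined $2 ? $2 : $3;
    }
    return \%field;
}

for my $rec (@records) {
    my $f = parse_record($rec);
    print join( ', ', map { "$_=$f->{$_}" } 'start_time', 'src zone', 'proto', 'sent' ), "\n";
}
```

    The field matching itself needs no lookahead in this sketch; the alternation between a quoted and a bare value does that work.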

    Another update: It occurs to me that you might run into some data records where there are line breaks in awkward places (other than the particular awkward spot shown in your data sample, between "proto" and "=6"). If that's the case, I think the code above will still do the right thing, but in the absence of appropriate test data, it's hard to be sure...

      Appreciate the suggestions. I always learn something when someone else takes a stab at the code.

      The main problem is the format changes in each record, so one pass through the line doesn't cut it. I briefly thought about breaking up the splits, but didn't think it would work in every case. After seeing your code, I might try again.
Re: Parsing text files with a regex lookahead
by eff_i_g (Curate) on Sep 17, 2007 at 20:48 UTC
    I came up with the following, which isn't 100% there. The difficulty is that some keys use underscores ("device_id") and others do not ("src zone"). Is there a way to fix this in the application/configuration? Am I correct in assuming that "proto" goes with "6"?

    Update: New pattern...
        / ((?:src\s|dst\s|translated\s)?\S+) \n? = (")? ((?:(?(2)[^"]|\S)(?!\S+=))+) /xmg;

        use warnings;
        use strict;
        use Data::Dumper;

        my %hash;
        my @records;
        my $record;

        {
            local $/ = "\n\n";
            @records = <DATA>;
        }

        for (@records) {
            ++$record;
            $hash{$record}{$1} = $3 while $_ =~ /(\S+)=(")?((?:(?(2)[^"]|\S)(?!\S+=))+)/mg;
        }

        print Data::Dumper->Dump([\%hash]);

        __DATA__
        messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:58" duration=5 policy_id=0 service=tcp/port:8000 proto
        =6 src zone=Trust dst zone=Untrust action=Permit sent=1034 rcvd=19829 src=10.14.94.221 dst=10.14.90.217 src_port=1059 dst_port=8000 translated ip=10.14.93.7 port=1223

        messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:59" duration=4 policy_id=0 service=tcp/port:8000 proto
        =6 src zone=Trust dst zone=Untrust action=Permit sent=722 rcvd=520 src=10.14.94.221 dst=10.14.90.217 src_port=1060 dst_port=8000 translated ip=10.14.93.7 port=1224

      Those are NetScreen firewall logs, so there's no chance of modifying the output. I sometimes wonder what the developers of these devices are thinking when they make them.

      Thanks for the try, I have some ideas to work with now.