In your sample data, it looks like the parameter value (the right-hand side of each "=") is either a single word token or a quoted string containing spaces. There is also a layer of structure defined by colons followed by whitespace.

These features make it easy to use "split" for a "first-level" parse of the record, which gets around some of the difficulties mentioned in the earlier reply.

I'm not sure exactly what sort of structure you want as output, but here's one approach, which you can probably tweak to suit your taste:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Every record starts with the literal tag "messages:", so use that as the
# input record separator instead of newline.
$/ = 'messages:';

while ( <DATA> ) {
    chomp;               # strip the trailing "messages:" separator
    next unless /\S/;    # the first read is empty: nothing precedes the first tag

    my %struct = ();

    # First-level parse: split on "word:" followed by whitespace. The capturing
    # parens make split return the matched keys along with the text between them.
    my ( $timestamp, @chunks ) = split( /(\S+:)\s+/ );

    while ( @chunks ) {
        my $topkey = shift @chunks;
        my $data   = shift @chunks;
        next unless defined $data;

        # Second-level parse: peel "key=value" pairs off the front of $data.
        while ( $data =~ s/^(.*?)=//s ) {
            ( my $subkey = $1 ) =~ s/\s+$//;
            if ( $data =~ s/^"([^"]+)"\s*// ) {     # quoted value, may contain spaces
                $struct{$topkey}{$subkey} = $1;
            }
            elsif ( $data =~ s/^(\S+)\s*// ) {      # bare single-token value
                $struct{$topkey}{$subkey} = $1;
            }
        }
    }
    print "\nRecord $.: $timestamp\n", Dumper( \%struct );
}
__DATA__
messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:58" duration=5 policy_id=0 service=tcp/port:8000 proto
=6 src zone=Trust dst zone=Untrust action=Permit sent=1034 rcvd=19829 src=10.14.94.221 dst=10.14.90.217 src_port=1059 dst_port=8000 translated ip=10.14.93.7 port=1223
messages:Dec 17 09:41:08 10.14.93.7 ns5xp: NetScreen device_id=ns5xp system-notification-00257(traffic): start_time="2002-12-17 09:45:59" duration=4 policy_id=0 service=tcp/port:8000 proto
=6 src zone=Trust dst zone=Untrust action=Permit sent=722 rcvd=520 src=10.14.94.221 dst=10.14.90.217 src_port=1060 dst_port=8000 translated ip=10.14.93.7 port=1224

(update: added $timestamp in the print statement, which shows that it's not just a timestamp, but also an IP address.)

That gives you a hash structure (HoH) on each record / iteration. Maybe you want to push those onto an array? And/or maybe you don't need all the information?
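
If you do want to collect the records and drop the fields you don't need, here is a minimal sketch. The helper name keep_fields is my own invention, and the two records are built by hand to mimic the shape of %struct (using values from your sample) so that the snippet runs on its own:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): copy only the wanted
# subkeys of one top-level key out of a parsed record.
sub keep_fields {
    my ( $struct, $topkey, @fields ) = @_;
    my %slim;
    for my $f (@fields) {
        $slim{$f} = $struct->{$topkey}{$f} if exists $struct->{$topkey}{$f};
    }
    return \%slim;
}

# Two hand-built records shaped like the per-record hash of hashes.
my $topkey = 'system-notification-00257(traffic):';
my @parsed = (
    { $topkey => { src => '10.14.94.221', dst => '10.14.90.217', src_port => 1059, sent => 1034 } },
    { $topkey => { src => '10.14.94.221', dst => '10.14.90.217', src_port => 1060, sent => 722 } },
);

# Push one slimmed-down hash per record onto an array.
my @records;
push @records, keep_fields( $_, $topkey, qw(src src_port sent) ) for @parsed;

printf "src=%s src_port=%s sent=%s\n", @{$_}{qw(src src_port sent)} for @records;
```

In the real loop you would call something like this just before (or instead of) the print statement, passing \%struct.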

In any case, I don't think look-ahead regexes are needed here (though I'm sure there are ways to use them, and these might even make for more legible logic).

Another update: It occurs to me that you might run into some data records where there are line breaks in awkward places (other than the particular awkward spot shown in your data sample, between "proto" and "=6"). If that's the case, I think the code above will still do the right thing, but in the absence of appropriate test data, it's hard to be sure...
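
One way to probe this without real data is to feed made-up fragments to the key=value loop on its own. The sketch below is only that: parse_pairs is a hypothetical helper using the same "peel off the front" substitutions, plus one extra step (my addition) that skips whitespace right after the "=". The test strings are invented for illustration, not taken from any log.

```perl
use strict;
use warnings;

# Hypothetical helper: the same key=value peeling as in the script, plus one
# extra step that skips whitespace (including newlines) right after "=".
sub parse_pairs {
    my ($data) = @_;
    my %pairs;
    while ( $data =~ s/^(.*?)=//s ) {
        ( my $key = $1 ) =~ s/^\s+|\s+$//g;    # trim, so "proto\n" becomes "proto"
        $data =~ s/^\s+//;                     # tolerate a break right after "="
        if ( $data =~ s/^"([^"]*)"\s*// or $data =~ s/^(\S+)\s*// ) {
            $pairs{$key} = $1;
        }
    }
    return \%pairs;
}

# Made-up fragments with a line break in different places.
my %cases = (
    'break before "="'    => "proto\n=6 action=Permit",
    'break after "="'     => "proto=\n6 action=Permit",
    'break between pairs' => "proto=6\naction=Permit",
);

for my $name ( sort keys %cases ) {
    my $p = parse_pairs( $cases{$name} );
    print "$name: proto=$p->{proto} action=$p->{action}\n";
}
```

As far as I can tell, a break immediately after "=" is the one case the script above would not handle, because its value patterns are anchored at the start of the remaining data. A break in the middle of a bare token can't be repaired by either version.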


In reply to Re: Parsing text files with a regex lookahead by graff
in thread Parsing text files with a regex lookahead by jalewis2
