Best way to parse multiline data?

jalewis2 has asked for the wisdom of the Perl Monks concerning the following question:

I've parsed data like this before, but I have a feeling that I am doing it isn't the most efficient.

I am setting flags for each section and then reading lines for that section until the flag changes.

Here is a sample of the data

It's RPSL data in case anyone is already familiar.

aut-num:       AS19710
as-name:       ASN
descr:         S4R
admin-c:       SNE1
tech-c:        SNE1
import:        from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; 
+med=50;
               from AS3356 63.215.86.133 at 63.215.86.134 action pref=
+50; med=150;
               accept ANY
import:        from AS3847 action pref=10;
               accept ANY
export:        to AS3847
               announce AS19710
export:        to AS3356
               announce AS19710
notify:        nwcontact@email
mnt-by:        S4R
changed:       andy@email 20010502
source:        LEV
[download]

My problem is sections like import: and export: that span multiple lines without a section heading at the start of each line. Is there a standard method for handling something like this?

Comment on Best way to parse multiline data? Download Code

Replies are listed 'Best First'.
Re: Best way to parse multiline data? by dragonchild (Archbishop) on Apr 16, 2005 at 02:51 UTC
Don't slurp 600M - that translates to roughly 3G+ of RAM. Not a good plan, to say the least. Read it line by line, but keep track of which section you're in and if the section hasn't change, just concatenate. Then, when the section changes, go ahead and process the last section, then set the current section to the new one. `my ($section, @value); while (<FILE>) { chomp; my ( $temp_section, $temp_value ) = /^(?:([^:]+):)?\s(.)/; if ( $temp_section ) { if ( $section ) { # Process the old value somehow. } $section = $temp_section; @value = ( $temp_value ); } else { push @value, $temp_value; } }` [download] My wife's blog	[reply] [d/l]
Re^2: Best way to parse multiline data? by BrowserUk (Patriarch) on Apr 16, 2005 at 03:39 UTC
Don't slurp 600M - that translates to roughly 3G+ of RAM. Not if you slurp to a scalar. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. Rule 1 has a caveat! -- Who broke the cabal?	[reply] [d/l]
Re^3: Best way to parse multiline data? by dragonchild (Archbishop) on Apr 18, 2005 at 12:36 UTC
It's still 600M + 41 bytes. :-) My wife's blog	[reply]
Re^2: Best way to parse multiline data? by jalewis2 (Monk) on Apr 16, 2005 at 03:16 UTC
I hadn't considered slurping for that exact reason. The regex is what gave me some ideas. I've never done read ahead regexes. I figured someone here at PM would steer me in the right direction. You sample code is exactly what I am looking for. Thanks!	[reply]
Re: Best way to parse multiline data? by BrowserUk (Patriarch) on Apr 16, 2005 at 02:04 UTC
If your files are small, then slurping and using a regex with /s and a lookahead is the easiest way: #! perl -slw use strict; my $data = do{ local $/; <DATA> }; print "'$1'$2'\n" while $data =~ m[([a-z-]+):(.*?)\n(?=[a-z-]+:\|$)]sg; __DATA__ aut-num: AS19710 as-name: ASN descr: S4R admin-c: SNE1 tech-c: SNE1 import: from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; +med=50; from AS3356 63.215.86.133 at 63.215.86.134 action pref= +50; med=150; accept ANY import: from AS3847 action pref=10; accept ANY export: to AS3847 announce AS19710 export: to AS3356 announce AS19710 notify: nwcontact@email mnt-by: S4R changed: andy@email 20010502 source: LEV [download] At each iteration of the while loop, $1 will be the section header, and $2 the body of the section with all the whitespace intact. You can further process $2 to remove or reduce the whitespace as required. `P:\test>448390 'aut-num' AS19710' 'as-name' ASN' 'descr' S4R' 'admin-c' SNE1' 'tech-c' SNE1' 'import' from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; + med=50; from AS3356 63.215.86.133 at 63.215.86.134 action pref= +50; med=150; accept ANY' 'import' from AS3847 action pref=10; accept ANY' 'export' to AS3847 announce AS19710' 'export' to AS3356 announce AS19710' 'notify' nwcontact@email' 'mnt-by' S4R' 'changed' andy@email 20010502' 'source' LEV'` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. Rule 1 has a caveat! -- Who broke the cabal?	[reply] [d/l] [select]
Re^2: Best way to parse multiline data? by jalewis2 (Monk) on Apr 16, 2005 at 02:43 UTC
The files can be large, up to 600MB in one case. I hadn't considered slurping. Thanks for the ideas!	[reply]