jalewis2 has asked for the wisdom of the Perl Monks concerning the following question:

I've parsed data like this before, but I have a feeling that the way I am doing it isn't the most efficient.

I am setting flags for each section and then reading lines for that section until the flag changes.

Here is a sample of the data:

It's RPSL data, in case anyone is already familiar with it.

aut-num:      AS19710
as-name:      ASN
descr:        S4R
admin-c:      SNE1
tech-c:       SNE1
import:       from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; med=50;
              from AS3356 63.215.86.133 at 63.215.86.134 action pref=50; med=150;
              accept ANY
import:       from AS3847 action pref=10; accept ANY
export:       to AS3847 announce AS19710
export:       to AS3356 announce AS19710
notify:       nwcontact@email
mnt-by:       S4R
changed:      andy@email 20010502
source:       LEV

My problem is with sections like import: and export: that span multiple lines without a section heading at the start of each line. Is there a standard method for handling something like this?

Re: Best way to parse multiline data?
by dragonchild (Archbishop) on Apr 16, 2005 at 02:51 UTC
    Don't slurp 600M - that translates to roughly 3G+ of RAM. Not a good plan, to say the least.

    Read it line by line, but keep track of which section you're in; if the section hasn't changed, just concatenate. Then, when the section changes, process the last section and set the current section to the new one.

    my ($section, @value);
    while (<FILE>) {
        chomp;
        # Capture an optional "section:" prefix, plus the rest of the line.
        my ( $temp_section, $temp_value ) = /^(?:([^:]+):)?\s*(.*)/;
        if ( $temp_section ) {
            if ( $section ) {
                # Process the old value somehow.
            }
            $section = $temp_section;
            @value   = ( $temp_value );
        }
        else {
            # No heading, so this line continues the current section.
            push @value, $temp_value;
        }
    }
    # Remember to process the final section here, after the loop ends.
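
    One way to flesh out the "process the old value" step (a sketch, not the poster's code; the hash-of-arrays layout is an assumption, chosen so repeated keys like import: keep every entry):

    my %record;
    sub process_section {
        my ($section, @value) = @_;
        # Fold the continuation lines back into one string and
        # accumulate under the key, since sections like import: repeat.
        push @{ $record{$section} }, join ' ', @value;
    }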
      Don't slurp 600M - that translates to roughly 3G+ of RAM.

      Not if you slurp to a scalar.
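
      For comparison, the two slurp styles side by side (a sketch; the overhead figures are rough assumptions, not measurements):

      # Slurping into a single scalar costs roughly the file size in RAM.
      my $data = do { local $/; <FILE> };

      # Slurping into an array of lines pays per-line bookkeeping overhead
      # on top of the data itself, which is where estimates like 3G+ for
      # a 600M file come from.
      my @lines = <FILE>;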


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
      I hadn't considered slurping, for that exact reason. The regex is what gave me some ideas; I've never used lookahead regexes before. I figured someone here at PM would steer me in the right direction.

      Your sample code is exactly what I am looking for. Thanks!
Re: Best way to parse multiline data?
by BrowserUk (Patriarch) on Apr 16, 2005 at 02:04 UTC

    If your files are small, then slurping and using a regex with /s and a lookahead is the easiest way:

    #! perl -slw
    use strict;

    my $data = do{ local $/; <DATA> };

    print "'$1'$2'\n" while $data =~ m[([a-z-]+):(.*?)\n(?=[a-z-]+:|$)]sg;

    __DATA__
    aut-num:      AS19710
    as-name:      ASN
    descr:        S4R
    admin-c:      SNE1
    tech-c:       SNE1
    import:       from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; med=50;
                  from AS3356 63.215.86.133 at 63.215.86.134 action pref=50; med=150;
                  accept ANY
    import:       from AS3847 action pref=10; accept ANY
    export:       to AS3847 announce AS19710
    export:       to AS3356 announce AS19710
    notify:       nwcontact@email
    mnt-by:       S4R
    changed:      andy@email 20010502
    source:       LEV

    At each iteration of the while loop, $1 will be the section header, and $2 the body of the section with all the whitespace intact. You can further process $2 to remove or reduce the whitespace as required.

    P:\test>448390
    'aut-num'      AS19710'
    'as-name'      ASN'
    'descr'        S4R'
    'admin-c'      SNE1'
    'tech-c'       SNE1'
    'import'       from AS3356 63.215.71.1 at 63.215.71.2 action pref=20; med=50;
                   from AS3356 63.215.86.133 at 63.215.86.134 action pref=50; med=150;
                   accept ANY'
    'import'       from AS3847 action pref=10; accept ANY'
    'export'       to AS3847 announce AS19710'
    'export'       to AS3356 announce AS19710'
    'notify'       nwcontact@email'
    'mnt-by'       S4R'
    'changed'      andy@email 20010502'
    'source'       LEV'
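
    If you would rather have each section on a single line, one possible post-processing of $2 (a sketch; the exact normalization you want is an assumption):

    while ( $data =~ m[([a-z-]+):(.*?)\n(?=[a-z-]+:|$)]sg ) {
        ( my $body = $2 ) =~ s/\s+/ /g;   # collapse runs of whitespace, newlines included
        $body =~ s/^ //;                  # drop the leading pad left after the colon
        print "'$1' $body'\n";
    }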

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
      The files can be large, up to 600MB in one case. I hadn't considered slurping.

      Thanks for the ideas!