Actually, this kinda looks like XML to me, so I'd be tempted to try to use XML::Twig.

Ok, so some XML purists out there may cry, "but it has no tags!" At this point, they're nitpicking. It DOES have tags. Just not in the XML format. "Oh, but so you admit it isn't XML!" I never claimed it was XML, merely that it kinda looks like XML.

What I'd be tempted to do is write an IO::Handle-derived object to convert the input file, line by line, into XML. And then use XML::Twig to handle the actual data. Especially as you say this is "up to half a gigabyte" - XML::Twig can handle that just fine.

What I don't know is if XML::Twig needs to backtrack in a file... but if it doesn't, you just basically have to convert:

s{BEGIN \s (\w+)}{<$1>}x or s{END \s (\w+)}{</$1>}x or s{(\w+) \s ("\w+")}{<$1 value=$2 />}x;
and then pass the line to XML::Twig. (Ok, you may need to exclude the quotes on the last one, and then use proper XML escaping in case there are funny characters there, but the idea is here.) By putting your actual code in the proper end-tag handlers for XML::Twig, and flushing as appropriate, you should use very little memory while not having to do much heavy lifting yourself.
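
To make the conversion step a bit more concrete, here is a minimal sketch of the line-to-XML part only, in plain Perl with no modules. The input format is an assumption on my part (one "BEGIN name", "END name", or name "value" per line), and line_to_xml and xml_escape are hypothetical helper names, not anything from XML::Twig. Unlike the one-liner above, it drops the quotes from the capture and escapes the value.

```perl
use strict;
use warnings;

# Escape the characters that are special inside an XML attribute value.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Convert one line of the ASSUMED input format into an XML fragment.
# Hypothetical helper: the real file may need looser or stricter patterns.
sub line_to_xml {
    my ($line) = @_;
    chomp $line;

    # "BEGIN foo" opens an element, "END foo" closes it.
    return "<$1>"  if $line =~ /^\s* BEGIN \s+ (\w+) \s*$/x;
    return "</$1>" if $line =~ /^\s* END   \s+ (\w+) \s*$/x;

    # name "value" becomes an empty element with a value attribute.
    if ($line =~ /^\s* (\w+) \s+ "([^"]*)" \s*$/x) {
        my ($name, $value) = ($1, $2);
        return sprintf '<%s value="%s"/>', $name, xml_escape($value);
    }

    # Blank or unrecognised lines are silently dropped in this sketch.
    return '';
}

# Made-up sample input, purely for illustration.
my @input = ('BEGIN record', 'name "A & B"', 'END record');
print join('', map { line_to_xml($_) } @input), "\n";
```

Each fragment that comes out of line_to_xml is what the IO::Handle-derived wrapper would hand on to the parser. Real data would also need a single root element around the whole stream, which this sketch does not add.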

Why write your own parser when there is already a perfectly good one out there? :-)

PS: I'd also want to ask IBM when they'll start supporting XML output here ;-)


In reply to Re: Design hints for a file processor by Tanktalus
in thread Design hints for a file processor by PhilHibbs
