in reply to Design hints for a file processor

Actually, this kinda looks like XML to me, so I'd be tempted to try to use XML::Twig.

Ok, so some XML purists out there may cry, "but it has no tags!" At this point, they're nitpicking. It DOES have tags. Just not in the XML format. "Oh, but so you admit it isn't XML!" I never claimed it was XML, merely that it kinda looks like XML.

What I'd be tempted to do is write an IO::Handle-derived object to convert the input file, line by line, into XML. And then use XML::Twig to handle the actual data. Especially as you say this is "up to half a gigabyte" - XML::Twig can handle that just fine.

What I don't know is if XML::Twig needs to backtrack in a file... but if it doesn't, you just basically have to convert:

s{BEGIN \s (\w+)}{<$1>}x or s{END \s (\w+)}{</$1>}x or s{(\w+) \s ("\w+")}{<$1 value=$2 />}x;
and then pass the line to XML::Twig. (Ok, you may need to exclude the quotes on the last one, and then use proper XML escaping in case there are funny characters there, but the idea is there.) By putting your actual code in the proper end-tag handlers for XML::Twig, and flushing as appropriate, you should use very little memory while not having to do much heavy lifting yourself.
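To make the conversion step a bit more concrete, here's a rough sketch using only core Perl. It's a guess at the shape, not tested against a real export: the helper names (xml_escape, line_to_xml, convert_stream) and the <dsx> wrapper element are made up for illustration, and it assumes every value sits on a single line inside double quotes and that every tag name is a legal XML name. The wrapper is there because XML allows only one root element, so a file with several top-level BEGIN blocks would otherwise not be well-formed.

```perl
use strict;
use warnings;

# Escape the characters that are special inside an XML attribute value.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Turn one input line into an XML fragment (hypothetical helper).
# Returns undef in scalar context for lines it does not recognise.
sub line_to_xml {
    my ($line) = @_;
    return "<$1>"  if $line =~ /^\s*BEGIN\s+(\w+)\s*$/;
    return "</$1>" if $line =~ /^\s*END\s+(\w+)\s*$/;
    if ($line =~ /^\s*(\w+)\s+"(.*)"\s*$/) {
        my ($name, $value) = ($1, $2);
        return sprintf '<%s value="%s" />', $name, xml_escape($value);
    }
    return;
}

# Read lines from $in and write XML to $out, wrapped in one root element.
# The root name "dsx" is an assumption; pick whatever suits your handlers.
sub convert_stream {
    my ($in, $out) = @_;
    print {$out} "<dsx>\n";
    while (my $line = <$in>) {
        chomp $line;
        my $xml = line_to_xml($line);
        print {$out} "$xml\n" if defined $xml;   # silently skips unknown lines
    }
    print {$out} "</dsx>\n";
    return;
}
```

The output handle could be a pipe or a temporary file that XML::Twig then parses. Unrecognised lines are dropped here, which is probably too forgiving for real data; you might prefer to die on them so that surprises in the format show up early.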

Why write your own parser when there is a perfectly good parser already out there? :-)

PS: I'd also want to ask IBM when they'll start supporting XML output here ;-)

Re^2: Design hints for a file processor
by PhilHibbs (Hermit) on Jul 09, 2008 at 12:43 UTC
    PS: I'd also want to ask IBM when they'll start supporting XML output here ;-)
    They kind of do - you can export the whole project as XML, or a whole category of jobs from within their GUI client, but you can't export individual jobs as XML, only as the legacy DSX format. The latest version might have improved on this.