fs has asked for the wisdom of the Perl Monks concerning the following question:

I've got a big XML stream (several hundred megs) I need to process, specifically the XML version of a large Argus capture. The document will essentially look like:
<?xml version="1.0"?>
<ArgusDataStream>

     <ArgusFlowRecord>
              contents of each Flow
     </ArgusFlowRecord>

      Several hundred thousand more ArgusFlowRecords...

</ArgusDataStream>
So what I need is something that can take this, preferably as a stream on STDIN, and process each ArgusFlowRecord individually without having to slurp the whole behemoth into memory first. I've searched around on CPAN, but couldn't find anything that looked like it would do this nicely. Any suggestions?

Replies are listed 'Best First'.
Re: Record based XML stream processing?
by mirod (Canon) on Jan 24, 2003 at 14:45 UTC

    Before I start my obligatory XML::Twig pitch... ;--)

    There are a couple of modules on CPAN that are specifically designed to process record-oriented XML:

    • XML::RAX: RAX is described in "RAX, an XML Database API". With XML::RAX you just go through your document, one record at a time: while ( $rec = $R->readRecord() ) { print "Phone = " . $rec->getField('Phone') . "\n"; }. Note that this works only if each record is "flat" (each element within the record is a single value).
    • XML::SAX::Machines also has a record mode, which looks suspiciously like what XML::Twig does... except it's SAX-based and integrates nicely with other SAX processing. I haven't tried this module (yet) but it looks interesting.

    Finally, XML::Twig was designed especially for this kind of problem. You can easily get rid of parts of the document after you are done processing them: set a twig_handler for each ArgusFlowRecord, which calls a sub that has access to everything inside the element, then call purge (or flush, if you need to output the updated element) and get your memory back.

    See the tutorial; section 4.3 is what you are looking for (note that since I wrote the tutorial I added a field method that is equivalent to first_child->text but looks nicer).
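    The handler-plus-purge pattern described above can be sketched as follows. This is a minimal illustration, not code from the thread: the sample data is made up, and on the real capture you would parse STDIN instead of an inline string.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my @records;    # collect record text just for this demo

    my $twig = XML::Twig->new(
        twig_handlers => {
            # called once per complete ArgusFlowRecord element
            ArgusFlowRecord => sub {
                my ($t, $record) = @_;
                push @records, $record->text;    # real code would process here
                $t->purge;    # release the memory used by this record
            },
        },
    );

    # On the real capture you would stream from STDIN instead:
    # $twig->parse( \*STDIN );
    my $sample = '<ArgusDataStream>'
               . '<ArgusFlowRecord>flow 1</ArgusFlowRecord>'
               . '<ArgusFlowRecord>flow 2</ArgusFlowRecord>'
               . '</ArgusDataStream>';
    $twig->parse($sample);

    print "$_\n" for @records;    # prints "flow 1" then "flow 2"
    ```

    Because purge is called after each record, memory use stays proportional to one record, not to the whole stream.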

Re: Record based XML stream processing?
by davorg (Chancellor) on Jan 24, 2003 at 13:53 UTC

    For stream-based XML processing you should probably be looking at XML::SAX. XML::LibXML and XML::Parser also support this mode.

    And mirod will almost certainly be along very soon to show you how to do it with XML::Twig :)
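    A rough sketch of the SAX approach, assuming the stock XML::SAX distribution (the class name FlowHandler and the sample data are my own illustration, not from Argus):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Minimal SAX handler: accumulate text inside each ArgusFlowRecord.
    package FlowHandler;
    use base 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'ArgusFlowRecord') {
            $self->{in_record} = 1;
            $self->{text}      = '';
        }
    }

    sub characters {
        my ($self, $data) = @_;
        $self->{text} .= $data->{Data} if $self->{in_record};
    }

    sub end_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'ArgusFlowRecord') {
            $self->{in_record} = 0;
            # one complete record is available here; this demo just stores it
            push @{ $self->{records} }, $self->{text};
        }
    }

    package main;
    use XML::SAX::ParserFactory;

    my $handler = FlowHandler->new;
    my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );

    # On the real capture: $parser->parse_file( \*STDIN );
    $parser->parse_string(
        '<ArgusDataStream><ArgusFlowRecord>flow 1</ArgusFlowRecord></ArgusDataStream>'
    );
    print "$_\n", for @{ $handler->{records} };
    ```

    Only the current record's text is ever held in memory, which is the whole point of the event-based style.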

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Record based XML stream processing?
by dug (Chaplain) on Jan 24, 2003 at 14:50 UTC
    davorg has good advice here. I asked the very same question a while back. PodMaster beat mirod to the punch plugging XML::Twig, which I am using in production today :-) That node also has a simple SAX version (that I'm still not quite comfortable with).

    It should be noted that the regex-based solutions in this thread are brittle, and will break on things such as an <ArgusFlowRecord> closing tag appearing inside a CDATA section.
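    For example, a scanner that simply matches the closing-tag string would split this (hypothetical) record in the wrong place, even though it is perfectly well-formed XML:

    ```xml
    <ArgusFlowRecord>
      <note><![CDATA[ raw payload containing </ArgusFlowRecord> ]]></note>
    </ArgusFlowRecord>
    ```

    A real parser sees one record; a string match stops at the first occurrence of the closing-tag text inside the CDATA section and hands an invalid fragment downstream.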

    So, take davorg's advice. Use a real parser. But if you don't, here's a little snippet that I use when speed is of the essence. It's probably broken in some way that I haven't realized yet (and some that I have). It expects its input from STDIN.

    #!/usr/bin/perl
    use warnings;
    use strict;
    $|++;

    use XML::LibXML;

    my $parser           = XML::LibXML->new();
    my $closing_root_tag = '</ArgusFlowRecord>';
    my $skip_past        = '<ArgusDataStream>';

    {
        $XML::LibXML::skipXMLDeclaration = 1;
        local $/ = $skip_past;
        <>;    # ignore declaration and stream open tag

        local $/ = $closing_root_tag;
        my $temp_chunk;
        while ( <> ) {
            $temp_chunk .= $_;
            my $dom;
            eval { $dom = $parser->parse_string( $temp_chunk ) };
            next if $@;    # keep nibbling if XML is invalid
            undef $temp_chunk;
            print $dom->toString(), "\n";    # or do other processing
        }
    }

    -- dug
Re: Record based XML stream processing?
by robartes (Priest) on Jan 24, 2003 at 13:42 UTC
    One simple thing you could do is just build your own parsing code around something like XML::Simple:
    use strict;
    use XML::Simple;
    use Data::Dumper;

    while (<>) {
        next if /xml|ArgusDataStream/;
        my $xml = $_;
        $xml .= <> while $xml !~ m|</ArgusFlowRecord>|;
        my $xmlhash = XMLin($xml);
        print Dumper($xmlhash);
    }
    __END__
    $ xmlstuff.pl < xmlinput
    $VAR1 = '
    Flow data set 1
    ';
    $VAR1 = '
    Flow data set 2
    ';
    where the file xmlinput contained:
    <?xml version="1.0"?>
    <ArgusDataStream>
    <ArgusFlowRecord>
    Flow data set 1
    </ArgusFlowRecord>
    <ArgusFlowRecord>
    Flow data set 2
    </ArgusFlowRecord>
    </ArgusDataStream>
    Your XML processing will of course be completely different (and probably not use XML::Simple), but this should give you an idea on how to proceed.

    Update: this is basically the same as Marcello's solution, and is probably better handled with an event-based parser such as XML::SAX::PurePerl, as per jeffa's suggestion in the chatterbox.

    CU
    Robartes-

Re: Record based XML stream processing?
by Marcello (Hermit) on Jan 24, 2003 at 13:26 UTC
    I would try something like (untested):

    use constant END_TAG => qr#</ArgusFlowRecord>#;

    my $xml = "";
    while (<STDIN>) {
        my $line = $_;
        $xml .= $line;
        if ($line =~ END_TAG) {
            # Process the XML here...

            # Start again
            $xml = "";
        }
    }
Re: Record based XML stream processing?
by digitalnoises (Initiate) on Jan 24, 2003 at 16:15 UTC
    XML::Twig