Lorphos has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a large XML document with more than 100,000 records (so I don't want to read it all at once). Each record contains, among other elements, a category and a title. I need to change the category element of the record depending on certain aspects of the title element.

I was thinking about using XML::SAX::ByRecord for this. The part I'm having a problem with is accessing the contents of the title element (which comes after the category in the XML) while processing the category element. How is this done within this paradigm?
My guess is that once I see the category element I have to collect data until I come across the title element, modify the category and then somehow "release" all the stuff that I have collected. I'm not sure how this is done in an XML::SAX filter.

Best,
-Sven


Re: Modifying Records with XML::SAX::ByRecord
by samtregar (Abbot) on Aug 29, 2007 at 17:32 UTC
    You're on the right track. I've never used XML::SAX::ByRecord, but I used the technique you describe extensively while building XML::Validator::Schema. Usually it's a simple matter of keeping a stack of data in your filter object. So when you see a category you do:

    my $category_stack = $self->{category_stack} ||= [];
    push @$category_stack, { some => $category, data => $here };

    Then when you see a title and you're ready to finish work on the category you just pop it off:

    my $category_stack = $self->{category_stack};
    my $category_data  = pop @$category_stack;

    Using a stack allows categories to nest. Popping off the category data means the stack is conservative and will shrink as stored data is used. You definitely want to avoid building a data structure that can contain all the category data in your file.

    Does that make sense?

    -sam

      It sure makes sense to me, thank you! I'd have to use shift instead of pop to keep the elements in the order they appeared in.
      What I ended up doing is very similar to what you proposed:
      if ($collecting) {
          push @event_queue, sub { $self->SUPER::start_element($xml_data); };
      }
      then, later on I empty the queue by executing the closures:
      $_->() foreach @event_queue;   # replay the deferred events in order
      @event_queue = ();             # then clear the queue for the next record
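
      For readers who want to see the closure queue in context, here is a minimal sketch of an order-preserving filter built on that idea. It is an illustration under stated assumptions, not the code actually used in this thread: the package names XML::Filter::RetagCategory and Recorder, the category_for rule, and the sample XML are all hypothetical, and it assumes title and category contain only text. Events inside a record are queued as closures; the category text is produced lazily when the queue is released, by which time the title is known.

```perl
use strict;
use warnings;
use XML::SAX::PurePerl;

{
    package XML::Filter::RetagCategory;    # hypothetical name
    use base qw(XML::SAX::Base);

    # Hypothetical rule: derive the category from the title.
    sub category_for {
        my ($title, $old) = @_;
        return $title =~ /\bPerl\b/ ? 'Perl' : $old;
    }

    sub start_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'record') {     # start collecting for this record
            $self->{queue} = [];
            @$self{qw(title category)} = ('', '');
        }
        $self->{current} = $el->{Name};
        return $self->SUPER::start_element($el) unless $self->{queue};
        push @{ $self->{queue} }, sub { $self->SUPER::start_element($el) };
    }

    sub characters {
        my ($self, $chars) = @_;
        return $self->SUPER::characters($chars) unless $self->{queue};
        if ($self->{current} eq 'category') {
            $self->{category} .= $chars->{Data};   # held back, re-emitted later
            return;
        }
        $self->{title} .= $chars->{Data} if $self->{current} eq 'title';
        push @{ $self->{queue} }, sub { $self->SUPER::characters($chars) };
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{current} = '';
        return $self->SUPER::end_element($el) unless $self->{queue};
        if ($el->{Name} eq 'category') {
            # The text is only looked up when the queue is replayed.
            push @{ $self->{queue} },
                sub { $self->SUPER::characters({ Data => $self->{new_category} }) };
        }
        push @{ $self->{queue} }, sub { $self->SUPER::end_element($el) };
        if ($el->{Name} eq 'record') {
            $self->{new_category} = category_for($self->{title}, $self->{category});
            my $queue = delete $self->{queue};     # stop collecting
            $_->() for @$queue;                    # release events in original order
        }
    }
}

{
    package Recorder;    # hypothetical stand-in for a real XML writer (no escaping)
    sub new           { bless { out => '' }, shift }
    sub start_element { $_[0]{out} .= "<$_[1]{Name}>" }
    sub characters    { $_[0]{out} .= $_[1]{Data} }
    sub end_element   { $_[0]{out} .= "</$_[1]{Name}>" }
}

my $xml = '<records>'
        . '<record><category>Programming</category><title>Learning Perl</title></record>'
        . '<record><category>Programming</category><title>Learning Ruby</title></record>'
        . '</records>';

my $recorder = Recorder->new;
my $filter   = XML::Filter::RetagCategory->new(Handler => $recorder);
XML::SAX::PurePerl->new(Handler => $filter)->parse_string($xml);
print $recorder->{out}, "\n";
```

      Only one record's worth of events is held at a time, so memory use should stay proportional to the largest record rather than to the whole file. In a real pipeline the Recorder would be replaced by a proper writer such as XML::SAX::Writer.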
Re: Modifying Records with XML::SAX::ByRecord
by trwww (Priest) on Aug 29, 2007 at 20:36 UTC

    Hello,

    I have a large XML document... I need to change the category element of the record depending on certain aspects of the title element. I was thinking about using XML::SAX::ByRecord for this.

    I don't think ByRecord is applicable. I think I'd only want to use it when I want to do something with the entire record. You only want to modify a single node in a record, albeit based on the value of another node in the record.

    My guess is once I see the category element I have to collect data until I come across the title element, modify the category and then somehow "release" all the stuff that I have collected.

    Basically. Your code shouldn't take into account what order the record properties are in, because that's against the rules of XML.

    What I would do, translated from SAX to English, is buffer the category and title nodes until the record's end_element is fired, temporarily removing them from the SAX stream. Then when the record's end_element callback is fired, I'd do the transformation and add the title and category back to the SAX stream.

    I'm not sure how this is done in an XML::SAX filter.

    Let's look at an example. Let's say the specification is to split the title on spaces and append the resulting words to the category as a comma-separated string. So a node that looks like this:

    <record>
      <title>Learning Perl</title>
      <category>Programming</category>
    </record>

    Will, after it comes out the other side of the SAX stream, look like this:

    <record>
      <title>Learning Perl</title>
      <category>Programming,Learning,Perl</category>
    </record>

    The following is a self-contained program that does just that. More specifically, it reads the XML in the __DATA__ section, runs it through the custom XML::Filter::Category module, and prints the result to STDOUT:

    use warnings;
    use strict;

    use XML::SAX::Machines qw(Pipeline);

    # XML::Filter::BufferText merges adjacent characters events, so each
    # element's text reaches our filter as a single event.
    my $m = Pipeline(
        'XML::Filter::BufferText' =>
        'XML::Filter::Category'   =>
        \*STDOUT
    );

    $m->parse_file( *DATA );

    package XML::Filter::Category;

    use base qw(XML::SAX::Base);

    # <marker language="foo" />
    # $el->{Name} == 'marker'
    # $el->{Attributes}{'{}language'} == language attribute
    # $el->{Attributes}{'{}language'}{Value} == 'foo'

    sub start_element {
        my ($self, $el) = @_;

        if ( $el->{Name} eq 'title' ) {
            $self->{record}{title}{start_element} = $el;
            $self->{in_title} = 1;
        }
        elsif ( $el->{Name} eq 'category' ) {
            $self->{record}{category}{start_element} = $el;
            $self->{in_category} = 1;
        }
        else {
            # go ahead and forward downstream
            $self->SUPER::start_element( $el );
        }
    }

    sub characters {
        my ($self, $chars) = @_;

        if ( $self->{in_title} ) {
            $self->{record}{title}{characters} = $chars;
        }
        elsif ( $self->{in_category} ) {
            $self->{record}{category}{characters} = $chars;
        }
        else {
            # go ahead and forward downstream
            $self->SUPER::characters( $chars );
        }
    }

    sub end_element {
        my ($self, $el) = @_;

        if ( $el->{Name} eq 'title' ) {
            $self->{record}{title}{end_element} = $el;
            $self->{in_title} = 0;
        }
        elsif ( $el->{Name} eq 'category' ) {
            $self->{record}{category}{end_element} = $el;
            $self->{in_category} = 0;
        }
        elsif ( $el->{Name} eq 'record' ) {
            # transform category
            my $r = $self->{record};
            my @extra_cats = split(' ', $r->{title}{characters}{Data});
            $r->{category}{characters}{Data} .= ',' . join(',', @extra_cats);

            # Re-emit the buffered nodes just before </record>. Note that
            # each() walks the hash in no particular order, so the relative
            # order of title and category is not preserved.
            while ( my ( $node, $data ) = each( %{ $self->{record} } ) ) {
                $self->SUPER::start_element( $data->{start_element} );
                $self->SUPER::characters( $data->{characters} );
                $self->SUPER::end_element( $data->{end_element} );
            }
            delete $self->{record};    # don't leak data into the next record

            $self->SUPER::end_element( $el );
        }
        else {
            # go ahead and forward downstream
            $self->SUPER::end_element( $el );
        }
    }

    package main;

    __DATA__
    <records>
      <record>
        <title>Learning Perl</title>
        <category>Programming</category>
        <publisher>O&apos;Reilly</publisher>
        <url>http://www.oreilly.com/catalog/learnperl4/</url>
      </record>
      <record>
        <publisher>O&apos;Reilly</publisher>
        <category>Programming</category>
        <title>Learning Ruby</title>
        <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
      </record>
      <record>
        <title>Learning Python</title>
        <publisher>O&apos;Reilly</publisher>
        <category>Programming</category>
        <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
      </record>
    </records>

    The main thing to note here is that the custom filter is a subclass of XML::SAX::Base. This lets us implement only the callbacks we need, while the base class provides defaults for the others. It also gives us a first-class object in which to store arbitrary data between callbacks.

    The start_element and characters callbacks store node information about the title and category nodes. Notice how the SUPER:: method is not called for these elements. This effectively removes this information from the SAX stream.

    The end_element callback does the same thing as the other two when dealing with the title and category nodes. It also provides the functionality I describe in the specification when dealing with the record node's end tag.

    Note that the properties in the XML document are not in the same order for each record. Since we do not deal with the title and category until the end_element callback for the record, the functionality is not affected.

    Let's run the program:

    $ perl modify_category.pl
    <?xml version='1.0'?>
    <records>
      <record>
        <publisher>O&apos;Reilly</publisher>
        <url>http://www.oreilly.com/catalog/learnperl4/</url>
        <category>Programming,Learning,Perl</category>
        <title>Learning Perl</title></record>
      <record>
        <publisher>O&apos;Reilly</publisher>
        <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
        <category>Programming,Learning,Ruby</category>
        <title>Learning Ruby</title></record>
      <record>
        <publisher>O&apos;Reilly</publisher>
        <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
        <category>Programming,Learning,Python</category>
        <title>Learning Python</title></record>
    </records>

    Finally, it should be pretty fast for you. I've processed files that were tens of gigabytes in size doing the same kinds of things, and it performed quite nicely.

    Regards,

    trwww

      I don't know whether it's against the rules of XML (I seriously doubt it!) but I've met way too many cases where the other party insisted that the tags be in a specific order, so I would be very wary of changing the order. I'd suggest XML::Twig or XML::Rules for this and processing the whole records. If there are hundreds of thousands of them, I do think it's safe to assume the individual records are rather small.
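
      As a rough illustration of the XML::Twig route, here is a minimal sketch that handles one whole record at a time and leaves the element order untouched. The sample XML and the rule (set the category to "Perl" when the title mentions Perl) are assumptions made up for the example; twig_handlers, first_child, text, set_text and flush are standard XML::Twig calls.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; a real run would use parsefile() on the big document.
my $xml = <<'XML';
<records>
  <record><category>Programming</category><title>Learning Perl</title></record>
  <record><category>Programming</category><title>Learning Ruby</title></record>
</records>
XML

# Collect the output in a string here; pass \*STDOUT or a file handle instead
# when processing a large file.
my $output = '';
open my $out, '>', \$output or die "cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per record, after the whole record has been parsed.
        record => sub {
            my ($t, $record) = @_;
            my $title    = $record->first_child('title');
            my $category = $record->first_child('category');

            # Hypothetical rule: re-tag the record based on its title.
            if ($title && $category && $title->text =~ /\bPerl\b/) {
                $category->set_text('Perl');
            }

            # Print everything parsed so far and release its memory.
            $t->flush($out);
        },
    },
);

$twig->parse($xml);
$twig->flush($out);    # emit whatever remains after the last record
close $out;

print $output;
```

      Because each record is flushed as soon as its handler returns, only one record should be held in memory at a time, which fits the assumption that individual records are small.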

        Hi Jenda

        I don't know whether it's against the rules of XML (I seriously doubt it!)

        I was speaking out of turn. I'm not sure either. It is easy to check by trying to define it in either a document declaration or an XML schema. If that is possible, it is legal.

        but I've met way too many cases when the other party insisted that the tags are in a specific order so I would be very wary of changing the order.

        An artificial limitation with things like XPath. But I can see how it happens, too.

        If there are hundreds of thousands of them, I do think it's safe to assume the individual records are rather small.

        My experience is different :-) But yes, if I can assume the records are smallish, your advice is better in many ways (a lot more code is already written for you). I should have mentioned that.

        trwww

      Thank you for your enlightening example!
      In my particular case however, the order of the elements is indeed significant since it's dictated by both the XML Schema and the DTD.