Modifying Records with XML::SAX::ByRecord

Lorphos has asked for the wisdom of the Perl Monks concerning the following question:

Re: Modifying Records with XML::SAX::ByRecord
by samtregar (Abbot) on Aug 29, 2007 at 17:32 UTC

XML::Validator::Schema

  my $category_stack = $self->{category_stack} ||= [];
  push @$category_stack, 
    { some => $category, data => $here };

Then when you see a title and you're ready to finish work on the category you just pop it off:

  my $category_stack = $self->{category_stack};
  my $category_data = pop @$category_stack;

Using a stack allows categories to nest. Popping off the category data means the stack is conservative and will shrink as stored data is used. You definitely want to avoid building a data structure that can contain all the category data in your file.
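
A minimal sketch of the push/pop pattern above, using only core Perl. The event list, the `name` field and the `category`/`title` event types are made up for illustration; in a real filter the push and the pop would sit in the SAX callbacks instead of a loop.

```perl
use strict;
use warnings;

# Hypothetical event stream standing in for SAX callbacks.
my @events = (
    [ category => 'Books' ],
    [ category => 'Programming' ],   # nested inside 'Books'
    [ title    => 'Learning Perl' ],
    [ title    => 'A Book About Books' ],
);

my $self = {};      # stands in for the filter object
my @finished;       # what a real filter would forward downstream

for my $event (@events) {
    my ($type, $value) = @$event;

    if ($type eq 'category') {
        # Seeing a category: remember it until we can finish it.
        my $category_stack = $self->{category_stack} ||= [];
        push @$category_stack, { name => $value };
    }
    elsif ($type eq 'title') {
        # Seeing a title: finish the innermost open category.
        my $category_stack = $self->{category_stack} || [];
        my $category_data  = pop @$category_stack or next;
        push @finished, "$category_data->{name} <- $value";
    }
}

print "$_\n" for @finished;
printf "left on stack: %d\n", scalar @{ $self->{category_stack} };
```

The innermost category is finished first, and the stack is empty again once every stored category has been used, so memory use follows nesting depth and not file size.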

Does that make sense?

-sam

Re^2: Modifying Records with XML::SAX::ByRecord

by Lorphos (Novice) on Sep 03, 2007 at 08:55 UTC

    if ($collecting) {
        push @event_queue, sub {
            $self->SUPER::start_element($xml_data);
        };
    }

    foreach (@event_queue) {
        &$_();
    }

Re: Modifying Records with XML::SAX::ByRecord
by trwww (Priest) on Aug 29, 2007 at 20:36 UTC

Hello,

I have a large XML document... I need to change the category element of the record depending on certain aspects of the title element. I was thinking about using XML::SAX::ByRecord for this.

I don't think ByRecord is applicable. I think I'd only want to use it when I want to do something with the entire record. You only want to modify a single node in a record, albeit based on the value of another node in the record.

My guess is once I see the category element I have to collect data until I come across the title element, modify the category and then somehow "release" all the stuff that I have collected.

Basically. Your code shouldn't take into account what order the record properties are in because that's against the rules of XML.

What I would do, translated from SAX to English, is buffer the category and title nodes until the record's end_element is fired, temporarily removing them from the SAX stream. Then when the record's end_element callback is fired, I'd do the transformation and add the title and category back to the SAX stream.

I'm not sure how this is done in an XML::SAX filter.

Let's look at an example. Let's say the specification is to split the title on spaces and append the words to the category as a comma-separated string. So a node that looks like this:

  <record>
    <title>Learning Perl</title>
    <category>Programming</category>
  </record>

Will, after it comes out the other side of the SAX stream, look like this:

  <record>
    <title>Learning Perl</title>
    <category>Programming,Learning,Perl</category>
  </record>

The following is a self-contained program that does just that. More specifically, it reads the XML in the __DATA__ section, runs it through the custom XML::Filter::Category module, and prints the result to STDOUT:

use warnings;
use strict;

use XML::SAX::Machines qw(Pipeline);

my $m = Pipeline(
  'XML::Filter::BufferText' =>
  'XML::Filter::Category'   =>
  \*STDOUT
);

$m->parse_file( *DATA );

package XML::Filter::Category;
use base qw(XML::SAX::Base);

# <marker language="foo" />
# $el->{Name} == 'marker'
# $el->{Attributes}{'{}language'} == language attribute
# $el->{Attributes}{'{}language'}{Value} == 'foo'

sub start_element {
  my($self, $el) = @_;
  if ( $el->{Name} eq 'title' ) {
    $self->{record}{title}{start_element} = $el;
    $self->{in_title} = 1;
  } elsif ( $el->{Name} eq 'category' ) {
    $self->{record}{category}{start_element} = $el;
    $self->{in_category} = 1;
  } else { # go ahead and forward downstream
    $self->SUPER::start_element( $el );
  }
}

sub characters {
  my ($self, $chars) = @_;
  if ( $self->{in_title} ) {
    $self->{record}{title}{characters} = $chars;
  } elsif ( $self->{in_category} ) {
    $self->{record}{category}{characters} = $chars;
  } else { # go ahead and forward downstream
    $self->SUPER::characters( $chars );
  }
}

sub end_element {
  my($self, $el) = @_;

  if ( $el->{Name} eq 'title' ) {
    $self->{record}{title}{end_element} = $el;
    $self->{in_title} = 0;
  } elsif ( $el->{Name} eq 'category' ) {
    $self->{record}{category}{end_element} = $el;
    $self->{in_category} = 0;
  } elsif ( $el->{Name} eq 'record' ) { # transform category
    # take the buffered nodes and reset the buffer, so nothing
    # leaks into the next record if it lacks a title or category
    my $r = delete $self->{record} || {};

    if ( $r->{title} && $r->{category} ) {
      my @extra_cats = split(' ', $r->{title}{characters}{Data});
      $r->{category}{characters}{Data} .= ',' . join(',', @extra_cats);
    }

    # hash order is arbitrary, so title and category may swap places
    while( my( $node, $data ) = each( %$r ) ) {
      $self->SUPER::start_element( $data->{start_element} );
      $self->SUPER::characters( $data->{characters} ) if $data->{characters};
      $self->SUPER::end_element( $data->{end_element} );
    }

    $self->SUPER::end_element( $el );
  } else { # go ahead and forward downstream
    $self->SUPER::end_element( $el );
  }
}

package main;

__DATA__
<records>
  <record>
    <title>Learning Perl</title>
    <category>Programming</category>
    <publisher>O&apos;Reilly</publisher>
    <url>http://www.oreilly.com/catalog/learnperl4/</url>
  </record>
  <record>
    <publisher>O&apos;Reilly</publisher>
    <category>Programming</category>
    <title>Learning Ruby</title>
    <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
  </record>
  <record>
    <title>Learning Python</title>
    <publisher>O&apos;Reilly</publisher>
    <category>Programming</category>
    <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
  </record>
</records>

The main thing to note here is that the custom filter is a subclass of XML::SAX::Base. This lets us add only the callbacks we need and provides defaults for the others. It also gives us a first-class object in which to store arbitrary data between callbacks.

The start_element and characters callbacks store node information about the title and category nodes. Notice how the SUPER:: method is not called for these elements. This effectively removes them from the SAX stream.

The end_element callback does the same thing as the other two when dealing with the title and category nodes. It also provides the functionality I describe in the specification when dealing with the record node's end tag.

Note that the properties are not in the same order in each record of the XML document, but since we do not deal with the specified properties until the end_element callback for the record, the functionality is not affected.

Let's run the program:

$ perl modify_category.pl
<?xml version='1.0'?>
<records>
  <record>
    <publisher>O&apos;Reilly</publisher>
    <url>http://www.oreilly.com/catalog/learnperl4/</url>
    <category>Programming,Learning,Perl</category>
    <title>Learning Perl</title></record>
  <record>
    <publisher>O&apos;Reilly</publisher>
    <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
    <category>Programming,Learning,Ruby</category>
    <title>Learning Ruby</title></record>
  <record>
    <publisher>O&apos;Reilly</publisher>
    <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
    <category>Programming,Learning,Python</category>
    <title>Learning Python</title></record>
</records>
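
In the output above, title and category end up at the end of each record, and which of the two comes first depends on Perl's hash ordering, because the filter replays them with each. If the order of those two matters to whoever consumes the file, one option is to queue the buffered nodes in an array instead. A rough sketch under some assumptions: the package names My::OrderedBuffer and My::Recorder are made up, the events are fed in by hand and do not come from a parser, and the buffered nodes are still moved to the end of the record.

```perl
use strict;
use warnings;

package My::OrderedBuffer;    # hypothetical filter name
use base qw(XML::SAX::Base);

my %buffered = map { $_ => 1 } qw(title category);

sub start_element {
    my ($self, $el) = @_;
    if ($buffered{ $el->{Name} }) {
        # Queue the node in arrival order; do not forward it yet.
        push @{ $self->{queue} }, { start => $el, text => '' };
        $self->{current} = $self->{queue}[-1];
        return;
    }
    $self->SUPER::start_element($el);
}

sub characters {
    my ($self, $chars) = @_;
    if (my $node = $self->{current}) {
        $node->{text} .= $chars->{Data};
        return;
    }
    $self->SUPER::characters($chars);
}

sub end_element {
    my ($self, $el) = @_;
    if ($buffered{ $el->{Name} }) {
        $self->{current}{end} = $el;
        $self->{current} = undef;
        return;
    }
    if ($el->{Name} eq 'record') {
        # Take the queue and reset it for the next record.
        my $queue = delete $self->{queue} || [];
        my ($title) = grep { $_->{start}{Name} eq 'title' } @$queue;
        for my $node (@$queue) {
            if ($title && $node->{start}{Name} eq 'category') {
                $node->{text} = join ',', $node->{text}, split ' ', $title->{text};
            }
            $self->SUPER::start_element($node->{start});
            $self->SUPER::characters({ Data => $node->{text} });
            $self->SUPER::end_element($node->{end});
        }
    }
    $self->SUPER::end_element($el);
}

package My::Recorder;    # hypothetical handler that just logs events
sub new           { bless { log => [] }, shift }
sub start_element { push @{ $_[0]{log} }, "<$_[1]{Name}>" }
sub characters    { push @{ $_[0]{log} }, $_[1]{Data} }
sub end_element   { push @{ $_[0]{log} }, "</$_[1]{Name}>" }

package main;

my $recorder = My::Recorder->new;
my $filter   = My::OrderedBuffer->new(Handler => $recorder);

# Feed one record by hand, category first.
$filter->start_element({ Name => 'record' });
for my $pair ([ category => 'Programming' ], [ title => 'Learning Ruby' ]) {
    my ($name, $text) = @$pair;
    $filter->start_element({ Name => $name });
    $filter->characters({ Data => $text });
    $filter->end_element({ Name => $name });
}
$filter->end_element({ Name => 'record' });

my $result = join '', @{ $recorder->{log} };
print "$result\n";
```

Because the queue is an array, category and title are replayed in the order they arrived, whatever the hash ordering is. Accumulating text with .= also means the sketch does not rely on the text arriving as a single characters event.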

Finally, it should be pretty fast for you. I've processed files that were tens of gigabytes in size doing the same types of things, and it performed quite nicely.

Regards,

trwww

Re^2: Modifying Records with XML::SAX::ByRecord

by Jenda (Abbot) on Aug 31, 2007 at 00:23 UTC

I don't know whether it's against the rules of XML (I seriously doubt it!) but I've met way too many cases where the other party insisted that the tags are in a specific order, so I would be very wary of changing the order. I'd suggest XML::Twig or XML::Rules for this and processing the whole records. If there are hundreds of thousands of them I do think it's safe to assume the individual records are rather small.
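
For comparison, a rough sketch of the whole-record approach with XML::Twig, assuming each record is small enough to hold in memory on its own. The sample data and the in-memory output handle are only there for illustration; the category rule is the one trwww used above.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample data, with category before title.
my $xml = <<'XML';
<records>
  <record>
    <publisher>Example Press</publisher>
    <category>Programming</category>
    <title>Learning Ruby</title>
  </record>
</records>
XML

# Collect the output in a string; a real script would flush to a file or STDOUT.
open my $out, '>', \my $result or die "cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per complete <record> element.
        record => sub {
            my ($t, $record) = @_;
            my $title    = $record->first_child('title');
            my $category = $record->first_child('category');
            if ($title && $category) {
                my @words = split ' ', $title->text;
                $category->set_text(join ',', $category->text, @words);
            }
            # Write out what has been parsed so far and free the memory.
            $t->flush($out);
        },
    },
);

$twig->parse($xml);
$twig->flush($out);    # emit whatever is left after the last record
close $out;

print $result;
```

Since the record is modified in place and then flushed, the child elements keep the order they had in the input.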

Jenda
Support Denmark!
Defend the free world!

Re^3: Modifying Records with XML::SAX::ByRecord

by trwww (Priest) on Aug 31, 2007 at 03:51 UTC

Hi Jenda

I don't know whether it's against the rules of XML (I seriously doubt it!)

I was speaking out of turn. I'm not sure either. It is easy to prove by defining it in either a document declaration or an XML schema. If that is possible, it is legal.

but I've met way too many cases when the other party insisted that the tags are in a specific order so I would be very wary of changing the order.

An artificial limitation with things like XPath. But I can see how it happens, too.

If there are hundreds of thousands of them I do think it's safe to assume the individual records are rather small.

My experience is different :-) But yes if I can assume the records are smallish, your advice is better in many ways (a lot more code is already written for you). I should have mentioned that.