in reply to Modifying Records with XML::SAX::ByRecord
Hello,
I have a large XML document... I need to change the category element of the record depending on certain aspects of the title element. I was thinking about using XML::SAX::ByRecord for this.
I don't think ByRecord is applicable. I think I'd only want to use it when I want to do something with the entire record. You only want to modify a single node in a record, albeit based on the value of another node in the record.
My guess is once I see the category element I have to collect data until I come across the title element, modify the category and then somehow "release" all the stuff that I have collected.
Basically. Your code shouldn't depend on the order of the properties within a record, since nothing about the document guarantees that order.
What I would do, translated from SAX to English, is buffer the category and title nodes until the record's end_element is fired, temporarily removing them from the SAX stream. Then when the record's end_element callback is fired, I'd do the transformation and add the title and category back to the SAX stream.
I'm not sure how this is done in a XML::SAX filter.
Let's look at an example. Let's say the specification is to split the title on spaces and append the resulting words to the category as a comma-separated string. So a node that looks like this:

    <record>
      <title>Learning Perl</title>
      <category>Programming</category>
    </record>
Will, after it comes out the other side of the SAX stream, look like this:

    <record>
      <title>Learning Perl</title>
      <category>Programming,Learning,Perl</category>
    </record>
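Stripped of the SAX machinery, the transformation itself is just a split and a join. Here is a minimal sketch on plain strings; the variable names are only for illustration and are not part of the filter:

```perl
use strict;
use warnings;

# Illustrative values taken from the example record above.
my $title    = 'Learning Perl';
my $category = 'Programming';

# split ' ' splits on runs of whitespace and skips leading whitespace.
my @extra_cats   = split ' ', $title;
my $new_category = join ',', $category, @extra_cats;

print "$new_category\n";    # prints "Programming,Learning,Perl"
```

Everything else in the filter exists only to get hold of those two strings at the same time.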
The following is a self-contained program that does just that. More specifically, it reads the XML in the __DATA__ section, runs it through XML::Filter::BufferText and the custom XML::Filter::Category module, and prints the result to STDOUT. BufferText merges adjacent characters events into one, which matters here because the characters callback stores the text rather than appending to it:

    use warnings;
    use strict;

    use XML::SAX::Machines qw(Pipeline);

    my $m = Pipeline(
        'XML::Filter::BufferText' =>
        'XML::Filter::Category' =>
        \*STDOUT
    );

    $m->parse_file( *DATA );

    package XML::Filter::Category;

    use base qw(XML::SAX::Base);

    # <marker language="foo" />
    # $el->{Name}                             eq 'marker'
    # $el->{Attributes}{'{}language'}         is the language attribute
    # $el->{Attributes}{'{}language'}{Value}  eq 'foo'

    sub start_element {
        my ($self, $el) = @_;

        if ( $el->{Name} eq 'title' ) {
            # buffer instead of forwarding
            $self->{record}{title}{start_element} = $el;
            $self->{in_title} = 1;
        }
        elsif ( $el->{Name} eq 'category' ) {
            $self->{record}{category}{start_element} = $el;
            $self->{in_category} = 1;
        }
        else {
            # pass the event along to the next handler
            $self->SUPER::start_element( $el );
        }
    }

    sub characters {
        my ($self, $chars) = @_;

        if ( $self->{in_title} ) {
            $self->{record}{title}{characters} = $chars;
        }
        elsif ( $self->{in_category} ) {
            $self->{record}{category}{characters} = $chars;
        }
        else {
            # pass the event along to the next handler
            $self->SUPER::characters( $chars );
        }
    }

    sub end_element {
        my ($self, $el) = @_;

        if ( $el->{Name} eq 'title' ) {
            $self->{record}{title}{end_element} = $el;
            $self->{in_title} = 0;
        }
        elsif ( $el->{Name} eq 'category' ) {
            $self->{record}{category}{end_element} = $el;
            $self->{in_category} = 0;
        }
        elsif ( $el->{Name} eq 'record' ) {
            # transform category
            my $r = $self->{record};
            my @extra_cats = split(' ', $r->{title}{characters}{Data});
            $r->{category}{characters}{Data} .= ',' . join(',', @extra_cats);

            # put the buffered nodes back into the stream; sorting the
            # keys keeps the output order stable from run to run
            for my $node ( sort keys %$r ) {
                my $data = $r->{$node};
                $self->SUPER::start_element( $data->{start_element} );
                $self->SUPER::characters( $data->{characters} );
                $self->SUPER::end_element( $data->{end_element} );
            }

            # reset the buffer so nothing leaks into the next record
            $self->{record} = {};

            $self->SUPER::end_element( $el );
        }
        else {
            # pass the event along to the next handler
            $self->SUPER::end_element( $el );
        }
    }

    package main;

    __DATA__
    <records>
      <record>
        <title>Learning Perl</title>
        <category>Programming</category>
        <publisher>O'Reilly</publisher>
        <url>http://www.oreilly.com/catalog/learnperl4/</url>
      </record>
      <record>
        <publisher>O'Reilly</publisher>
        <category>Programming</category>
        <title>Learning Ruby</title>
        <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
      </record>
      <record>
        <title>Learning Python</title>
        <publisher>O'Reilly</publisher>
        <category>Programming</category>
        <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
      </record>
    </records>
The main thing to note here is that the custom filter is a subclass of XML::SAX::Base. This lets us implement only the callbacks we need, while the base class supplies defaults for the others. It also gives us an object in which to store arbitrary data between callbacks.
The start_element and characters callbacks store node information about the title and category nodes. Notice that SUPER:: is not called for these elements. This effectively removes them from the SAX stream.
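To see the "no SUPER:: call means the event disappears" idea in isolation, here is a small sketch that fires events at a filter by hand rather than parsing a document. My::Recorder and My::DropTitle are hypothetical names made up for this illustration, and the only library assumption is that XML::SAX::Base is installed:

```perl
use strict;
use warnings;

# Hypothetical downstream handler: records each event it receives.
package My::Recorder;

sub new {
    my $class = shift;
    return bless { events => [] }, $class;
}
sub start_element { my ($self, $el) = @_; push @{ $self->{events} }, "start:$el->{Name}" }
sub characters    { my ($self, $ch) = @_; push @{ $self->{events} }, "chars:$ch->{Data}" }
sub end_element   { my ($self, $el) = @_; push @{ $self->{events} }, "end:$el->{Name}" }

# Hypothetical filter: swallows <title> and the text inside it.
package My::DropTitle;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $el) = @_;
    if ( $el->{Name} eq 'title' ) { $self->{in_title} = 1; return; }
    $self->SUPER::start_element($el);    # everything else passes through
}

sub characters {
    my ($self, $chars) = @_;
    return if $self->{in_title};         # no SUPER:: call, so the text is dropped
    $self->SUPER::characters($chars);
}

sub end_element {
    my ($self, $el) = @_;
    if ( $el->{Name} eq 'title' ) { $self->{in_title} = 0; return; }
    $self->SUPER::end_element($el);
}

package main;

my $rec    = My::Recorder->new;
my $filter = My::DropTitle->new( Handler => $rec );

# Fire a few events by hand instead of parsing a document.
$filter->start_element( { Name => 'record' } );
$filter->start_element( { Name => 'title' } );
$filter->characters(    { Data => 'Learning Perl' } );
$filter->end_element(   { Name => 'title' } );
$filter->start_element( { Name => 'url' } );
$filter->characters(    { Data => 'u' } );
$filter->end_element(   { Name => 'url' } );
$filter->end_element(   { Name => 'record' } );

my $seen = join ' ', @{ $rec->{events} };
print "$seen\n";
```

The recorder should never hear about the title element, while the record and url events arrive untouched. The real filter does the same thing, except that it keeps the swallowed events around so it can replay them later.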
The end_element callback does the same thing as the other two when dealing with the title and category nodes. It also provides the functionality I described in the specification when dealing with the record node's end tag.
Note that the properties are not in the same order in each record of the XML document. Since we do not touch the buffered properties until the record's end_element callback, the functionality is not affected.
Let's run the program:
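The reason order stops mattering can be sketched without any SAX at all: collect every child first, then transform once the record is complete. The helper name new_category below is hypothetical and only stands in for what the filter does at the record's end tag:

```perl
use strict;
use warnings;

# Hypothetical helper: takes a record's children as [name, text] pairs,
# in whatever order they arrived, and returns the new category text.
sub new_category {
    my @children = @_;

    # Buffer everything first, the way the filter buffers until </record>.
    my %seen;
    $seen{ $_->[0] } = $_->[1] for @children;

    # Only now, with the whole record in hand, do the transformation.
    return join ',', $seen{category}, split( ' ', $seen{title} );
}

my $title_first = new_category(
    [ title    => 'Learning Perl' ],
    [ category => 'Programming' ],
);
my $category_first = new_category(
    [ category => 'Programming' ],
    [ title    => 'Learning Perl' ],
);

print "$title_first\n$category_first\n";
```

Both calls return the same string, because nothing is computed until both values have been seen.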
    $ perl modify_category.pl
    <?xml version='1.0'?>
    <records>
      <record>
        <publisher>O'Reilly</publisher>
        <url>http://www.oreilly.com/catalog/learnperl4/</url>
        <category>Programming,Learning,Perl</category>
        <title>Learning Perl</title></record>
      <record>
        <publisher>O'Reilly</publisher>
        <url>http://www.oreilly.com/catalog/9780596529864/index.html</url>
        <category>Programming,Learning,Ruby</category>
        <title>Learning Ruby</title></record>
      <record>
        <publisher>O'Reilly</publisher>
        <url>http://www.oreilly.com/catalog/lpython2/index.html</url>
        <category>Programming,Learning,Python</category>
        <title>Learning Python</title></record>
    </records>

Finally, it should be pretty fast for you. I've processed files that were tens of gigabytes in size doing the same types of things, and it performed quite nicely.
Regards,
trwww