PerlMonks  

XML::Twig::flush() and html/xml entities

by mandarin (Hermit)
on Oct 05, 2006 at 12:20 UTC ( [id://576518]=perlquestion )

mandarin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to convert one xml file to another. I am using XML::Twig, version 3.23 and perl 5.8.3.
The input file contains HTML entities (i.e. &amp; in the example below) in some text elements.
I do not want them to be changed, but when the parent element is flushed they are converted to the character they represent (&), yielding an invalid XML file.
So the question is how to prevent XML::Twig from converting the entities.
Any hints?
Thanks in advance,
Martin

The script:
#!/usr/bin/perl

use strict;
use File::Basename;
use XML::Twig;

binmode(STDOUT, ":utf8");

my ($inFile)=@ARGV;

unless (open ("INFILE", "<:utf8", "$inFile")){
  die "$inFile: No such file or directory.";
}

my $t= XML::Twig->new(
  twig_handlers => {
    "identifier" => \&section,
    "record" => sub { $_[0]->flush; }
  },
);
$t->set_pretty_print("nice");
$t->set_keep_encoding;
$t->parsefile($inFile);
exit 0;

sub section {
  my( $t, $elt)= @_;
  my $elt_txt = $elt->text;
  my $new_elt;
  if ($elt_txt =~ /^http:\/\/arxiv.org/) {
    $new_elt = XML::Twig::Elt->new( $elt->tag . ".url" => $elt_txt);
  }
  elsif ( $elt_txt =~ /^doi:/ ) {
    $new_elt = XML::Twig::Elt->new( $elt->tag . '.doi' => $elt_txt);
  }
  $elt->replace_with($new_elt) if $new_elt;
}

and an example input file:
<?xml version="1.0" encoding="UTF-8"?>
<harvest>
  <record>
    <header>
      <datestamp>2005-09-18</datestamp>
      <setSpec>cs</setSpec>
    </header>
    <metadata>
      <title>Memory-Based Lexical Acquisition and Processing</title>
      <creator>Daelemans, Walter</creator>
      <subject>Computation &amp; Language</subject>
      <subject>Computer Science - Computation &amp; Language</subject>
      <description>Comment: 18 pages</description>
      <date>1994-05-16</date>
      <type>text</type>
      <identifier>http://arxiv.org/abs/cmp-lg/9405018</identifier>
      <identifier>Steffens (ed.) Machine Translation &amp; Lexion. Springer, 1995</identifier>
    </metadata>
  </record>
</harvest>

Replies are listed 'Best First'.
Re: XML::Twig::flush() and html/xml entities
by mirod (Canon) on Oct 05, 2006 at 12:40 UTC

    That's because you're using text to get the content of the element. From the docs:

    Return a string consisting of all the PCDATA and CDATA in an element, without any tags. The text is not XML-escaped: base entities such as & and < are not escaped.

    What you are looking for is xml_text ("Return the text of the element, encoded, without any tag") or xml_string (which returns the string for the entire element, excluding the element's own tags, but with nested element tags present).
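
    The difference between the three methods can be seen in a small sketch. The sample document and the variable names are made up for illustration; only text, xml_text and xml_string come from the discussion above.

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    # Parse a small in-memory document (illustrative data, not the poster's file).
    my $t = XML::Twig->new;
    $t->parse('<doc><subject>Computation &amp; <b>Language</b></subject></doc>');

    my $elt = $t->root->first_child('subject');

    # text: character data only, entity expanded and not escaped again
    my $text = $elt->text;
    # xml_text: character data only, but XML-escaped
    my $xml_text = $elt->xml_text;
    # xml_string: inner content of the element, nested tags included
    my $xml_string = $elt->xml_string;

    print "text:       $text\n";
    print "xml_text:   $xml_text\n";
    print "xml_string: $xml_string\n";
    ```

    Only the value from text contains a bare &; the other two keep &amp;, which matters if the string is later written out by hand rather than through the twig.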

      I changed the code in line 28 to both
      my $elt_txt = $elt->xml_text;
      and
      my $elt_txt = $elt->xml_string;
      (and left anything else alone)
      but neither worked; I still get the converted ampersand.
      Do I have to play around with the output_filter, too?
      I must admit that I didn't quite understand how that works.

      Martin

      update: Even text in tags never touched by the section
      function is undergoing changes.

        Ooops! That will teach me to test the code before answering. Is there any reason why you use the keep_encoding option? Without it the script works fine; with it, the & is indeed not escaped in the output. You should only use this option if you are dealing with non-UTF-8 encodings and want all the processing to be done in the original encoding, which doesn't seem to be your case.

Re: XML::Twig::flush() and html/xml entities
by Tanktalus (Canon) on Oct 05, 2006 at 15:24 UTC

    I'm going to seriously simplify your question. Here's the code:

    #!/usr/bin/perl

    use strict;
    use XML::Twig;

    binmode(STDOUT, ":utf8");

    my $t= XML::Twig->new();
    $t->set_keep_encoding;
    $t->parse(do { local $/; <DATA>});
    $t->flush;
    exit 0;

    __END__
    <?xml version="1.0" encoding="UTF-8"?>
    <harvest>
    <subject>Computation &amp; Language</subject>
    <subject>Computer Science - Computation &amp; Language</subject>
    </harvest>

    And here's the output:
    <?xml version="1.0" encoding="UTF-8"?>
    <harvest><subject>Computation & Language</subject><subject>Computer Science - Computation & Language</subject></harvest>

    And you want to change the &'s to &amp;'s. The solution seems to be to remove the call to set_keep_encoding. When I remove that, the output becomes what you want. Whether that's a bug in the keep-encoding or the flush or whatever, I don't know. Hopefully mirod can help here ;-)
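
    A minimal sketch of that fix, using the same kind of sample data as above: the set_keep_encoding call is simply left out. It uses sprint instead of flush only so that the serialized document ends up in a variable ($out, a name chosen here) where it can be inspected.

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    my $xml = '<?xml version="1.0" encoding="UTF-8"?>'
            . '<harvest><subject>Computation &amp; Language</subject></harvest>';

    # No set_keep_encoding call: the ampersand is escaped again on output.
    my $t = XML::Twig->new;
    $t->parse($xml);

    my $out = $t->sprint;    # the whole document as a string
    print $out, "\n";
    ```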

    Update: It appears I was a few minutes behind mirod on this. Oops. :-)

      Indeed, there is a bug in set_keep_encoding. If you pass the keep_encoding option to new instead, the code runs fine. I have to look at it; a naive attempt at fixing it generates a boatload of errors in the tests.
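
      For cases where keep_encoding really is needed, passing it to the constructor might look like the sketch below. The sample document is made up, and how the ampersand comes out may depend on the XML::Twig version, so the script reports what it finds rather than assuming a result.

      ```perl
      use strict;
      use warnings;
      use XML::Twig;

      my $xml = '<harvest><subject>Computation &amp; Language</subject></harvest>';

      # keep_encoding is passed to new() instead of calling set_keep_encoding later
      my $t = XML::Twig->new( keep_encoding => 1 );
      $t->parse($xml);

      my $out = $t->sprint;
      print $out =~ /&amp;/ ? "entity preserved\n" : "entity was expanded\n";
      ```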

      I can't remember exactly when and why I added the call to set_keep_encoding, but I think it was due to problems with the encoding of the output.
      Maybe those were fixed by adding the binmode(STDOUT,":utf8") line.
      I'm quite new to Perl, to XML and to UTF-8 alike, so development went somewhat on a trial-and-error basis ;-)
      Anyway, it works without set_keep_encoding. Fine :-)
      Thanks a lot to everyone who helped, especially you and mirod.
Re: XML::Twig::flush() and html/xml entities
by Hofmator (Curate) on Oct 05, 2006 at 12:35 UTC
    Take a look at the output_filter argument to the constructor, see XML::Twig, the value 'html' looks promising.

    Update: Apparently it only looked promising ... that's what I get for a quick look at the docs without actually trying it out, thanks for the correction, mirod!

    -- Hofmator

    Code written by Hofmator and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      Actually, if you get the content of the element using text, the output_filter will not be applied. So no, that won't work (see below what will work).
