dominic01 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to convert XML into JSON. I convert the XML into a hashref and then use JSON::Any to produce the JSON. I am stuck on the first part, i.e. XML to hashref. My source XML:

<Publisher>
  <UniqueDOI>978-3-642-123456</UniqueDOI>
  <ChapterInfo ChapterType="OriginalPaper">
    <Title Language="En">Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?</Title>
  </ChapterInfo>
</Publisher>

I tried with XML::Simple:

use XML::Simple;
use Data::Dumper;

my $XMLRef = XMLin('new.xml');
print Dumper $XMLRef;

Using XML::LibXML::Simple:

use XML::LibXML::Simple ();
use Data::Dumper;

my $xs = XML::LibXML::Simple->new(%options);
my $XMLRef = $xs->XMLin('new.xml');
print Dumper $XMLRef;

Using XML::Twig:

use XML::Twig;
use Data::Dumper;

my $twig = XML::Twig->new();
my $XMLRef = $twig->parsefile('new.xml')->simplify();
print Dumper $XMLRef;

Update: All the above scripts are NOT properly converting the child "Emphasis". The error part of the dump is more or less similar to

'ChapterTitle' => {
    'Emphasis' => {
        'Type'    => 'Italic',
        'content' => 'azzurro'
    },
    'Language' => 'En',
    'content'  => [
        'Is Light Blue (',
        ') Color Name Universal in the Italian Language?'
    ]
},

Is it possible to get something like this instead (leaving specific tags as they are)?

'ChapterTitle' => {
    'Language' => 'En',
    'content'  => [
        'Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?'
    ]
},

I appreciate any pointers in this regard.

Regards Dominic

Re: XML to HashRef and then to JSON
by Corion (Patriarch) on Mar 14, 2016 at 16:04 UTC

    In what way do all these examples convert the Emphasis tag wrongly?

    Update: Now that you've provided some examples, I highly doubt that any of the generic ways to parse your XML will know that "inner tags" are not really tags. The easiest way is likely to reconstruct your title tag by rolling up its child tags with it. Most modules provide a method such as ->as_XML (->toString in XML::LibXML, ->sprint in XML::Twig) which you can use to turn inner tags back into their string representation.

Re: XML to HashRef and then to JSON
by tangent (Parson) on Mar 15, 2016 at 01:18 UTC
    All the above scripts are NOT properly converting the child "Emphasis"
    That's not quite true because as far as the parser is concerned the "Emphasis" is a valid XML tag. You will have to do a bit of manual labour to achieve your desired output.

    I couldn't find a way to get the inner content of a node without getting the node's own tags as well, so I needed to use a regular expression to remove them. Hopefully this will get you on your way:

    use strict;
    use warnings;
    use Data::Dumper;
    use XML::LibXML;

    my $xml = q|
    <Publisher>
      <UniqueDOI>978-3-642-123456</UniqueDOI>
      <ChapterInfo ChapterType="OriginalPaper">
        <Title Language="En">Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?</Title>
      </ChapterInfo>
    </Publisher>
    |;

    my $doc = XML::LibXML->load_xml(string => $xml);
    my @Publishers = $doc->findnodes('//Publisher');

    for my $Publisher ( @Publishers ) {
        my ($ChapterInfo) = $Publisher->findnodes('ChapterInfo');
        my ($Title) = $ChapterInfo->findnodes('Title');

        # get the Title node as literal XML
        my $content = $Title->toString();
        print "Title content:\n$content\n";

        # remove first and last XML tags
        # (/s lets . match newlines, in case a title spans several lines)
        $content =~ s/^<[^>]*>(.*)<[^>]*>$/$1/s;

        # construct the hash reference
        my $hash = {
            UniqueDOI   => $Publisher->findvalue('UniqueDOI'),
            ChapterInfo => {
                ChapterType => $ChapterInfo->getAttribute('ChapterType'),
                Title       => {
                    Language => $Title->getAttribute('Language'),
                    content  => $content,
                },
            },
        };
        print Dumper($hash);
    }

    See XML::LibXML::Node for explanation of these methods.

    Output:

    Title content:
    <Title Language="En">Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?</Title>
    $VAR1 = {
        'UniqueDOI' => '978-3-642-123456',
        'ChapterInfo' => {
            'ChapterType' => 'OriginalPaper',
            'Title' => {
                'Language' => 'En',
                'content' => 'Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?'
            },
        },
    };

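    The regular-expression step can probably be avoided by serializing the Title node's children instead of the node itself. The following is only a minimal sketch, assuming a single Title element as the document root (the real document nests it under Publisher/ChapterInfo); it joins the toString output of every node returned by childNodes:

```perl
use strict;
use warnings;
use XML::LibXML;

# Reduced sample: only the Title element from the thread, used as the root.
my $xml = '<Title Language="En">Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?</Title>';

my $doc     = XML::LibXML->load_xml(string => $xml);
my ($Title) = $doc->findnodes('/Title');

# Serialize each child node (text and element alike) and join the pieces.
# The enclosing <Title> tag is never emitted, so nothing has to be stripped.
my $content = join '', map { $_->toString } $Title->childNodes;

print "$content\n";
```

    Because only the children are serialized, this does not depend on the title fitting on one line or on the shape of the outer tag.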
      I accept your answer. What I provided in my OP was just a sample; my real XML is big, so I manipulated the XML before converting it to a hashref:

      for my $TmpNode ($dom->findnodes('//Emphasis')) {
          my $tStr     = $TmpNode->toString(1);
          my $new_node = $dom->createTextNode($tStr);
          $TmpNode->replaceNode($new_node);
      }

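      For the remaining hashref-to-JSON step, a minimal sketch follows. It is an assumption in two ways: the hashref is written out by hand in the shape of the output shown above, and it uses the core module JSON::PP rather than the JSON::Any mentioned in the question.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical hashref, hand-written in the shape of the dump shown above.
my $hash = {
    UniqueDOI   => '978-3-642-123456',
    ChapterInfo => {
        ChapterType => 'OriginalPaper',
        Title       => {
            Language => 'En',
            content  => 'Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis>) Color Name Universal in the Italian Language?',
        },
    },
};

# canonical() sorts the keys so the output is stable between runs;
# pretty() adds indentation and newlines.
my $json = JSON::PP->new->canonical->pretty->encode($hash);

print $json;
```

      The double quotes inside the embedded Emphasis tag are escaped by the encoder, so the markup survives as an ordinary JSON string.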
Re: XML to HashRef and then to JSON
by LanX (Saint) on Mar 14, 2016 at 16:11 UTC
    I suppose your problem is that the emphasis should be "around" the second entry of content.

    In order to isolate the problem please show us a dump of the hashref too.°

    I suppose it's a question of specifying the XML format (DTD, Schema,...)

    But there are still edge cases where JSON is the wrong format though ...

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    ° never mind, I read JSON in the title and overlooked the fat commas => :)

      Update: I have provided the error part of the HashRef. Please note.
Re: XML to HashRef and then to JSON
by Jenda (Abbot) on Mar 17, 2016 at 17:30 UTC

    You can use XML::Rules and specify which tags you want to keep as text. Something like this:

    use strict;
    use XML::Rules;
    use Data::Dumper;

    my $parser = XML::Rules->new(
        stripspaces => 2,
        rules => {
            '_default'  => 'as array',
            'Publisher' => 'pass',
            'Emphasis'  => sub {
                my ($tag, $attr, $parser) = @_[0,1,4];
                return $parser->ToXML($tag, $attr);
            }
        }
    );

    print Dumper($parser->parse(\*DATA));

    __DATA__
    <Publisher>
      <UniqueDOI>978-3-642-123456</UniqueDOI>
      <ChapterInfo ChapterType="OriginalPaper">
        <Title Language="En">Is Light Blue (<Emphasis Type="Italic">azzurro</Emphasis> o bianco) Color Name Universal in the Italian Language?</Title>
      </ChapterInfo>
    </Publisher>

    You can specify a comma-separated list of tags in place of 'Emphasis', and if some other tags are not allowed to repeat, you may include 'their,names' => 'as is' in the rules hash.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.