chtaylo2 has asked for the wisdom of the Perl Monks concerning the following question:

I'm using XML::Simple.
I need to extract the child nodes inside a parent element.

The XML:
<HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT>
When I do a data dump of this I get:
$VAR1 = {
    'content' => [
        'Self-Archiving E-mail Messages in ',
        ' The following e-mail self-archiving',
        'create an Archive folder/.pst file. See ',
        ' 2003 .pst file Management for instruction',
        'your messages. Select your ',
        ' (This refers to your primary...'
    ],
    'NAME' => 'body',
    'sep' => [ {}, {}, {}, {} ],
    'key' => [ 'Outlook', 'Outlook', 'Files' ]
};


How can I extract all of this correctly so it reads:

Self-Archiving E-mail Messages in Outlook The following e-mail self-archiving create an Archive folder/.pst file. See Outlook 2003 .pst file Management for instruction your messages. Select your Files (This refers to your primary...

Thanks!!

Re: XML Parsing question
by eff_i_g (Curate) on Jul 08, 2010 at 17:02 UTC
    I'm a fan of XML::Twig.

    This...
    use XML::Twig;

    # Slurp the whole DATA section; local restores $/ afterwards
    my $data = do { local $/; <DATA> };

    my $XML = XML::Twig->new;
    $XML->parse($data);

    # Print the text of the root's children with the markup removed
    print $XML->root->children_text;

    __DATA__
    <HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT>
    ...outputs this:
    Self-Archiving E-mail Messages in Outlook The following e-mail self-archiving create an Archive folder/.pst file. See Outlook 2003 .pst file Management for instruction your messages. Select your Files (This refers to your primary...
      I like Twig as well, but I'm on an enterprise system, so I have to stick with XML::Simple. Any idea how to do it with that module?
        Nope.

        Perhaps this horribly evil concoction?
        use Data::Dumper;

        # Slurp the whole DATA section; local restores $/ afterwards
        my $data = do { local $/; <DATA> };

        # Grab every run of text that follows a tag, trim it,
        # and drop the chunks that were whitespace only
        my @text = grep { /^\S/ }
                   map  { s/^\s+|\s+$//g; $_ }
                   $data =~ />([^<>]++)/g;

        print Dumper(\@text);

        __DATA__
        <HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT>
Re: XML Parsing question
by graff (Chancellor) on Jul 08, 2010 at 21:35 UTC
    If you already have XML::Simple, then you probably already have XML::Parser as well (because the former typically depends on / uses the latter).

    So here's a one-liner that does what you want via XML::Parser:

    perl -MXML::Parser -e '$p=XML::Parser->new(Handlers=>{Char=>sub{print +"$_[1] "}}); $p->parsefile("filename.xml")'
    (Note that the quotes as shown are based on using a bash shell or equivalent -- not ms-dos/cmd.exe.)

    If the amount and types of white-space you get from that are not to your liking, you could either complicate the one-liner a little bit (add tr/ \n/ /s in the sub{}), or just pipe the output to another one-liner...

    (updated last paragraph to improve the "tr///" suggestion; to clarify, here's a "really tidy" version of the one-liner:

    perl -MXML::Parser -e '$p=XML::Parser->new(Handlers=> {Char=>sub{($_=$_[1])=~tr/ \n/ /s; s/^ +$//; print}}); $p->parsefile("filename.xml"); print "\n"'
    which puts all the visible text on one line, then adds a final line-feed.)
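    For anyone who would rather have this as a script than a one-liner, here is a minimal sketch of the same Char-handler idea. The helper name xml_text is made up for illustration, and the whitespace clean-up is done once at the end rather than inside the handler, which is a slightly different choice from the one-liner above.

```perl
use strict;
use warnings;
use XML::Parser;

# xml_text is a hypothetical helper name, not part of XML::Parser.
# It gathers all character data in document order and tidies the whitespace.
sub xml_text {
    my ($xml) = @_;
    my $text = '';

    my $parser = XML::Parser->new(
        Handlers => {
            # The Char handler receives the Expat object and a chunk of text;
            # one text node may arrive in several chunks, so just append.
            Char => sub { $text .= $_[1] },
        },
    );
    $parser->parse($xml);

    $text =~ s/\s+/ /g;      # collapse runs of whitespace to one space
    $text =~ s/^ | $//g;     # trim the ends
    return $text;
}

my $xml = <<'XML';
<HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT>
XML

print xml_text($xml), "\n";
```

    Because the text is collected in the order the parser sees it, the <key> contents land between the surrounding text chunks, which is exactly the ordering that the XML::Simple hash loses.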
Re: XML Parsing question
by deMize (Monk) on Jul 08, 2010 at 18:30 UTC
    With HTML::Strip
    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use HTML::Strip;

    main();

    sub main {
        my $xml = qq{ <HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT> };

        # Strip the tags and keep only the text
        my $hs   = HTML::Strip->new();
        my $text = $hs->parse($xml);
        print $text;
    }



    Demize
Re: XML Parsing question
by Jenda (Abbot) on Jul 11, 2010 at 20:04 UTC
    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            sep   => '== ', # same as sub { return ' ' }
            key   => sub { return " <b>$_[1]->{_content}</b> " },
            FIELD => 'content by NAME',
            HIT   => 'pass',
        },
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);

    __DATA__
    <HIT> <FIELD NAME="body"> <sep /> Self-Archiving E-mail Messages in <key>Outlook</key> The following e-mail self-archiving <sep /> create an Archive folder/.pst file. See <key>Outlook</key> 2003 .pst file Management for instruction <sep /> your messages. Select your <key>Files</key> (This refers to your primary... <sep /> </FIELD> </HIT>

    You'd have to add rules for any other tags that appear in your XML! XML::Rules is pure Perl, built on top of XML::Parser, so you will have no problem installing it in your home directory or alongside the script.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: XML Parsing question
by ikegami (Patriarch) on Jul 08, 2010 at 21:38 UTC
    Alternatives have been suggested, but you asked for an XML::Simple solution. Sorry, but XML::Simple is not able to do that.