I've been out of the loop for a long time, and I'm having a little trouble wrapping my head around XML parsing and namespaces. I'm trying to use XPathContext, as I've read I should, but I still have to spell out everything in findnodes(), and the context doesn't seem to have any effect at all: the script behaves the same if I comment it out. What am I doing wrong? Is this the correct approach to parsing the document in order to extract a single column of words from the table? The code below does work. I’m just not sure it’s the correct way to do it...

use strict;
use warnings;

use XML::LibXML;

use open ':std', ':encoding(UTF-16)';

use constant XML_WORD_COLUMN => 1;

my $filename = 'Concordance.xml';
open my $fh, '<', $filename
    or die "Can't open $filename: $!";
binmode $fh, ':raw'; # drop PerlIO layers on this handle
my $doc = XML::LibXML->load_xml(IO => $fh);

# ===> This doesn't matter <===
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( o    => "urn:schemas-microsoft-com:office:office"      );
$xpc->registerNs( x    => "urn:schemas-microsoft-com:office:excel"       );
$xpc->registerNs( ss   => "urn:schemas-microsoft-com:office:spreadsheet" );
$xpc->registerNs( html => "http://www.w3.org/TR/REC-html40"              );
$xpc->registerNs( def  => "urn:schemas-microsoft-com:office:spreadsheet" );

# findnodes() does not set $!, so it is left out of the error message
my $table = $xpc->findnodes(q{//ss:Worksheet[@ss:Name='Sheet 1']/ss:Table/ss:Row})
    or die "Can't find table in Worksheet 'Sheet 1'";

foreach my $row ($table->get_nodelist) {
    my $col_index = 1;
    foreach my $cell ($row->nonBlankChildNodes) {
        if ($col_index++ == XML_WORD_COLUMN) {
            my $d = $cell->find('./ss:Data');
            print $d->to_literal, "\n";
        }
    }
}

__END__

<?xml version="1.0" encoding="utf-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns="urn:schemas-microsoft-com:office:spreadsheet">
    <Worksheet ss:Name="Sheet 1">
        <Table>
            <Row>
                <Cell>
                    <Data ss:Type="String">Word</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Count</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">Aaron</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">330</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">Aaron’s</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">25</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">Abaddon</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">7</Data>
                </Cell>
            </Row>
            <!-- Blah Blah Blah -->
        </Table>
        <x:WorksheetOptions>
            <x:FreezePanes />
            <x:FrozenNoSplit />
            <x:SplitHorizontal>1</x:SplitHorizontal>
            <x:TopRowBottomPane>1</x:TopRowBottomPane>
            <x:ActivePane>2</x:ActivePane>
        </x:WorksheetOptions>
    </Worksheet>
</Workbook>
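
For comparison, here is a minimal sketch of the same extraction in which the XPathContext visibly matters. It rests on one point from the XML::LibXML documentation: prefixes declared in the queried document can be used directly in XPath when the context node is in scope of the declaration, which would explain why registerNs() calls that merely repeat the document's own prefixes appear to change nothing. The prefix `sheet` and the trimmed inline document are assumptions made for illustration, not part of the original script, and `Cell[1]` assumes the word column is always the first Cell element in each Row.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed inline stand-in for Concordance.xml (assumption: same namespaces,
# fewer rows), so the sketch runs without the real file.
my $xml = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet 1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Word</Data></Cell>
        <Cell><Data ss:Type="String">Count</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">Aaron</Data></Cell>
        <Cell><Data ss:Type="String">330</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# 'sheet' is a hypothetical prefix that the document itself never declares,
# so the XPath below resolves only because of this registerNs() call.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(sheet => 'urn:schemas-microsoft-com:office:spreadsheet');

# Select the Data element of the first Cell in every Row in one expression,
# instead of looping over child nodes and counting columns by hand.
my @words = map { $_->to_literal }
    $xpc->findnodes(
        q{//sheet:Worksheet[@sheet:Name='Sheet 1']/sheet:Table/sheet:Row/sheet:Cell[1]/sheet:Data}
    );

print "$_\n" for @words;
```

Registering a prefix of your own choosing keeps the XPath independent of whichever prefixes a particular file happens to declare, and it is also the only way to address elements that sit in a default (unprefixed) namespace when the document declares no prefix for it.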

In reply to XML Namespaces by simsrw73
