Your use of the 'name' attribute in XHTML-ish data is misleading: names should be unique (and the name attribute is deprecated in XHTML). If you have any control over the data, using the 'class' attribute would be cleaner.

Also, XML::XPath is NOT a good module to use. It's slow and, more importantly, it is not actively maintained. As mentioned above, XML::LibXML is a much better option, and the code will be very similar.
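To give an idea of how similar the XML::LibXML version would be, here is a minimal sketch. The sample document is made up (the original question came without data), the sub name pair_lines is only for illustration, and it pairs the values per 'w' element rather than per id. It also loads the whole document in memory, so it shows the XPath calls, not a way to handle a very large file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# hypothetical sample data, shaped to match the XPath expressions used in this thread
my $xml = <<'XML';
<e><p><s>
  <w id="1"><a name="l">run</a><a name="wn">move</a><a name="wn">operate</a></w>
  <w id="2"><a name="l">cat</a></w>
</s></p></e>
XML

# returns one "<l values>#<wn values>" string per w element that has both kinds of a
sub pair_lines {
    my ($string) = @_;
    my $doc = XML::LibXML->load_xml( string => $string );
    my @lines;
    for my $w ( $doc->findnodes('/e/p//w[@id]') ) {
        my @l  = map { $_->textContent } $w->findnodes('a[@name="l"]');
        my @wn = map { $_->textContent } $w->findnodes('a[@name="wn"]');
        next unless @l && @wn;
        push @lines, "@l#@wn";
    }
    return @lines;
}

print "$_\n" for pair_lines($xml);
```

For a file on disk you would pass location => $file to load_xml instead of string => $xml.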

That said, here is a solution with XML::Twig that should be easy on the RAM (it purges the in-memory structure after each 'w' element). Note that the code is untested, because you did not give us sample data.

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

{
    my $file = $ARGV[0];
    my $wns  = {};    # { <id> => [ <a text>, ... ], ... }
    my $ls   = {};    # same

    XML::Twig->new(
        twig_handlers => {
            q{/e/p//w[@id]/a[@name="wn"]} => sub { add_value( @_, $wns ); },
            q{/e/p//w[@id]/a[@name="l"]}  => sub { add_value( @_, $ls ); },
            # once you're done with a w element you can get rid of it
            # (purge, not flush: flush would also print the XML to STDOUT)
            q{/e/p//w} => sub { $_[0]->purge; },
        },
    )->parsefile($file);

    # 32812 is the highest id expected in the data
    for my $n ( 1 .. 32812 ) {
        next unless $ls->{$n} && $wns->{$n};
        print "@{$ls->{$n}}#@{$wns->{$n}}\n";
    }
}

# get the id and then add the text of a in the proper array
sub add_value {
    my ( $t, $elt, $store ) = @_;
    my $id = $elt->parent->id;
    $store->{$id} ||= [];
    push @{ $store->{$id} }, $elt->text;    # or xml_string if you want embedded tags
}

In reply to Re: large XML file in using XPATH by mirod
in thread large XML file in using XPATH by Anonymous Monk
