in reply to How to read compressed (gz) file in xml::twig

Hi CSharma,

For your first question, see IO::Uncompress::Gunzip; the following works for me:

use IO::Uncompress::Gunzip ();
use XML::Twig;

my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
    or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
my $twig = XML::Twig->new;
$twig->parse($z);
$z->close;

As for your second question, as far as I can tell you haven't provided enough information to reproduce the problem; see SSCCE. Also, I'm not sure how this question relates to the first. If it is a separate question, you should probably put it in a separate post.

Hope this helps,
-- Hauke D

Re^2: How to read compressed (gz) file in xml::twig
by CSharma (Sexton) on Feb 20, 2017 at 12:04 UTC

    Thanks Hauke for the solution! That works, but the script takes a lot of time to produce its output. Can this be reduced? The gzipped file is 45MB, and there are 158K child (Offer) elements in total.

    And the second question wasn't related to the first; it was a separate issue. Here is the code snippet.
    my $file = 'Offerfeed_11742413_uk.full.xml.gz';
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
    my $twig = new XML::Twig;          ## Get twig object
    $twig->parse($z);                  ## parse the file to build twig
    my $root = $twig->root;            ## Get the root element of twig
    my @elements = $root->children;    ## Get elements list of twig
    my $ct = 0;
    foreach my $e (sort @elements) {
        my $cpc = ($e->first_child('EstimatedCPC')->text) * 100;
        print $cpc, "\n";
        $ct++;
    }
    print $ct, "\n";

      45MB of gzipped XML is going to be a lot of data once it is uncompressed; take a look at some of these file sizes, XML vs gzip. Either profile your code to see if improvements can be made (see the documentation for advice on huge documents), or invest in a faster CPU, faster disks, much more RAM...
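
To get a feel for the size difference and for how much of the run time is plain decompression, something like the following rough sketch might help. It uses only core modules; the sample data is made up (it merely reuses the element names from this thread), so the numbers it prints say nothing about the real feed. For a proper profile of the whole script, a profiler such as Devel::NYTProf (run via `perl -d:NYTProf script.pl`) is probably the better tool.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Time::HiRes            qw(time);

# Hypothetical sample data standing in for the real feed
my $xml = '<CatalogListings>'
        . ( '<Offer><EstimatedCPC>0.25</EstimatedCPC></Offer>' x 1000 )
        . '</CatalogListings>';

# Compress in memory so the sketch needs no files on disk
my $gz;
gzip( \$xml => \$gz ) or die "gzip failed: $GzipError\n";

# Time only the decompression step
my $plain;
my $start = time;
gunzip( \$gz => \$plain ) or die "gunzip failed: $GunzipError\n";
my $elapsed = time - $start;

printf "compressed: %d bytes, uncompressed: %d bytes, gunzip took %.4fs\n",
    length $gz, length $plain, $elapsed;
```

Wrapping the `$twig->parse(...)` call of the real script in the same `time` bookkeeping would show how the parsing step compares.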

      Hi CSharma,

      but the script takes a lot of time to produce its output

      How long does it take to gunzip the file and then process it with your existing script? How much longer does the above code take? To get a somewhat decent comparison, try piping the output of gunzip into your script (it'll need a slight modification to read from STDIN).
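
The STDIN variant for that comparison could look roughly like this. It is only a sketch: the sub name `cpc_values` and the convention of passing `-` as the first argument are made up for illustration, and it assumes every Offer has an EstimatedCPC child, as the original snippet does.

```perl
use strict;
use warnings;
use XML::Twig;

# Parse XML from an open filehandle (XML::Twig's parse also accepts
# a string) and return each Offer's EstimatedCPC multiplied by 100.
sub cpc_values {
    my ($source) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($source);
    return map { $_->first_child('EstimatedCPC')->text * 100 }
           $twig->root->children('Offer');
}

# Hypothetical convention: a first argument of "-" means "read STDIN"
if ( @ARGV && $ARGV[0] eq '-' ) {
    print "$_\n" for cpc_values( \*STDIN );
}
```

It would then be run as `gunzip -c Offerfeed_11742413_uk.full.xml.gz | perl script.pl -`, and the timing compared against the IO::Uncompress::Gunzip version.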

      One thing that might* speed things up is XML::Twig's ability to parse an XML file in chunks, instead of reading the whole document into memory as you're currently doing.

      use warnings;
      use strict;
      use IO::Uncompress::Gunzip ();
      use XML::Twig;

      my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
          or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
      my $twig = XML::Twig->new(
          twig_roots => {
              '/CatalogListings/Offer/EstimatedCPC' => sub {
                  my ($t, $elt) = @_;
                  print $elt->text*100, "\n";
                  $t->purge;
              },
          },
      );
      $twig->parse($z);
      $z->close;

      This prints the same values as before (in document order, and without the final count), but discards each <EstimatedCPC> element once it has been processed and ignores the other elements.

      (* The code works, but I haven't had the chance to do a performance test.)
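
For trying the twig_roots approach without the real feed, here is a small self-contained sketch that parses an inline document instead of the gzipped file. The sample XML is invented (only the element names CatalogListings, Offer and EstimatedCPC come from this thread; Title is a made-up extra element to show that it is skipped). It also keeps a count, which the original snippet printed at the end; the values are collected in an array purely to make the demo easy to inspect.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; <Title> is never built because of twig_roots
my $xml = <<'XML';
<CatalogListings>
  <Offer><Title>First</Title><EstimatedCPC>0.25</EstimatedCPC></Offer>
  <Offer><Title>Second</Title><EstimatedCPC>0.5</EstimatedCPC></Offer>
</CatalogListings>
XML

my @cpc;
my $twig = XML::Twig->new(
    twig_roots => {
        '/CatalogListings/Offer/EstimatedCPC' => sub {
            my ($t, $elt) = @_;
            push @cpc, $elt->text * 100;   # handle one element
            $t->purge;                     # then free what was parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @cpc;
print scalar(@cpc), "\n";                  # number of offers seen
```

With the real file, the `$twig->parse($xml)` line would be replaced by parsing the IO::Uncompress::Gunzip handle, and the values could be printed directly in the handler instead of being stored.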

      Hope this helps,
      -- Hauke D

        Thanks a lot Hauke!! The code worked and it's certainly better than earlier. Thanks, Chetan