How to read compressed (gz) file in xml::twig

CSharma has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Purpose: I have hundreds of compressed (.gz) XML files, from which I need to build a distribution of the values of the 'EstimatedCPC' tags. Issue: I'm not sure how to read a gzipped file in XML::Twig, and I don't want to unzip each file before reading it. Is there any way to read them directly? Please help. Also, code 2 is throwing this error: "Wide character in print at /usr/local/share/perl/5.22.1/XML/Twig.pm line 8628." How do I fix this?

Perl Code 1:

use strict;
use warnings;
use XML::Twig;

my $file = 'file.xml';
my $twig = XML::Twig->new;    ## Get twig object
$twig->parsefile($file);      ## parse the file to build twig

my $root = $twig->root;            ## Get the root element of twig
my @elements = $root->children;    ## Get elements list of twig

## Iterate in document order; sort() on element objects would only
## compare their stringified references.
foreach my $e (@elements) {
    my $cpc = $e->first_child('EstimatedCPC')->text * 100;
    print $cpc, "\n";
}

Perl Code 2:


$twig->parsefile( "file.xml");    # build the twig
my $root= $twig->root;           # get the root of the twig (stats)
my @players= $root->children;    # get the player list

                                 # sort it on the text of the field
my @sorted= sort {    $b->first_child( $field)->text 
                  <=> $a->first_child( $field)->text }
            @players;
                                 
print '<?xml version="1.0"?>';   # print the XML declaration
print '<!DOCTYPE stats SYSTEM "stats.dtd" []>';
print '<stats>';                 # then the root element start tag

foreach my $player (@sorted)     # the sorted list 
 { $player->print;               # print the xml content of the element
   print "\n"; 
 }
print "</stats>\n";              # close the document

Example of xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CatalogListings>
    <Offer id="af94bdd18ff9ffbf66afb5286dcb68fa">
        <Command>new</Command>
        <Title>Puma Pitch Shorts</Title>
        <Description>Pitch Shorts: Let your football team look like pros and play like pros with these lightweight shorts from PUMA. Highly functional materials draw sweat away from your skin and help keep you dry and comfortable during exercise. Get ready for dry with dryCELL. Bio-based wicking finish to keep you dry.</Description>
        <EstimatedCPC>0.0434</EstimatedCPC>
        <LastModified>2017-02-15 21:31:41</LastModified>
        <Images>
            <Image available="true">
                <Url>http://r.kelkoo.com/r/uk/11210623/100353523/90/90/http%3A%2F%2Fpumaecom.scene7.com%2Fis%2Fimage%2FPUMAECOM%2F702075_25_01_EEA%3F%24PUMA_GRID%24/d4qCltxt.0XARAbgLGRcGsAKxgSY3iHhaVcF_7bEuPg-</Url>
            </Image>
        </Images>
        <Url>http://ecs-uk.kelkoo.co.uk/ctl/go/offersearchGo?.ts=1487245527057&amp;.sig=Ch1dMBKSr5hhrL8bNhlNkv_GMSg&amp;catId=100353523&amp;localCatId=100353523&amp;comId=11210623&amp;offerId=af94bdd18ff9ffbf66afb5286dcb68fa&amp;searchId=null&amp;affiliationId=96951977&amp;country=uk&amp;wait=true&amp;contextLevel=2&amp;service=11</Url>
        <MobileFriendly>false</MobileFriendly>
        <Merchant id="11210623"/>
        <Category id="100353523">
            <Name>Miscellaneous</Name>
        </Category>
        <Price currency="GBP">
            <Price>20.0</Price>
            <DeliveryCost>3.95</DeliveryCost>
            <TotalPrice>23.95</TotalPrice>
        </Price>
        <ProductClass>0</ProductClass>
        <Availability>1</Availability>
        <OffensiveContent>false</OffensiveContent>
        <Ean>4055261425365</Ean>
        <MerchantCategory>Male|Mens  Sports  Football  Pants &amp; Shorts</MerchantCategory>
        <Brand>Puma</Brand>
        <BrandId>2571</BrandId>
        <Model>Pitch Shorts</Model>
        <Currency>GBP</Currency>
    </Offer>
</CatalogListings>


Re: How to read compressed (gz) file in xml::twig by haukex (Archbishop) on Feb 20, 2017 at 10:07 UTC
Hi CSharma,

For your first question, see IO::Uncompress::Gunzip. The following works for me:

use IO::Uncompress::Gunzip ();
use XML::Twig;

my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
    or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
my $twig = XML::Twig->new;
$twig->parse($z);
$z->close;

As for your second question, as far as I can tell you haven't provided enough information to reproduce the problem; see SSCCE. Also, I'm not sure how this question relates to the first. If this is a separate question, you should probably put it in a separate post.

Hope this helps,
-- Hauke D
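
On the "Wide character in print" warning: Perl emits it whenever a string containing a character above U+00FF is printed to a filehandle that has no encoding layer. A minimal sketch of the usual remedy follows, assuming the output goes to STDOUT and the feed contains such characters (the thread does not confirm either). The helper `utf8_bytes_written` and the sample title are hypothetical and exist only to show what the layer does.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8. Without an encoding layer,
# printing a character above U+00FF triggers "Wide character in print".
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return the raw bytes a UTF-8 layer writes for $text,
# using an in-memory filehandle instead of a real file.
sub utf8_bytes_written {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Made-up sample text containing U+2122 (TRADE MARK SIGN).
my $title = "Puma\x{2122} Pitch Shorts";
print $title, "\n";    # no warning, because STDOUT now has an encoding layer
```

With the `binmode` call placed before the first `print`, output produced through `$player->print` on STDOUT should pass through the same layer.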
Re^2: How to read compressed (gz) file in xml::twig by CSharma (Sexton) on Feb 20, 2017 at 12:04 UTC
Thanks Hauke for the solution! That works, but the script takes a long time to produce its output. Can this be reduced? The gzipped file is 45MB, and there are 158K children (Offer) in total.

And the second question wasn't related to the first; that was a different one. Here is the code snippet.

my $file = 'Offerfeed_11742413_uk.full.xml.gz';
my $z = IO::Uncompress::Gunzip->new($file)
    or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
my $twig = new XML::Twig;    ## Get twig object
$twig->parse($z);    ## parse the file to build twig

my $root = $twig->root;        ## Get the root element of twig
my @elements = $root->children;    ## Get elements list of twig
my $ct = 0;
foreach my $e (sort @elements){
    my $cpc = ($e->first_child('EstimatedCPC')->text)*100;
    print $cpc,"\n";
    $ct++;
}
print $ct,"\n";
Re^3: How to read compressed (gz) file in xml::twig by marto (Cardinal) on Feb 20, 2017 at 12:11 UTC
A 45MB gzip file is going to be a lot of data once unzipped; take a look at some of these file sizes, XML vs gzip. Either profile your code to see if improvements can be made (see the documentation for advice on huge documents), or invest in a faster CPU, faster disks, much more RAM...
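
One rough way to start profiling is to time decompression on its own, with no XML parsing, and compare that to the full script's run time. The sketch below assumes file names are passed on the command line; `time_gunzip` is a hypothetical helper, not part of any module mentioned in the thread.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use Time::HiRes qw(time);

# Hypothetical helper: decompress a .gz file without parsing it and
# return (uncompressed byte count, elapsed seconds).
sub time_gunzip {
    my ($file) = @_;
    my $start = time;
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "gunzip failed for $file: $GunzipError\n";
    my ($bytes, $buffer, $n) = (0, '', 0);
    # read() returns the number of bytes placed in $buffer,
    # 0 at end of file, and a negative value on error.
    while (($n = $z->read($buffer, 1 << 20)) > 0) {
        $bytes += $n;
    }
    die "read failed for $file: $GunzipError\n" if $n < 0;
    $z->close;
    return ($bytes, time - $start);
}

# Report size and decompression time for each file named on the command line.
for my $file (@ARGV) {
    my ($bytes, $seconds) = time_gunzip($file);
    printf "%s: %d bytes uncompressed in %.2f s\n", $file, $bytes, $seconds;
}
```

If decompression accounts for only a small share of the total, the time is going into building and walking the twig rather than into gunzip.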
Re^3: How to read compressed (gz) file in xml::twig by haukex (Archbishop) on Feb 20, 2017 at 16:07 UTC
Hi CSharma,

"but script takes a lot time in providing the output"

How long does it take to gunzip the file and then process it with your existing script? How much longer does the above code take? To get a somewhat decent comparison, try piping the output of gunzip into your script (it'll need a slight modification to read from `STDIN`).

One thing that might* speed things up is to make use of XML::Twig's ability to parse an XML file in chunks, instead of reading the whole thing into memory like you're currently doing.

use warnings;
use strict;
use IO::Uncompress::Gunzip ();
use XML::Twig;

my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
    or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
my $twig = XML::Twig->new(
    twig_roots => {
        '/CatalogListings/Offer/EstimatedCPC' => sub {
            my ($t, $elt) = @_;
            print $elt->text * 100, "\n";
            $t->purge;
        },
    },
);
$twig->parse($z);
$z->close;

This produces the same output as before, but discards each `<EstimatedCPC>` element when it's done processing it, and ignores the other elements.

(* The code works, but I haven't had the chance to do a performance test.)

Hope this helps,
-- Hauke D
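
For the stated goal of a distribution across hundreds of files, the same chunked approach can feed a counter instead of printing each value. This is a sketch only: `cpc_distribution` is a hypothetical helper, the files are assumed to be named on the command line, and bucketing by the value times 100 rounded to two decimals is an assumption about what "distribution" should mean here.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use XML::Twig;

# Hypothetical helper: count how often each EstimatedCPC value (times 100,
# rounded to two decimals) occurs across a list of gzipped feed files.
sub cpc_distribution {
    my (@files) = @_;
    my %count;
    for my $file (@files) {
        my $z = IO::Uncompress::Gunzip->new($file)
            or die "gunzip failed for $file: $GunzipError\n";
        my $twig = XML::Twig->new(
            twig_roots => {
                '/CatalogListings/Offer/EstimatedCPC' => sub {
                    my ($t, $elt) = @_;
                    my $bucket = sprintf '%.2f', $elt->text * 100;
                    $count{$bucket}++;
                    $t->purge;    # free what has been parsed so far
                },
            },
        );
        $twig->parse($z);
        $z->close;
    }
    return \%count;
}

# Print "value<TAB>count" lines, ordered numerically by value.
my $dist = cpc_distribution(@ARGV);
for my $bucket (sort { $a <=> $b } keys %$dist) {
    print "$bucket\t$dist->{$bucket}\n";
}
```

It could be run as, for example, `perl cpc_dist.pl *.xml.gz` (the script name is made up). Because only the counts are kept, memory use should stay small regardless of how many files are processed.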
Re^4: How to read compressed (gz) file in xml::twig by CSharma (Sexton) on Feb 22, 2017 at 03:08 UTC