CSharma has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Purpose: I have hundreds of compressed (.gz) XML files, and I need to build a distribution of the values of the 'EstimatedCPC' tags in them.

Issue: I'm not sure how to read a gzipped file in XML::Twig; I don't want to unzip each file before reading it. Is there any way to read it directly? Please help.

Also, Code 2 throws this warning: "Wide character in print at /usr/local/share/perl/5.22.1/XML/Twig.pm line 8628." How do I fix it?

Perl Code 1:

my $file = 'file.xml';
my $twig = new XML::Twig;           ## Get twig object
$twig->parsefile($file);            ## parse the file to build twig
my $root = $twig->root;             ## Get the root element of twig
my @elements = $root->children;     ## Get elements list of twig
foreach my $e (sort @elements){
    my $cpc = ($e->first_child('EstimatedCPC')->text)*100;
    print $cpc,"\n";
}

Perl Code 2:

$twig->parsefile( "file.xml");    # build the twig
my $root= $twig->root;            # get the root of the twig (stats)
my @players= $root->children;     # get the player list

# sort it on the text of the field
my @sorted= sort { $b->first_child( $field)->text <=> $a->first_child( $field)->text } @players;

print '<?xml version="1.0"?>';    # print the XML declaration
print '<!DOCTYPE stats SYSTEM "stats.dtd" []>';
print '<stats>';                  # then the root element start tag
foreach my $player (@sorted)      # the sorted list
{
    $player->print;               # print the xml content of the element
    print "\n";
}
print "</stats>\n";               # close the document

Example of xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CatalogListings>
  <Offer id="af94bdd18ff9ffbf66afb5286dcb68fa">
    <Command>new</Command>
    <Title>Puma Pitch Shorts</Title>
    <Description>Pitch Shorts: Let your football team look like pros and play like pros with these lightweight shorts from PUMA. Highly functional materials draw sweat away from your skin and help keep you dry and comfortable during exercise. Get ready for dry with dryCELL. Bio-based wicking finish to keep you dry.</Description>
    <EstimatedCPC>0.0434</EstimatedCPC>
    <LastModified>2017-02-15 21:31:41</LastModified>
    <Images>
      <Image available="true">
        <Url>http://r.kelkoo.com/r/uk/11210623/100353523/90/90/http%3A%2F%2Fpumaecom.scene7.com%2Fis%2Fimage%2FPUMAECOM%2F702075_25_01_EEA%3F%24PUMA_GRID%24/d4qCltxt.0XARAbgLGRcGsAKxgSY3iHhaVcF_7bEuPg-</Url>
      </Image>
    </Images>
    <Url>http://ecs-uk.kelkoo.co.uk/ctl/go/offersearchGo?.ts=1487245527057&amp;.sig=Ch1dMBKSr5hhrL8bNhlNkv_GMSg&amp;catId=100353523&amp;localCatId=100353523&amp;comId=11210623&amp;offerId=af94bdd18ff9ffbf66afb5286dcb68fa&amp;searchId=null&amp;affiliationId=96951977&amp;country=uk&amp;wait=true&amp;contextLevel=2&amp;service=11</Url>
    <MobileFriendly>false</MobileFriendly>
    <Merchant id="11210623"/>
    <Category id="100353523">
      <Name>Miscellaneous</Name>
    </Category>
    <Price currency="GBP">
      <Price>20.0</Price>
      <DeliveryCost>3.95</DeliveryCost>
      <TotalPrice>23.95</TotalPrice>
    </Price>
    <ProductClass>0</ProductClass>
    <Availability>1</Availability>
    <OffensiveContent>false</OffensiveContent>
    <Ean>4055261425365</Ean>
    <MerchantCategory>Male|Mens Sports Football Pants &amp; Shorts</MerchantCategory>
    <Brand>Puma</Brand>
    <BrandId>2571</BrandId>
    <Model>Pitch Shorts</Model>
    <Currency>GBP</Currency>
  </Offer>
</CatalogListings>

Replies are listed 'Best First'.
Re: How to read compressed (gz) file in xml::twig
by haukex (Archbishop) on Feb 20, 2017 at 10:07 UTC

    Hi CSharma,

    For your first question, see IO::Uncompress::Gunzip. The following works for me:

    use IO::Uncompress::Gunzip ();
    use XML::Twig;

    my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
        or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
    my $twig = XML::Twig->new;
    $twig->parse($z);
    $z->close;

    As for your second question, as far as I can tell you haven't provided enough information to reproduce the problem; see SSCCE. Also, I'm not sure how this question relates to the first. If it is a separate question, you should probably put it in a separate post.
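
Without a test case the following is only a guess at the cause. Perl emits "Wide character in print" when a string containing a character above 0xFF is printed to a filehandle that has no encoding layer, and XML::Twig returns decoded text, so any non-Latin-1 character in the data could trigger it. A minimal sketch using core Perl only (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample text: U+2019 (right single quote) is above 0xFF,
# so printing it to a handle without an encoding layer would warn
# "Wide character in print".
my $text = "PUMA\x{2019}s shorts";

# Declare the output encoding once, before anything is printed.
binmode(STDOUT, ':encoding(UTF-8)');

print $text, "\n";
```

If the output goes to a file rather than STDOUT, opening that file with '>:encoding(UTF-8)' should have the same effect.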

    Hope this helps,
    -- Hauke D

      Thanks, Hauke, for the solution! That works, but the script takes a long time to produce the output. Can this be reduced? The gzipped file is 45MB, with 158K children (Offer) in total.

      And the second question wasn't related to the first; it was a separate one. Here is the code snippet.

      my $file = 'Offerfeed_11742413_uk.full.xml.gz';
      my $z = IO::Uncompress::Gunzip->new($file)
          or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
      my $twig = new XML::Twig;           ## Get twig object
      $twig->parse($z);                   ## parse the file to build twig
      my $root = $twig->root;             ## Get the root element of twig
      my @elements = $root->children;     ## Get elements list of twig
      my $ct = 0;
      foreach my $e (sort @elements){
          my $cpc = ($e->first_child('EstimatedCPC')->text)*100;
          print $cpc,"\n";
          $ct++;
      }
      print $ct,"\n";

        A 45MB gzip file is going to be a lot of data once unzipped; take a look at some of these file sizes, XML vs gzip. Either profile your code to see whether improvements can be made (see the documentation for advice on huge documents), or invest in a faster CPU, faster disks, much more RAM...

        Hi CSharma,

        but the script takes a long time to produce the output

        How long does it take to gunzip the file and then process it with your existing script? How much longer does the above code take? To get a somewhat decent comparison, try piping the output of gunzip into your script (it'll need a slight modification to read from STDIN).
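
The STDIN variant could look roughly like this. It is a sketch, not a benchmark harness: the sub name cpc_from_handle and the script name in the comment are made up, and it still builds the whole tree in memory like the original script.

```perl
use strict;
use warnings;
use XML::Twig;

# Return EstimatedCPC * 100 for every <Offer> read from an open filehandle.
# (cpc_from_handle is a hypothetical helper name.)
sub cpc_from_handle {
    my ($fh) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($fh);    # parse() takes an open filehandle as well as a string
    return map { $_->first_child_text('EstimatedCPC') * 100 }
           $twig->root->children('Offer');
}

# Read STDIN only when data is actually piped in (not a terminal, not empty).
# Run as:  gunzip -c in.xml.gz | perl cpc_stdin.pl    (made-up script name)
if (!-t STDIN && !eof(STDIN)) {
    print "$_\n" for cpc_from_handle(\*STDIN);
}
```

Timing that pipeline against the IO::Uncompress::Gunzip version (for example with the shell's time command) would show whether the decompression layer or the XML parsing dominates.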

        One thing that might* speed things up is if you make use of XML::Twig's ability to parse an XML file in chunks, instead of reading the whole thing into memory like you're currently doing.

        use warnings;
        use strict;
        use IO::Uncompress::Gunzip ();
        use XML::Twig;

        my $z = IO::Uncompress::Gunzip->new('in.xml.gz')
            or die "gunzip failed: $IO::Uncompress::Gunzip::GunzipError\n";
        my $twig = XML::Twig->new(
            twig_roots => {
                '/CatalogListings/Offer/EstimatedCPC' => sub {
                    my ($t, $elt) = @_;
                    print $elt->text*100, "\n";
                    $t->purge;
                },
            },
        );
        $twig->parse($z);
        $z->close;

        This produces the same output as before, but discards each <EstimatedCPC> element when it's done processing it, and ignores the other elements.

        (* The code works, but I haven't had the chance to do a performance test.)
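
Since the stated goal is a distribution over hundreds of files, the same twig_roots approach can feed a histogram instead of printing each value. This is a rough sketch under assumptions: the sub name add_cpc_counts, the '*.xml.gz' glob pattern, and the bucketing (value times 100, rounded to one decimal) are all illustrative choices, not something from the original post.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use XML::Twig;

# Count the EstimatedCPC values read from one open handle into %$hist.
# Bucket key: value * 100 rounded to one decimal (an assumed bucketing).
sub add_cpc_counts {
    my ($fh, $hist) = @_;
    my $twig = XML::Twig->new(
        twig_roots => {
            '/CatalogListings/Offer/EstimatedCPC' => sub {
                my ($t, $elt) = @_;
                my $bucket = sprintf '%.1f', $elt->text * 100;
                $hist->{$bucket}++;
                $t->purge;    # free the element once it has been counted
            },
        },
    );
    $twig->parse($fh);
    return $hist;
}

my %hist;
for my $file (glob '*.xml.gz') {    # assumed file pattern
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "gunzip of $file failed: $GunzipError\n";
    add_cpc_counts($z, \%hist);
    $z->close;
}

# One line per bucket: bucket value, tab, number of offers.
printf "%s\t%d\n", $_, $hist{$_} for sort { $a <=> $b } keys %hist;
```

Keeping only a hash of counts means memory use should depend on the number of distinct buckets rather than on the number of offers.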

        Hope this helps,
        -- Hauke D