Handling very big gz.Z files

albascura has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone

I came across a problem today and I have no idea what is causing it.

Basically, some time ago I wrote a script that takes a .txt file as input. The file contains some XML tags, and the script checks the content between the <s></s> tags and processes it. I was pretty satisfied (thanks again for all the help I received here), since it worked well on the small file I used as a test. But today I tried running it on a big file (which, by the way, is gz.Z compressed) that has the very same format as the small test file. Surprisingly, it produces two empty files instead of the two full files I got with the small file.

Here is the code I wrote

#! /usr/bin/perl -w
use strict;
use warnings;
use Mojo::DOM;
use File::Slurp 'slurp';
use Data::Dumper;

my $fileParole = shift;
my $neg="not";
my $neg1="no";
my $debug=shift;
my %HOH;
my %HOA;
my %HOE;
my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

my($entry,$lemma,$pos,$id,$ref,$mod);
my @negval;
my $negvalues;
my $notref;
#my $ref1;
my $parolaNO=0;
my($entry1,$lemma1,$pos1,$id1,$ref1,$mod1);

my $neglemma;
my $newchunk="";
my $negentry;
my $i = 0;
my $line;
my @values;
my @fields;
my $el;
my $parolaNeg;

open (INPUT2,$fileParole) or die $!;
my $filepos = "positive.txt";
my $fileneg ="negative.txt";
open OUTPOS,">$filepos";
open OUTNEG,">$fileneg";

while (<INPUT2>) {
    my ($word1, $word2)= split('\t', $_);
    chomp ($word1);
    chomp ($word2);
        $HOA{$word1}  ="${word2}"; 
        $HOA{$word2}  ="${word1}"; 


}

my $negazione=0;
my $aggettivo=0;
my $idagg;
my $idneg;
my $parolaAnalisi="";


for my $chunk ( $dom->find('s')->each ) {
    $i++;
    
    @values = split('\n', $chunk);
    foreach my $val(@values){
        if (($val =~ m/<s>/)||($val =~ m/<\/s>/)) {
            next;
        }
        ($entry,$lemma,$pos,$id,$ref,$mod)= split('\t', $val);
        #$parolaAnalisi=$lemma;
        $entry =~ s/^\s+|\s+$//g;
        $lemma =~ s/^\s+|\s+$//g;
        $pos =~ s/^\s+|\s+$//g;
        #print "LEMMA: ".$lemma. " REF: ".$ref."\n";

        if(($lemma =~ m/^no$/)||($lemma =~ m/^not$/)){
            $idneg=$ref;
        }
        if ((exists $HOA{$lemma})&&($pos=~ m/^JJ$/)) {
            $parolaAnalisi=$lemma;
    
            $aggettivo=1;
            $idagg=$ref;
            $parolaNeg=$lemma;
            #print "ID AGG= ".$idagg."\n";
            #print "ID NEG= ".$idneg."\n";
            
            if ($idneg == $idagg) {
                $negazione=1;
                
                #next;
                #print "NEGAZIONE TROVATA\n";
                
            }
            
        }
            
    }
    #$i++;
    
    
    
    
    
    if (($aggettivo==1)&&($negazione==0)) {
        $HOH{$i}="${chunk}";
    
    }
    
    #print "I: ".$i."\n";
    if (($aggettivo==1)&&($negazione==1)) {
        
        foreach $negvalues(@values){
            chomp $negvalues;
        ($entry1,$lemma1,$pos1,$id1,$ref1,$mod1)= split('\t', $negvalues);
                        #print $negvalues."\n";
        if (($negvalues =~ m/not/)||($negvalues =~ m/no/)) {
            next;
        }
        if($negvalues =~ m/$parolaAnalisi/){
                            #print "NOTREF: ".$notref."\n";
            $negentry="not_".$entry1;
            $neglemma="not_".$lemma1;
                                #print "TRUE\n";
            $newchunk=$newchunk.$negentry."\t".$neglemma."\t".$pos1."\t".$id1."\t".$ref1."\t".$mod1."\n";
        }
        else {
                $newchunk=$newchunk.$negvalues."\n";
            }
    }
            

        
        $HOE{$i}="${newchunk}";
    #    print "$newchunk\n";
        $newchunk="";
    }
    
    $aggettivo=0;
    $negazione=0;
    
}


while (my ($k,$v) = each %HOH ) {
    print OUTPOS "$v\n";
}

    
while (my ($ka,$va) = each %HOE ) {
        print OUTNEG "$va";
    }

And I used the following command to launch it:

gunzip -c bnc.xml.gz.Z|perl provalong.pl testlist.txt

where bnc.xml.gz.Z is the file I should analyze. It is approximately 691 MB

Any idea on why this is not working? Any idea on how to fix this would be really appreciated.

Thanks in advance


Replies are listed 'Best First'.
Re: Handling very big gz.Z files by mildside (Friar) on Feb 06, 2013 at 00:44 UTC
Presumably you can look at the contents of the compressed file OK? I mean, does something like `gunzip -c bnc.xml.gz.Z | more` give you what you are expecting to see as the input data?
Re^2: Handling very big gz.Z files by albascura (Novice) on Feb 06, 2013 at 06:53 UTC
That is the problem: it does. The data are in the very same format I used in my small test file.
Re: Handling very big gz.Z files by mbethke (Hermit) on Feb 07, 2013 at 05:07 UTC
Hi albascura, your problem is in this line:

    my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're just exhausting your memory there. Slurping the whole file, which is a couple of gigabytes uncompressed, is already likely to bring common desktops to their limit, and then building the DOM on top of that will fail on anything but the biggest iron. To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig allows for almost the same convenience with almost the low memory consumption of an XML stream parser like XML::SAX. Something like this should do it:

    use XML::Twig;
    use IO::Handle;

    my $stdin = IO::Handle->new();
    $stdin->fdopen(fileno(STDIN),"r") or die "fdopen STDIN: $!";
    XML::Twig->new(
        twig_handlers => { 's' => \&process_sentence }
    )->safe_parse($stdin);

    sub process_sentence {
        my( $t, $elem) = @_;
        ...
        $elem->purge; # don't want to print the original text
    }

`$elem->text` in the `process_sentence` callback returns the sentence text, so it should be equivalent to `$chunk` in your `for my $chunk ( $dom->find('s')->each ) {` loop, except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.

I used to work with the BNC at university, but almost always via SARA, so I can't remember specifics about its markup. ISTR that they used some weird entities though, so maybe you have to use the `keep_encoding` option to `new()`.
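For input where every `<s>` and `</s>` tag sits on its own line with one tab-separated token per line in between (which the original `split('\n', $chunk)` loop suggests, but is an assumption here), a line-by-line reader is another way to keep memory use flat. This is a plain-Perl sketch, not the XML::Twig approach from the reply above, and it does no real XML parsing. The name `each_sentence` and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Call $callback once per sentence, holding only one sentence in memory.
# Assumption: <s ...> and </s> each occupy a line of their own.
sub each_sentence {
    my ($fh, $callback) = @_;
    my @rows;          # token lines of the sentence currently being read
    my $inside = 0;    # true while between <s> and </s>
    my $count  = 0;    # number of complete sentences seen so far
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m{^\s*<s(?:\s[^>]*)?>\s*$}) {
            $inside = 1;
            @rows   = ();
        }
        elsif ($line =~ m{^\s*</s>\s*$}) {
            $callback->(++$count, [@rows]) if $inside;
            $inside = 0;
        }
        elsif ($inside) {
            push @rows, $line;
        }
    }
    return $count;
}

# Demo on a made-up in-memory sample (entry, lemma, pos, id, ref, mod).
my $sample = join '',
    "<s>\n",
    "not\tnot\tRB\t1\t2\tNEG\n",
    "good\tgood\tJJ\t2\t2\tROOT\n",
    "</s>\n",
    "<s>\n",
    "fine\tfine\tJJ\t1\t1\tROOT\n",
    "</s>\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $total = each_sentence($fh, sub {
    my ($n, $rows) = @_;
    printf "sentence %d: %d token line(s)\n", $n, scalar @$rows;
});
print "total sentences: $total\n";
```

To try it on the real data, pass `\*STDIN` instead of `$fh` and keep the `gunzip -c bnc.xml.gz.Z |` pipe. Printing each result to OUTPOS or OUTNEG from inside the callback would also avoid holding every sentence in `%HOH` and `%HOE` until the end.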
Re: Handling very big gz.Z files by flexvault (Monsignor) on Feb 06, 2013 at 16:43 UTC
Welcome albascura,

Using 'more' will take a long time to see that the format is correct at the end of the file. I don't use 'more', but I use 'pg' to display a page at a time. So 'more' may have an option to display the end of the file, but I would do the following:

    gunzip -c bnc.xml.gz.Z > vi.testlist     # I use 'vi.' for any temp lists
    tail -100 vi.testlist | more             # Check if end of file is correct
    cat vi.testlist | perl provalong.pl testlist.txt
    rm vi.testlist                           # clean-up

If it works then you can try the original, and if that doesn't work, you have at least a temporary workaround until you find the specific problem.

Note: My background of AIX, '.Z' is used by the 'compress/uncompress' system commands and '.gz' is used with 'gzip/gunzip' system commands. Are you sure that the file wasn't created that way? 'compress' gets a 10% additional compression over 'gzip' and when disk drives were small, was a big deal. Today is not worth the CPU cycles.

Good Luck...Ed

"Well done is better than well said." - Benjamin Franklin
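One way to check how a file with an ambiguous name such as `.gz.Z` was really created is to look at its first two bytes: gzip output starts with `1f 8b`, and output from `compress` starts with `1f 9d`. The following is a minimal sketch of that check. `sniff_compression` and `sniff_file` are hypothetical helper names, and the check only inspects the outermost layer.

```perl
use strict;
use warnings;

# Classify a compressed stream by its first two bytes.
# 1f 8b is the gzip magic number; 1f 9d is the one written by compress (.Z).
sub sniff_compression {
    my ($bytes) = @_;
    return 'unknown' if !defined $bytes || length($bytes) < 2;
    my $magic = unpack 'H4', $bytes;    # first two bytes as lowercase hex
    return 'gzip'     if $magic eq '1f8b';
    return 'compress' if $magic eq '1f9d';
    return 'unknown';
}

# Hypothetical helper: read the first two bytes of a file and classify them.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $got = read $fh, my $head, 2;
    close $fh;
    return sniff_compression($got ? $head : '');
}

# Usage: perl sniff.pl bnc.xml.gz.Z
print sniff_file($ARGV[0]), "\n" if @ARGV;
```

If the outer layer turns out to be `compress` and the decompressed stream itself starts with `1f 8b`, the file was compressed twice and would need a second decompression pass before the script sees any XML.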
Re^2: Handling very big gz.Z files by mbethke (Hermit) on Feb 07, 2013 at 05:16 UTC
> My background of AIX, '.Z' is used by the 'compress/uncompress' system commands and '.gz' is used with 'gzip/gunzip' system commands. Are you sure that the file wasn't created that way? 'compress' gets a 10% additional compression over 'gzip' and when disk drives were small, was a big deal. Today is not worth the CPU cycles.

OT: I've yet to see the file that compress crunches to a smaller size than gzip. Actually I thought for a long time (before I heard of the patents) that everyone had ditched compress for gzip because compress sucks so badly in comparison. Today, people burn a lot more CPU cycles using lzma, xz & Co. for a much better compression than either.
Re^3: Handling very big gz.Z files by flexvault (Monsignor) on Feb 07, 2013 at 12:12 UTC
mbethke, I think we agree!

What I referred to is that 'gzip' does great in compressing text, and the result is a binary file. Now that file can be compressed further by 'compress'. But I haven't done that since the RT or early RS/6000 days. I don't even know if 'compress' exists on AIX 6.1 or 7.1 (my in-house box with AIX 5.2 has it), but I found it "funny" to see the ".gz.Z" and remembered when it was done. I pointed it out in case the file was being created differently than the OP thought.

I just fired up last week a Debian AMD box with 8-core and 4-2TB drives. Why bother with compression!

Regards...Ed

"Well done is better than well said." - Benjamin Franklin
Re^4: Handling very big gz.Z files by mbethke (Hermit) on Feb 07, 2013 at 16:08 UTC