albascura has asked for the wisdom of the Perl Monks concerning the following question:
Hi everyone
I came across a problem today and I have no idea what is causing it.
Some time ago I wrote a script that takes as input a .txt file containing some XML tags, looks at the content between the <s></s> tags, and processes it. I was pretty satisfied (thanks again for all the help I received here), since it worked well on the small file I used as a test. But today I tried running it on a big file (which, by the way, is gz.Z compressed) that has the very same format as the small test file. Surprisingly, it produces two empty files instead of the two full files I got with the small file.
Here is the code I wrote
    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use Mojo::DOM;
    use File::Slurp 'slurp';
    use Data::Dumper;

    my $fileParole = shift;
    my $neg="not";
    my $neg1="no";
    my $debug=shift;
    my %HOH;
    my %HOA;
    my %HOE;
    my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });
    my($entry,$lemma,$pos,$id,$ref,$mod);
    my @negval;
    my $negvalues;
    my $notref;
    #my $ref1;
    my $parolaNO=0;
    my($entry1,$lemma1,$pos1,$id1,$ref1,$mod1);
    my $neglemma;
    my $newchunk="";
    my $negentry;
    my $i = 0;
    my $line;
    my @values;
    my @fields;
    my $el;
    my $parolaNeg;

    open (INPUT2,$fileParole) or die $!;
    my $filepos = "positive.txt";
    my $fileneg ="negative.txt";
    open OUTPOS,">$filepos";
    open OUTNEG,">$fileneg";

    while (<INPUT2>) {
        my ($word1, $word2)= split('\t', $_);
        chomp ($word1);
        chomp ($word2);
        $HOA{$word1} ="${word2}";
        $HOA{$word2} ="${word1}";
    }

    my $negazione=0;
    my $aggettivo=0;
    my $idagg;
    my $idneg;
    my $parolaAnalisi="";

    for my $chunk ( $dom->find('s')->each ) {
        $i++;
        @values = split('\n', $chunk);
        foreach my $val(@values){
            if (($val =~ m/<s>/)||($val =~ m/<\/s>/)) {
                next;
            }
            ($entry,$lemma,$pos,$id,$ref,$mod)= split('\t', $val);
            #$parolaAnalisi=$lemma;
            $entry =~ s/^\s+|\s+$//g;
            $lemma =~ s/^\s+|\s+$//g;
            $pos =~ s/^\s+|\s+$//g;
            #print "LEMMA: ".$lemma. " REF: ".$ref."\n";
            if(($lemma =~ m/^no$/)||($lemma =~ m/^not$/)){
                $idneg=$ref;
            }
            if ((exists $HOA{$lemma})&&($pos=~ m/^JJ$/)) {
                $parolaAnalisi=$lemma;
                $aggettivo=1;
                $idagg=$ref;
                $parolaNeg=$lemma;
                #print "ID AGG= ".$idagg."\n";
                #print "ID NEG= ".$idneg."\n";
                if ($idneg == $idagg) {
                    $negazione=1;
                    #next;
                    #print "NEGAZIONE TROVATA\n";
                }
            }
        }
        #$i++;
        if (($aggettivo==1)&&($negazione==0)) {
            $HOH{$i}="${chunk}";
        }
        #print "I: ".$i."\n";
        if (($aggettivo==1)&&($negazione==1)) {
            foreach $negvalues(@values){
                chomp $negvalues;
                ($entry1,$lemma1,$pos1,$id1,$ref1,$mod1)= split('\t', $negvalues);
                #print $negvalues."\n";
                if (($negvalues =~ m/not/)||($negvalues =~ m/no/)) {
                    next;
                }
                if($negvalues =~ m/$parolaAnalisi/){
                    #print "NOTREF: ".$notref."\n";
                    $negentry="not_".$entry1;
                    $neglemma="not_".$lemma1;
                    #print "TRUE\n";
                    $newchunk=$newchunk.$negentry."\t".$neglemma."\t".$pos1."\t".$id1."\t".$ref1."\t".$mod1."\n";
                }
                else {
                    $newchunk=$newchunk.$negvalues."\n";
                }
            }
            $HOE{$i}="${newchunk}";
            # print "$newchunk\n";
            $newchunk="";
        }
        $aggettivo=0;
        $negazione=0;
    }

    while (my ($k,$v) = each %HOH ) {
        print OUTPOS "$v\n";
    }
    while (my ($ka,$va) = each %HOE ) {
        print OUTNEG "$va";
    }
And I used the following command to launch it:
    gunzip -c bnc.xml.gz.Z|perl provalong.pl testlist.txt
where bnc.xml.gz.Z is the file I should analyze. It is approximately 691 MB.
Any idea why this is not working? Any suggestion on how to fix it would be really appreciated.
Thanks in advance
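For readers following along, here is a minimal sketch of one direction that might help. It is not a confirmed diagnosis. It rests on two assumptions that the post does not verify: that each <s> and </s> tag sits on its own line (the split on "\n" in the script hints at this), and that the double .gz.Z extension could mean the file was compressed twice, in which case a single gunzip pass would still emit compressed bytes and no <s> tag would ever be found. The helper names each_sentence and looks_compressed are hypothetical, and the sample rows are made up. The idea is to read one sentence block at a time instead of slurping roughly 691 MB into Mojo::DOM, and to fail loudly if the first line still starts with gzip or compress magic bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the data starts with the gzip (1f 8b) or compress (1f 9d) magic bytes.
sub looks_compressed {
    my ($bytes) = @_;
    return $bytes =~ /\A\x1f[\x8b\x9d]/ ? 1 : 0;
}

# Call $callback with an array ref of rows for every <s>...</s> block read
# from $fh, keeping only one block in memory at a time.
# Assumption: the opening and closing tags are each on their own line.
sub each_sentence {
    my ($fh, $callback) = @_;
    my ($count, $inside, $first) = (0, 0, 1);
    my @rows;
    while (my $line = <$fh>) {
        if ($first) {
            $first = 0;
            die "input still looks compressed\n" if looks_compressed($line);
        }
        chomp $line;
        if ($line =~ /<s\b[^>]*>/) {        # opening tag: start a new block
            ($inside, @rows) = (1);
            next;
        }
        if ($line =~ m{</s>}) {             # closing tag: hand the block over
            if ($inside) {
                $callback->([@rows]);
                $count++;
            }
            $inside = 0;
            next;
        }
        push @rows, $line if $inside && $line =~ /\S/;
    }
    return $count;
}

# Demo on an in-memory handle with made-up rows (entry, lemma, pos, id, ref, mod).
my $sample = join "\n",
    '<s>',
    "is\tbe\tVBZ\t1\t0\tROOT",
    "not\tnot\tRB\t2\t3\tNEG",
    "good\tgood\tJJ\t3\t1\tPRD",
    '</s>',
    '<s>',
    "fine\tfine\tJJ\t1\t0\tROOT",
    '</s>',
    '';
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @lemmas;
my $n = each_sentence($fh, sub {
    my ($rows) = @_;
    push @lemmas, join ' ', map { my @f = split /\t/; $f[1] } @$rows;
});
close $fh;
print "$n sentences\n";
print "$_\n" for @lemmas;
```

In the real pipeline the call would presumably take \*STDIN instead of the in-memory handle, with the per-sentence logic from the script moved into the callback. On the sample above the demo prints "2 sentences", then "be not good" and "fine".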
Re: Handling very big gz.Z files
by mildside (Friar) on Feb 06, 2013 at 00:44 UTC
by albascura (Novice) on Feb 06, 2013 at 06:53 UTC

Re: Handling very big gz.Z files
by mbethke (Hermit) on Feb 07, 2013 at 05:07 UTC

Re: Handling very big gz.Z files
by flexvault (Monsignor) on Feb 06, 2013 at 16:43 UTC
by mbethke (Hermit) on Feb 07, 2013 at 05:16 UTC
by flexvault (Monsignor) on Feb 07, 2013 at 12:12 UTC
by mbethke (Hermit) on Feb 07, 2013 at 16:08 UTC