Re: Out of memory
by jmcnamara (Monsignor) on Aug 19, 2002 at 11:56 UTC
It is inefficient to read a file this large into memory. Instead you should process the file line-by-line:
    my @words;
    my $len;
    while (<TEXT>) {
        $len += @words = split /\s+|[,:]/;
    }
    print "LEN = $len\n";
The regex here is a guess at what you might need.
--
John.
Re: Out of memory
by gmpassos (Priest) on Aug 19, 2002 at 12:14 UTC
The problem is with the variable $txt! You are loading the whole file into $txt, and only after that do you count the "words".
Another thing: your regexp (RE) is wrong! [...] makes a character class. For example, \w is the same as [a-zA-Z0-9], and you don't need to put a "\" before "," or ":" either. I think the right RE inside split is / +\n+,:/.
Since you are working with a big file (150M), speed matters too! Don't read the file with <TEXT>; use read() or sysread(), because <> scans the data looking for \n and only then returns it!
You don't need my @array = (); to clean the array, because my @array; already gives you an empty array!
Try this code for your script:
    my $fname = $0;
    open(TEXT, "<$fname") || die "could not open file: $fname\n";

    ## Let's read 100KB at a time.
    my $buffer_size = 1024*100;

    my ($buffer, $words);
    while (sysread TEXT, $buffer, $buffer_size) {
        ## This will count fast for you:
        my @c = ($buffer =~ /( +\n+,:)/gs);
        $words += @c;
    }
    close TEXT;

    print "Length: $words\n";
exit;
"The creativity is the expression of the liberty".
Hi guys,
Thanks for the input - both pieces of code (from both monks) worked!
Now I know to ditch my books and come here for enlightenment.
But it's not just the number of words I'm after;
I need to have the file content in memory to parse and
extract all proper names (i.e. 2 or more consecutive words) from it.
I guess I'll have to go the sysread way.
    my $buffer;
    while (sysread TEXT, $buffer, $buffer_size) {
        ## tr counts the separator characters (roughly one per word):
        $words += ($buffer =~ tr/ \n,://);
    }
I need to store $buffer in an array and then process it word by word.
Is there any efficient way of doing that?
My entire code (if I dare show :))
looked something like this before:
    $fname = "haystack.test";
    open(TEXT, "<$fname") || die "could not open file: $fname\n";
    while (<TEXT>) { $txt .= $_; }
    @words = split (/[ +\n+\,\:]/, $txt);
    $len = @words;
    print "LEN = $len\n";
    close (TEXT);

    $i = 0;
    while ($i < $len)
    {
        my $flag2 = 1;
        my $sptr = my $eptr = $words[$i];
        if ($sptr =~ /^[A-Z][a-z]+/)
        {
            $eptr = $words[$i+1];
            if ($eptr =~ /^[A-Z][a-z]*/ && $i < $len)
            {
                $i++;
                $sptr = $words[$i];
                $eptr = $words[$i+1];
                $flag2 = 0;
                while ($eptr =~ /^[A-Z][a-z]*/ && $i < $len)
                {
                    $i++;
                    #print "I = $i\n";
                    $sptr = $words[$i];
                    $eptr = $words[$i+1];
                }
                if ($flag2 ne 1)
                { print "\n"; }
            }
            else
            { $i++; }
        }
        else
        { $i++; }
    }
    print "\n";
So do you think i'll be alright loading all the words in an array?
Or is there a better way?
Thanks
J
I'm a bit confused. Several people have suggested you read
in the file line-by-line. And now you come with sysread.
It will work with sysread, but not the way you do - because
if you sysread halfway in a word (so the other half will be
read in the iteration), you'll count a word twice. You would
need to keep track of what was at the end of the previous read,
and compare that with what's at the beginning of the next read.
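A boundary-safe variant is possible, though. Here is a minimal sketch (my illustration, not code from the thread, using the /[\s,:]+/ separator set suggested elsewhere in this discussion): the trailing word fragment of each buffer is held back and prepended to the next read, so a word straddling two chunks is counted exactly once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count words with sysread without double-counting a word that
# straddles two buffers: the trailing partial word of each chunk is
# carried over and prepended to the next chunk.
sub count_words {
    my ($fname, $buffer_size) = @_;
    open my $fh, '<', $fname or die "could not open file: $fname\n";
    my ($carry, $count) = ('', 0);
    while (sysread $fh, my $buffer, $buffer_size) {
        $buffer = $carry . $buffer;
        # Strip a trailing word fragment and keep it for the next round.
        $carry = ($buffer =~ s/([^\s,:]+)\z//) ? $1 : '';
        $count += () = $buffer =~ /[^\s,:]+/g;
    }
    close $fh;
    $count++ if length $carry;   # the file ended in the middle of a word
    return $count;
}

print count_words($ARGV[0], 100 * 1024), "\n" if @ARGV;
```

The carry/prepend step is the whole trick; the rest is the ordinary sysread loop from earlier in the thread.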
So, why can't you process the file line-by-line?
You also say "I need to store $buffer in an array and then process it word by word".
Why, oh why? It's certainly not going to solve your out-of-memory error.
As people have indicated, that's where the root of your problem is - trying
to store everything in memory.
I suggest you either follow the given advice, or buy some more memory,
because you will need more if you insist on storing the entire file in
memory. And keep some cash ready; you'll need to buy more as your file grows.
Abigail
    use strict;
    my @tokens = ();
    my $properName = "";
    while (<>) {
        push @tokens, split;
        FINDCAP:
        # the "until" loop skips non-capitalized tokens
        until ( @tokens == 0 or $tokens[0] =~ /^\W*[A-Z][\'A-Za-z]*\b/ ) {
            shift @tokens;
        }
        # the "while" loop accumulates consecutive capitalized tokens,
        # if any were found that caused us to break out of the "until"
        while ( @tokens and $tokens[0] =~ /^\W*([A-Z][\'A-Za-z]*)\b/ ) {
            $properName .= $1 . " ";
            shift @tokens;
        }
        # go into the next block if there are still tokens left
        # (this means we haven't reached the end of this line)
        if ( @tokens ) {
            if ( $properName =~ / [A-Z]/ ) {
                print $properName, $/;  # print if $properName has >1 word
            }
            $properName = "";  # reset to empty string
            goto FINDCAP;      # look through the remainder of this line
        }
        # that block was skipped if there are no tokens left
        # so we loop back to the outer "while" loop to get the
        # next line, and append its words to @tokens
        # (and $properName, if not empty, remains intact
        # for appending the next Capitalized Token, if any)
    }
Because it ignores punctuation, this will do the wrong thing
in a case like "I saw John Brown. He was dead." Good luck!
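One way to handle that case, sketched here as a hypothetical alternative (this is my illustration, not the poster's code): treat a token carrying sentence-ending punctuation as closing the current run of capitalized words, so "John Brown. He" yields only "John Brown".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract runs of two or more capitalized words, breaking the run at
# sentence-ending punctuation so a capitalized sentence-opener is not
# glued onto the previous name.
sub proper_names {
    my ($text) = @_;
    my (@names, @run);
    for my $tok (split ' ', $text) {
        if ($tok =~ /^([A-Z][A-Za-z']*)([.!?]?)/) {
            push @run, $1;
            if ($2) {   # sentence boundary: close the current run
                push @names, "@run" if @run >= 2;
                @run = ();
            }
        } else {
            push @names, "@run" if @run >= 2;
            @run = ();
        }
    }
    push @names, "@run" if @run >= 2;
    return @names;
}

# prints "John Brown" and "New York City"
print "$_\n" for proper_names("I saw John Brown. He was dead in New York City.");
```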
(update: enhanced the final block of commentary in the code,
and made the title more relevant)
Re: Out of memory
by Abigail-II (Bishop) on Aug 19, 2002 at 11:56 UTC
That's because you want to read in the entire file into
memory - while doing it in an inefficient way.
Why not count the number of words on a line by line basis?
BTW, your regex /[ +\n+\,\:]/ isn't doing what
you think it's doing.
Abigail
Hi,
Thanks for the reply.
That was precisely my question.
I guess there are more efficient ways of reading the file
that would alleviate my memory problem - and
that's where I need the help.
What other ways can I read it in?
Also, you got me scared about the regex /[ +\n+\,\:]/ bit. I thought I was splitting on space(s), newline(s),
commas and colons. Am I not? I'm a Perl novice and any
suggestions are much appreciated.
Thanks
J
Oh, you are reading it in fine - line by line. But then you
join all the lines together until you have read all the lines,
before you count the words.
Why not count the words of the line just read, and then read
in the next line?
As for your regex: you split on a space, a plus, a newline,
a plus (again), a comma, or a colon - each as a single character.
I think you want: /[\s,:]+/.
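A quick demonstration of the difference (my example, not from the thread): inside [...] the + is just another literal character, and each single separator produces its own split point, leaving empty fields between adjacent separators; the class followed by + eats whole runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $txt = "foo,  bar:baz\nqux";

# Character class of single separators: space, '+', newline, comma,
# colon. Adjacent separators leave empty fields in the result.
my @wrong = split /[ +\n+\,\:]/, $txt;   # 6 fields, two of them empty

# A class followed by + consumes whole runs of separators at once.
my @right = split /[\s,:]+/, $txt;       # 4 fields: foo bar baz qux

print scalar(@wrong), " vs ", scalar(@right), "\n";
```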
Abigail