Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have some very simple code that reads in a big text file (150 MB) and extracts proper names from it. My problem is that the program bombs out midway, saying "Out of Memory!"
A snippet from the code goes like
$fname = "../all"; open(TEXT, "<$fname")|| die "could not open file: $fname\n"; my @words =(); while (<TEXT>) { $txt .= $_; } @words = split (/[ +\n+\,\:]/, $txt); $len = @words; print "LEN = $len\n"; close (TEXT);
I never get to see the LEN output; the program stops before that.
Can anyone suggest something please?

Thanks
J

Replies are listed 'Best First'.
Re: Out of memory
by jmcnamara (Monsignor) on Aug 19, 2002 at 11:56 UTC

    It is inefficient to read a file this large into memory. Instead you should process the file line-by-line:
    my @words;
    my $len;
    while (<TEXT>) {
        $len += @words = split /\s+|[,:]/;
    }
    print "LEN = $len\n";

    The regex here is a guess at what you might need.

    --
    John.

Re: Out of memory
by gmpassos (Priest) on Aug 19, 2002 at 12:14 UTC
    The problem is the variable $txt! You are loading the whole file into $txt, and only after that do you count the "words".

    Another thing: your regexp (RE) is wrong! [] builds a character class! For example, \w is the same as [a-zA-Z0-9_], and you don't need to put a "\" before "," or ":" either. I think that the right RE inside split is: / +\n+,:/

    Also, since you are working with a big file (150 MB), speed matters too! Don't read the file with <TEXT>; use read() or sysread(), because <> has to scan the data looking for \n before it can return each line!

    You don't need to write my @array = (); to clear the array! When you declare my @array; the array is already empty!

    Try this code for your script:

    $fname = "$0"; open(TEXT, "<$fname")|| die "could not open file: $fname\n"; ## Let's read 100KB per time. my $buffer_size = 1024*100 ; my ($buffer,$words) ; while (sysread TEXT, $buffer, $buffer_size) { ## This will count fast for you: my @c = ($buffer =~ /( +\n+,:)/gs) ; $words += @c ; } close TEXT; print "Length: $words\n" ; exit;
    "The creativity is the expression of the liberty".
      Hi guys,

      Thanks for the input - both pieces of code (from both monks) worked! Now I know to ditch my books and come here for enlightenment. But it's not just the number of words I'm after: I need to have the file content in memory so I can parse it and extract all proper names (i.e. 2 or more consecutive capitalized words) from it.

      I guess I'll have to go the sysread way.

      my $buffer;
      while (sysread TEXT, $buffer, $buffer_size) {
          ## This will use tr to count fast for you:
          $words += ($buffer =~ tr/ +\n+,://);
      }

      I need to store $buffer in an array and then process it word by word. Is there any efficient way of doing that?

      My entire code (if I dare show :)) looked something like this before:

      $fname = "haystack.test"; open(TEXT, "<$fname")|| die "could not open file: $fname\n"; while (<TEXT>) { $txt .= $_; } @words = split (/[ +\n+\,\:]/, $txt); $len = @words; print "LEN = $len\n"; close (TEXT); $i =0; while( $i< $len) { my $flag2 = 1; my $sptr = my $eptr = $words[$i]; if($sptr =~ /^[A-Z][a-z]+/ ) { $eptr = $words[$i+1] ; if($eptr =~ /^[A-Z][a-z]*/ && $i< $len) { $i++; $sptr = $words[$i]; $eptr = $words[$i+1] ; $flag2 = 0; while($eptr =~ /^[A-Z][a-z]*/ && $i < $len) { $i++; #print "I =$i\n"; $sptr = $words[$i] ; $eptr = $words[$i+1] ; } if (flag2 ne 1) { print"\n";} } else {$i++; } else { $i++;} } print"\n";

      So do you think I'll be alright loading all the words into an array? Or is there a better way?

      Thanks
      J

        I'm a bit confused. Several people have suggested you read the file in line-by-line, and now you come back with sysread. It will work with sysread, but not the way you do it - because if a sysread ends halfway through a word (so that the other half is read in the next iteration), you'll count that word twice. You would need to keep track of what was at the end of the previous read, and compare it with what's at the beginning of the next read.
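
        For what it's worth, here is a minimal sketch of that bookkeeping (assuming the same TEXT handle and $buffer_size as in the code above; a guess at what you want, not tested against your data):

            my ($buffer, $tail, $words) = ("", "", 0);
            while (sysread TEXT, $buffer, $buffer_size) {
                $buffer = $tail . $buffer;          # prepend the leftover from the previous read
                # keep any trailing partial word for the next iteration
                $tail = ($buffer =~ s/(\S+)\z//) ? $1 : "";
                $words += () = $buffer =~ /\S+/g;   # count only whole words
            }
            $words++ if length $tail;               # the file may end mid-word
            print "LEN = $words\n";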

        So, why can't you process the file line-by-line?

        You also say "I need to store $buffer in an array and then process it word by word". Why, oh why? It's certainly not going to solve your out of memory error. As people have indicated, that's where the root of your problem is - trying to store everything in memory.

        I suggest you either follow the advice given, or buy some more memory - because you will need more if you insist on storing the entire file in memory. And keep some cash ready: you'll need to buy even more if your file grows.

        Abigail

        If your input data is such that just finding two or more Consecutive Capitalized Words is sufficient to identify proper names, then your logic can be very "local" as well as simple -- you wouldn't need more than two lines of text in memory at any one time with Perl. But avoiding false alarms and misses takes some careful thought. Aside from the two false alarms in my first two sentences, your current method would completely miss cases like the following: Mike O'Conner, whoever he is, would be missed if he's mentioned at the beginning of a quotation, as well as in a parenthetical. And what about George McGovern?

        Leaving that aside, this may be one of those rare cases where a "goto" statement is worth having. Consider:

        use strict;

        my @tokens = ();
        my $properName = "";

        while (<>) {
            push @tokens, split;

          FINDCAP:
            # the "until" loop skips non-capitalized tokens
            until ( @tokens == 0 or $tokens[0] =~ /^\W*[A-Z][\'A-Za-z]*\b/ ) {
                shift @tokens;
            }

            # the "while" loop accumulates consecutive capitalized tokens,
            # if any were found that caused us to break out of the "until"
            while ( @tokens and $tokens[0] =~ /^\W*([A-Z][\'A-Za-z]*)\b/ ) {
                $properName .= $1 . " ";
                shift @tokens;
            }

            # go into the next block if there are still tokens left
            # (this means we haven't reached the end of this line)
            if ( @tokens ) {
                if ( $properName =~ / [A-Z]/ ) {
                    print $properName, $/;   # print if $properName has >1 word
                }
                $properName = "";            # reset to empty string
                goto FINDCAP;                # look through the remainder of this line
            }

            # that block was skipped if there are no tokens left,
            # so we loop back to the outer "while" loop to get the
            # next line and append its words to @tokens
            # (and $properName, if not empty, remains intact
            # for appending the next Capitalized Token, if any)
        }
        Because it ignores punctuation, this will do the wrong thing in a case like "I saw John Brown. He was dead." Good luck! (update: enhanced the final block of commentary in the code, and made the title more relevant)
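
        If the sentence-boundary problem matters for your data, one possible workaround (a sketch only - it gives up the cross-line accumulation above, and assumes '.', '!' and '?' end sentences) is to split each line into sentence-ish chunks before looking for capitalized runs:

            while (my $line = <>) {
                # split at sentence-ending punctuation, so "Brown. He" becomes two chunks
                for my $chunk (split /(?<=[.!?])\s+/, $line) {
                    # print each run of two or more consecutive capitalized words
                    while ($chunk =~ /\b([A-Z][\'A-Za-z]*(?:\s+[A-Z][\'A-Za-z]*)+)/g) {
                        print "$1\n";
                    }
                }
            }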
Re: Out of memory
by Abigail-II (Bishop) on Aug 19, 2002 at 11:56 UTC
    That's because you are reading the entire file into memory - and doing it in an inefficient way at that.

    Why not count the number of words on a line by line basis? BTW, your regex /[ +\n+\,\:]/ isn't doing what you think it's doing.

    Abigail

      Hi,

      Thanks for the reply.

      That was precisely my question. I guess there are more efficient ways of reading the file that would alleviate my memory problem - and that's where I need help. What other ways can I read it in?

      Also, you got me scared about the regex /[ +\n+\,\:]/ bit. I thought I was splitting on space(s), newline(s), commas and colons. Am I not? I'm a Perl novice, and any suggestions are much appreciated.

      Thanks
      J

        Oh, you are reading it in fine - line by line. But then you keep joining the lines together until you have read the entire file, and only then do you count the words.

        Why not count the words of the line just read, and then read in the next line?

        As for your regex, you split on a single space, a plus, a newline, a plus (again), a comma, or a colon. I think you want: /[\s,:]+/.
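
        For illustration, a small made-up example shows the difference:

            my $txt = "one  two,three:four";
            my @a = split /[ +\n+\,\:]/, $txt;   # your class: splits at each single space, '+', newline, ',' or ':'
            my @b = split /[\s,:]+/, $txt;       # splits at runs of whitespace, commas and colons
            print scalar(@a), " vs ", scalar(@b), "\n";   # prints "5 vs 4" - the double space leaves an empty field in @a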

        Abigail