in reply to Re: Out of memory
in thread Out of memory

Hi guys,

Thanks for the input - both pieces of code (from both monks) worked! Now I know to ditch my books and come here for enlightenment. But it's not just the number of words I'm after; I need to have the file content in memory to parse and extract all proper names (i.e. 2 or more consecutive capitalized words) from it.

I guess I'll have to go the sysread way.

my $buffer;
while (sysread TEXT, $buffer, $buffer_size) {
    ## This will use tr to count fast for you:
    $words += ($buffer =~ tr/ +\n+,://);
}

I need to store $buffer in an array and then process it word by word. Is there any efficient way of doing that?

My entire code (if I dare show :)) looked something like this before:

$fname = "haystack.test"; open(TEXT, "<$fname")|| die "could not open file: $fname\n"; while (<TEXT>) { $txt .= $_; } @words = split (/[ +\n+\,\:]/, $txt); $len = @words; print "LEN = $len\n"; close (TEXT); $i =0; while( $i< $len) { my $flag2 = 1; my $sptr = my $eptr = $words[$i]; if($sptr =~ /^[A-Z][a-z]+/ ) { $eptr = $words[$i+1] ; if($eptr =~ /^[A-Z][a-z]*/ && $i< $len) { $i++; $sptr = $words[$i]; $eptr = $words[$i+1] ; $flag2 = 0; while($eptr =~ /^[A-Z][a-z]*/ && $i < $len) { $i++; #print "I =$i\n"; $sptr = $words[$i] ; $eptr = $words[$i+1] ; } if (flag2 ne 1) { print"\n";} } else {$i++; } else { $i++;} } print"\n";

So do you think I'll be alright loading all the words into an array? Or is there a better way?

Thanks
J

Re: Out of memory
by Abigail-II (Bishop) on Aug 19, 2002 at 13:21 UTC
    I'm a bit confused. Several people have suggested you read the file line by line, and now you come back with sysread. It can work with sysread, but not the way you're doing it - because if a sysread ends halfway through a word (so the other half is read in the next iteration), you'll count that word twice. You would need to keep track of what was at the end of the previous read and compare it with what's at the beginning of the next read.
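    A minimal sketch of that bookkeeping (the chunk size and variable names here are only illustrative, and it counts whitespace-separated words rather than your separator set):

    my ($buffer, $carry, $words) = ("", "", 0);
    while (sysread TEXT, $buffer, 4096) {
        $buffer = $carry . $buffer;   # prepend the fragment left over from the previous read
        # hold back a possibly incomplete final word for the next iteration
        $carry = ($buffer =~ s/(\S+)\z//) ? $1 : "";
        $words += () = $buffer =~ /\S+/g;   # count only the complete words in this chunk
    }
    $words++ if length $carry;   # the held-back fragment is the file's last word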

    So, why can't you process the file line-by-line?
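    Line by line, the counting job keeps just one line of the file in memory at a time. A minimal sketch, assuming whitespace-separated words:

    open my $fh, '<', 'haystack.test' or die "could not open file: $!\n";
    my $words = 0;
    while (my $line = <$fh>) {
        $words += () = $line =~ /\S+/g;   # count the words on this line only
    }
    close $fh;
    print "LEN = $words\n";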

    You also say "I need to store $buffer in an array and then process it word by word". Why, oh why? It's certainly not going to solve your out-of-memory error. As people have indicated, that's where the root of your problem lies - trying to store everything in memory.

    I suggest you either follow the given advice or buy some more memory, because you will need more if you insist on storing the entire file in memory. And keep some cash ready: you'll need to buy still more as your file grows.

    Abigail


      Hi Abigail-II,

      Looks like our mails crossed - I didn't get to read your reply before posting mine.

      Yes, I realise now that the best way to go is to read the file in line by line, and that's what I've been trying. The reason I've been clamouring about keeping all the words in memory is that, reading the file linewise, some sentences get split midway and I lose the info. Say for "John F. Kennedy", the John bit is on one line and F. Kennedy on the next.

      I guess I'll have to get around that problem by storing at least 3 consecutive lines in memory at any one time?
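      Or maybe even two lines would do, if I carry trailing capitalized words over to the next line - something like this (just a sketch; the carry-over regex is a placeholder, and I'd have to take care not to report the carried words twice):

      my $tail = "";
      while (my $line = <TEXT>) {
          chomp $line;
          my $window = $tail eq "" ? $line : "$tail $line";
          # ... extract proper names from $window here ...
          # carry trailing capitalized words over to be joined with the next line
          $tail = $window =~ /((?:[A-Z][A-Za-z.']*\s*)+)\z/ ? $1 : "";
      }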

      Thanks for all the suggestions.
      J

Looking for proper names in text
by graff (Chancellor) on Aug 20, 2002 at 04:13 UTC
    If your input data is such that just finding two or more Consecutive Capitalized Words is sufficient to identify proper names, then your logic can be very "local" as well as simple -- you wouldn't need more than two lines of text in memory at any one time with Perl. But avoiding false alarms and misses takes some careful thought. Aside from the two false alarms in my first two sentences, your current method would miss the next case completely. (Mike O'Conner, whoever he is, would be missed if he's mentioned at the beginning of a quotation, as well as a parenthetical. And what about George McGovern?)

    Leaving that aside, this may be one of those rare cases where a "goto" statement is worth having. Consider:

    use strict;

    my @tokens = ();
    my $properName = "";

    while (<>) {
        push @tokens, split;
      FINDCAP:
        # the "until" loop skips non-capitalized tokens
        until ( @tokens == 0 or $tokens[0] =~ /^\W*[A-Z][\'A-Za-z]*\b/ ) {
            shift @tokens;
        }
        # the "while" loop accumulates consecutive capitalized tokens,
        # if any were found that caused us to break out of the "until"
        while ( @tokens and $tokens[0] =~ /^\W*([A-Z][\'A-Za-z]*)\b/ ) {
            $properName .= $1 . " ";
            shift @tokens;
        }
        # go into the next block if there are still tokens left
        # (this means we haven't reached the end of this line)
        if ( @tokens ) {
            if ( $properName =~ / [A-Z]/ ) {
                print $properName, $/;   # print if $properName has >1 word
            }
            $properName = "";   # reset to empty string
            goto FINDCAP;       # look through the remainder of this line
        }
        # that block was skipped if there are no tokens left,
        # so we loop back to the outer "while" loop to get the
        # next line, and append its words to @tokens
        # (and $properName, if not empty, remains intact
        # for appending the next Capitalized Token, if any)
    }
    Because it ignores punctuation, this will do the wrong thing in a case like "I saw John Brown. He was dead." Good luck! (update: enhanced the final block of commentary in the code, and made the title more relevant)
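    One way to patch that (a sketch, untested against real data) is to stop accumulating when a token carries sentence-final punctuation, with a crude exception so initials like the "F." in "John F. Kennedy" still get through. The accumulating "while" loop above would become:

    while ( @tokens and $tokens[0] =~ /^\W*([A-Z][\'A-Za-z]*)\b(.*)/ ) {
        my ($word, $rest) = ($1, $2);
        $properName .= $word . " ";
        shift @tokens;
        # a "." "!" or "?" right after a multi-letter word probably ends the sentence;
        # single-letter words are assumed to be initials and are let through
        last if $rest =~ /^[.!?]/ and length($word) > 1;
    }

    A name that ends a sentence at the very end of a line would still leak into the next one, so treat this as a starting point rather than a fix.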