fadingjava has asked for the wisdom of the Perl Monks concerning the following question:

hi monks, I am searching for a keyword in a big file (536 MB). What I want as a result of this search is not only the keyword, but the five words before and after each occurrence of the keyword. What I do now is read each line of the file into a string and split the string into individual words. Then I match the query word against each of these words. Since the words are in an array, I extract the five words before and after the keyword and join them to make the result string. This approach works fine when I run it on a small file, but for the size of file I am working with it becomes very slow. I run it as part of a CGI script, so I get timeouts when the query word is towards the middle of the file or deeper. Can anybody tell me a better way to do this, or correct my approach, given the size of this file? I have not included the result string formation here. Here is the code I use:
# An array @morgr is created here and shuffled, so the word is looked
# for in random lines of the file.

use Fcntl;    # for the O_CREAT / O_RDONLY constants used by sysopen

open (MORG, "< /home/sid/kwicionary/$filename")
    or die "Can't open $filename for reading: $!";

$path      = "/home/sid/kwicionary/";
$indexname = "$path$filename.index";

sysopen(MIDX, $indexname, O_CREAT|O_RDONLY)
    or die "Can't open $indexname for read/write: $!";

# Build the line-offset index only if the index file is still empty.
build_index(*MORG, *MIDX) if -z $indexname;

MORG: foreach (@morgr) {
    $line_number = $_;
    $line = line_with_index(*MORG, *MIDX, $line_number);

    if ($key =~ /\*/) {
        # Wildcard query: strip the trailing '*' and match on the stem
        # (kept in a separate variable so $key is not destroyed for the
        # next line).
        my $stem = substr($key, 0, length($key) - 1);
        @word = split ' ', $line;
        foreach (@word) {
            if ($_ =~ /$stem/) {
                push @morg_results, $_;
            }
        }
    }
}
# (result string formation not shown)
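Since the result-string formation is not shown, here is a minimal sketch of the windowing step the question describes (up to five words either side of the match, joined into one string). The helper name context_window and the index-tracking loop are illustrative only, not part of the original script:

# Given the words of one line and the index $i of the matched word,
# join up to five words either side of it into a result string.
sub context_window {
    my ($i, @word) = @_;
    my $lo = $i > 5           ? $i - 5 : 0;
    my $hi = $i + 5 <= $#word ? $i + 5 : $#word;
    return join ' ', @word[ $lo .. $hi ];
}

# Used from the inner loop by tracking the word's position instead of the word:
#   for my $i (0 .. $#word) {
#       push @morg_results, context_window($i, @word) if $word[$i] =~ /$stem/;
#   }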

Re: searching for a keyword with context window
by BrowserUk (Patriarch) on Nov 03, 2004 at 01:20 UTC

    If your file contains paragraphs (i.e. blank lines between reasonably sized blocks of text), something like this might work. It is pretty quick, processing a 587 MB test file (the text below, replicated several dozen times to get a file of comparable size) in under a minute.

    You can adjust the definition of a 'word' to suit. I've specified 1 to 5 words either side to account for the word being at the start or end of a paragraph.

    #! perl -sw
    use strict;

    our $WORDS ||= 1;                        # words of context either side
    our $KEYWORD || die "-KEYWORD=word needed";

    local $/ = '';                           # Paragraph mode

    my $re_word = qr[\S+\s+];
    my $re_5w_key_5w = qr[
        (
            $re_word {1,$WORDS}
            \Q$KEYWORD\E [,.;:!?]* \s+
            $re_word {1,$WORDS}
        )
    ]ix;

    open IN, '<', $ARGV[ 0 ] or die $!;

    while( <IN> ) {
        if( $_ =~ $KEYWORD ) {               # cheap pre-filter per paragraph
            while( $_ =~ m[$re_5w_key_5w]g ) {
                print "'$1'\n---\n";
            }
        }
    }

    Results:

    [ 0:58:39.10] P:\test>404751 -WORDS=1 -KEYWORD=poverty "Rhetoric - Aristotle.txt"
    'or poverty; it '
    ---
    'or poverty but '
    ---
    'or poverty, of '
    ---
    'or poverty or '
    ---
    'his poverty, a '
    ---
    'in poverty or '
    ---
    [ 0:58:50.09] P:\test>404751 -WORDS=3 -KEYWORD=poverty "Rhetoric - Aristotle.txt"
    'to wealth or poverty; it is of '
    ---
    'to wealth or poverty but to appetite. '
    ---
    'of wealth or poverty, of being lucky '
    ---
    'by sickness or poverty or love or '
    ---
    'disregard of his poverty, a man aging '
    ---
    'by us in poverty or in banishment, '
    ---
    [ 0:58:53.10] P:\test>404751 -WORDS=5 -KEYWORD=poverty "Rhetoric - Aristotle.txt"
    'action due to wealth or poverty; it is of course true '
    ---
    'due not to wealth or poverty but to appetite. Similarly, with '
    ---
    'the sense of wealth or poverty, of being lucky or unlucky. '
    ---
    'are afflicted by sickness or poverty or love or thirst or '
    ---
    'man by disregard of his poverty, a man aging war by '
    ---
    'who stand by us in poverty or in banishment, even if '
    ---

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: searching for a keyword with context window
by TedPride (Priest) on Nov 03, 2004 at 02:55 UTC
    Reading the file in large chunks is more efficient than going line by line, but then you have to worry about a chunk boundary splitting the target word into two pieces. That means you have to carry the last word fragment of each chunk over to the beginning of the next one, which lowers efficiency. When you do find the word, you have to seek to a point some distance before it, read the word plus a buffer zone around it, and extract the other 10 words, so the script is also slow if there are numerous matches. I got about 106 seconds using the following on a 712 MB file containing the target word at the very end:
    use strict;
    use warnings;

    my $fname = 'test.dat';
    my $csize = 102400;                 # chunk size in bytes
    my $word  = 'bingo';
    my ($handle, $length, $pos, $c);
    my $end = '';                       # word fragment carried over from the previous chunk

    open($handle, $fname) or die "Can't open $fname: $!";
    while (1) {
        $length = read($handle, $_, $csize);
        if (index($end.$_, $word) != -1) {
            # Found the word: convert the string index back to a file offset
            # (allowing for the carried fragment), back up ~100 bytes and
            # re-read a small window around the match.
            $pos = tell($handle) - $length - length($end)
                   + index($end.$_, $word) - 100;
            $pos = 0 if $pos < 0;
            seek($handle, $pos, 0);
            read($handle, $_, 220);

            $c = 0;
            for (@_ = split(/\W+/)) {
                if ($_ eq $word) {
                    print join(' ', @_[ (($c > 4) ? $c - 5 : 0)
                                     .. (($c < $#_ - 4) ? $c + 5 : $#_) ]);
                    last;
                }
                $c++;
            }
            last;
        }
        last if (!$length);
        ($end) = m/[\W](\w*$)/;         # keep the last (possibly partial) word for the next pass
    }
    close($handle);
    Decidedly inefficient, and I haven't provided for multiple matches yet either. It would probably be much better to index the file for all words occurring fewer than x times, and/or use a system utility to find the locations of matches. Even 30 seconds is unacceptable, and under 5 would be a lot better.
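    The indexing could be done offline, outside the CGI request. A minimal sketch, assuming DB_File (Berkeley DB) is available; the index file name, the lower-casing and the space-separated offset lists are illustrative choices, and this version indexes every word rather than only the rarer ones:

    # Offline indexer: map each word to the byte offsets where it occurs.
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    my $corpus = shift or die "usage: $0 corpusfile\n";

    tie my %index, 'DB_File', "$corpus.index.db", O_CREAT|O_RDWR, 0644, $DB_HASH
        or die "Can't tie index: $!";

    open my $in, '<', $corpus or die "Can't open $corpus: $!";
    while ( my $line = <$in> ) {
        my $line_start = tell($in) - length($line);
        while ( $line =~ /(\w+)/g ) {
            my $offset = $line_start + pos($line) - length($1);
            # append this occurrence's offset to the word's entry
            my $w = lc $1;
            $index{$w} = defined $index{$w} ? "$index{$w} $offset" : $offset;
        }
    }
    close $in;
    untie %index;
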
Re: searching for a keyword with context window
by chromatic (Archbishop) on Nov 02, 2004 at 22:54 UTC

    I'd hate to have to code an efficient search algorithm myself, so I might shell out to grep -A 1 -B 1 keyword filename, join the lines together, and do the five-word windowing on the results.
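    A rough sketch of what that might look like, assuming GNU grep is on the PATH; the variable names and the windowing loop are mine, not chromatic's, and one caveat is that joining all of grep's output before windowing can mix words from different matches:

    # Let grep find the matching lines (with one line of context either
    # side), then do the five-word windowing on its much smaller output.
    use strict;
    use warnings;

    my ($keyword, $filename) = @ARGV;

    open my $grep, '-|', 'grep', '-A', '1', '-B', '1', '--', $keyword, $filename
        or die "Can't run grep: $!";

    local $/;                      # slurp grep's output
    my $text = <$grep>;
    close $grep;
    defined $text or die "no matches for '$keyword'\n";

    $text =~ s/^--$//mg;           # drop grep's group separators
    $text =~ s/\s+/ /g;            # join the lines together

    my @words = split ' ', $text;
    for my $i ( grep { $words[$_] =~ /\Q$keyword\E/i } 0 .. $#words ) {
        my $lo = $i > 5            ? $i - 5 : 0;
        my $hi = $i + 5 <= $#words ? $i + 5 : $#words;
        print join(' ', @words[ $lo .. $hi ]), "\n";
    }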

Re: searching for a keyword with context window
by ikegami (Patriarch) on Nov 02, 2004 at 23:02 UTC

    I'm guessing your kwicionary is rather constant. If so, you can build an index (word => file position) outside of the CGI script, and store it to disk. There must be algorithms out there to build index files which don't need to be completely loaded into memory, but I'm not familiar with them. On second thought, a database would be perfect to store the index.

    The CGI script could (a rough sketch follows the list):

    1. Locate the position of the word using the index file/table.
    2. Open the dictionary file.
    3. Seek to the position found in the index file.
    4. Read past the word.
    5. Read five words. Those are your Next Five Words.
    6. Reopen the dictionary file with File::ReadBackwards.
    7. Seek to the position found in the index file.
    8. Read five words. Those are your Previous Five Words.
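    Here is a rough sketch of those steps, assuming the index has already been built offline as a word => byte-offset table (here a tied DB_File hash; the file names are illustrative). One deviation: for the previous five words it reads a fixed-size buffer ending at the indexed position and keeps its last five words, rather than reading with File::ReadBackwards as in steps 6-8, which keeps the sketch to calls I'm certain of:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    my ($keyword, $dictfile) = @ARGV;

    # 1. Locate the byte position of the word via the index (first occurrence).
    tie my %index, 'DB_File', "$dictfile.index.db", O_RDONLY, 0644, $DB_HASH
        or die "Can't tie index: $!";
    my $entry = $index{ lc $keyword };
    defined $entry or die "'$keyword' is not in the index\n";
    my ($pos) = split ' ', $entry;

    # 2-3. Open the dictionary file and seek to that position.
    open my $dict, '<', $dictfile or die "Can't open $dictfile: $!";
    seek $dict, $pos, 0 or die "seek failed: $!";

    # 4-5. Read a block, step past the keyword, keep the next five words.
    read $dict, my $ahead, 4096;
    my @next = split ' ', $ahead;
    shift @next;                            # the keyword itself
    my @next_five = splice @next, 0, 5;

    # 6-8. Previous five words: read the bytes just before the position
    #      and keep the last five words of that buffer.
    my $back_start = $pos >= 4096 ? $pos - 4096 : 0;
    seek $dict, $back_start, 0 or die "seek failed: $!";
    read $dict, my $behind, $pos - $back_start;
    my @prev = split ' ', $behind;
    my @prev_five = @prev > 5 ? @prev[ -5 .. -1 ] : @prev;

    print join(' ', @prev_five, $keyword, @next_five), "\n";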