Quicksilver has asked for the wisdom of the Perl Monks concerning the following question:

I crave somebody's benevolence in trying to improve something in this script. The code finds the number of words and which line the individual words is on bt I'm also trying to get the surrounding words. I'm aware that this will cause performance issues but the data will eventually go into a database.
foreach $line ( @theseLines ) { $count++; $line = lc $line; $line =~ s/[.,:;?!]//g; while ( $line =~ /(\p{Alnum}+([-']\p{Alnum}+)*)/g ) { $word = $&; if ( $word =~ /\s/ || $word eq "") { next } $Count{$word}++; if ( defined $Line{$word} ) { $Line{$word} =~ m/(\d*?)$/; if ( $1 == $count ) { next; } else { $Line{$word} .= ", $count"; } } else { $Line{$word} = $count; } } } <br /> <br /> @theseWords = keys %Line; @theseWords = sort @theseWords; foreach $word ( @theseWords ) { my @lines = split m/,/, $Line{$word}; for (@lines){ print ("$word, $_"); } }
What I'd like to get, if "Mary had a little lamb" was on line 8 and I'm searching in little, is:
little, 8, a little lamb.
Thanks in advance for any help.

Replies are listed 'Best First'.
Re: Finding word either side of a word match
by moritz (Cardinal) on Mar 03, 2008 at 13:48 UTC
    It seems from your questions that you are building something like a search engine. Maybe How to build a Search Engine. is of interest for you (a very good node IMHO), and maybe this article on perl.com.

    If you insist on creating your index manually, here is what you could do: in the while loop create a list of all found words, and then iterate over all of them again:

    # assuming you stored all matches in @words my %word_context; for (1 .. $#words-1){ push @{$word_context{$words[$_]}, [$words[$_ - 1], $words[$_ + 1]]; }

    Though it might be better to store context and line number in the same data structure.

    You also have to think about the first and the last word, which are special in that they don't have two words of context each. What do you want to do with them?

      Thanks for the articles which I've skimmed over to read fully later. What I'm trying to do is to create a concordance which allows the user to search a text and then find all the occurences of a word and some sample text to work out if that's the section they are looking for and where it is in the text.

      Its part of a personal project to try and create some useful textual analytical tools. Also it seemed like a good way to extend my nascent knowledge of Perl into something practical whilst learning. I'll need to think about those two words.
        If you want to display context, then there's a better solution: For each each word store the position of the word in the file (in bytes) in the DB. When you want to show the context, you just seek that position (or let's say $position - 20), and read the next few bytes.

        That way you have to keep the indexed files at hand, but you avoid storing every word thrice in the DB.

Re: Finding word either side of a word match
by amarquis (Curate) on Mar 03, 2008 at 13:57 UTC

    In addition the big picture issues moritz talks about, here are a couple of little things to think about:

    @theseWords = keys %Line; @theseWords = sort @theseWords; foreach $word ( @theseWords ) {

    You don't have to create a temporary array for each step here, you can just write:

    foreach my $word ( sort keys @theseWords) {

    Also, it looks like you've slurped the whole file into an array of lines at the beginning, and for large files you'd probably be better off just processing directly line by line.

Re: Finding word either side of a word match
by thundergnat (Deacon) on Mar 03, 2008 at 15:37 UTC

    Updated to use less memory, and make it easier to do some Markov chain analysis should you so desire. Tweaked regex for some corner cases.

    You may just want to build your index and get it over with. Note: this will use, as a wild guess, at least 3 times as much memory as the file size, so be careful with large files.

    use warnings; use strict; my %index; while ( my $line = <DATA> ) { $line = lc $line; $line =~ s/^\P{Alnum}+|\P{Alnum}+$//g; my @words = split /\P{Alnum}*\s\P{Alnum}*/, $line; for ( 0 .. $#words ) { my $word = $words[$_]; $index{$word}{count}++; my ( $pre, $post ) = ( '', '' ); if ( $_ > 0 ) { $pre = $words[ $_ - 1 ]; } if ( $_ < $#words ) { $post = $words[ $_ + 1 ]; } push @{ $index{$word}{lines} }, [ $., $pre, $post ]; } } for my $word ( sort keys %index ) { print "$word - $index{$word}{count} time" . ( $index{$word}{count} == 1 ? '' : 's' ) . ":\n"; printf " Line %4d - %s $word %s\n", @$_ for ( @{ $index{$word}{ +lines} } ); print "\n"; } __DATA__ Mary had a little lamb, A little pork, a little jam, A little fish, some kangaroo, A pudding and some cookies too, An ice cream soda topped with fizz, And boy how sick our Mary is. Mary had a little lamb, Her daddy shot it dead. And now it goes to school with her, Between two hunks of bread.
Re: Finding word either side of a word match
by BrowserUk (Patriarch) on Mar 03, 2008 at 14:08 UTC

    It might be a bit quicker using a regex:

    #! perl -slw use strict; my $word = 'little'; while( <DATA> ) { m[( [^ \n]*[^\w\n]* (\b\Q$word\E\b) [^\w\n]*[^ \n]* )]x and print "$2, $., $1"; } __DATA__ little boy blue come blow on your horn give a little, take a lot give a lot, take a little Mary had a little lamb Don't belittle the little guy

    Gives:

    [14:04:33.30]C:\test>junk4 little, 1, little boy little, 2, a little, take little, 3, a little little, 4, a little lamb little, 5, the little guy

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I don't think that's quicker, in terms of run time.

      Quicksilver wants to do it for every word in the file, which means that using a regex to search for it will scale like O(nē) (n being the length of the input file) in time, while the traversal of the files is possible in O(n).

        You're right. I completely misread the question. split seems a likely candidate:

        #! perl -slw use strict; while( <DATA> ) { my @words = split /[^a-zA-Z']+/; for ( 0 .. $#words ) { printf "%s, $., %s %s %s\n", $words[ $_ ], $_ ? $words[ $_ -1 ] : '', $words[ $_ ], $_ < $#words ? $words[ $_ + 1 ] : '' ; } } __DATA__ little boy blue come blow on your horn give a little, take a lot give a lot, take a little Mary had a little lamb Don't belittle the little guy

        Outputs

        [14:37:07.78]C:\test>junk4 little, 1, little boy boy, 1, little boy blue blue, 1, boy blue come come, 1, blue come blow blow, 1, come blow on on, 1, blow on your your, 1, on your horn horn, 1, your horn give, 2, give a a, 2, give a little little, 2, a little take take, 2, little take a a, 2, take a lot lot, 2, a lot give, 3, give a a, 3, give a lot lot, 3, a lot take take, 3, lot take a a, 3, take a little little, 3, a little Mary, 4, Mary had had, 4, Mary had a a, 4, had a little little, 4, a little lamb lamb, 4, little lamb Don't, 5, Don't belittle belittle, 5, Don't belittle the the, 5, belittle the little little, 5, the little guy guy, 5, little guy

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.