in reply to Simple Text Indexing

My first recommendation was MMM::Text::Search, but it too only tells you which file(s) the search words were found in, not where.

What are you building this for? Wouldn't a simple brute force loop be enough if you are only dealing with one file?

use strict;
use warnings;

my $match = qr/thunderbird/i;
open my $fh, '<', 'foo.txt' or die $!;
while (<$fh>) {
    next unless /$match/;
    my @word = split /\s+/, $_;
    for my $i ( 0 .. $#word ) {
        print "match found on line $. word ", $i + 1, "\n"
            if $word[$i] =~ $match;
    }
}
close $fh;

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Replies are listed 'Best First'.
Re: Re: Simple Text Indexing
by cyocum (Curate) on Nov 29, 2003 at 18:16 UTC

    Good point. I am a graduate student in the University of Edinburgh's Celtic Studies department. Most of what I am working with comes from the CELT site (here). Their search engine does not cover the annals (look for the Chronicon Scotorum or the Annals of Ulster in the published section), and it is horribly slow along with many other bugs, so I need to be able to search quickly by keyword, not only in the Chronicon Scotorum but across all the annals. I am using the one file as a testbed before branching out into multi-file indexing and searching (at some point).

    I hope this helps answer some of your questions.

    Thanks for your input!

      Why not create a dictionary of words mapped to lists of offsets? Store the byte location of the start of each line; then, when you want to retrieve a match, you seek immediately to the right location and print that line. Another idea is to store the offset of the n-th previous line so you can print some context.

      I've written a function for you that accepts a filename and an optional number of lines of context (the default being 1). You'll probably want to store the index somewhere using Storable so it's convenient to re-use your index later.

      my @files = glob "*.txt";
      my %file_idx = map {; $_ => index_file( $_, 5 ) } @files;

      # The index has this shape:
      # {
      #     'foobar.txt' => {
      #         word    => [ 1, 3, 5, 6 ],
      #         another => [ 5, 7, 2 ],
      #     },
      #     'barfoo.txt' => { ... },
      # }

      sub index_file {
          my $filename         = shift;
          my $lines_of_context = @_ && $_[0] > 0 ? shift() : 1;

          open my $fh, '<', $filename or die "Couldn't open $filename: $!";

          my @offsets;
          my %index;
          my $pos = 0;    # byte offset where the current line starts
          while ( my $line = <$fh> ) {
              push @offsets, $pos;
              $pos = tell $fh;

              # Index each word under the offset from $lines_of_context
              # lines back, so a later seek lands on the context window.
              my $offset = @offsets < $lines_of_context
                  ? $offsets[0]
                  : shift @offsets;

              for my $word ( split ' ', $line ) {
                  push @{ $index{$word} }, $offset;
              }
          }
          close $fh or warn "Couldn't close $filename: $!";

          return \%index;
      }
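      To make the seek-back idea concrete, here is a small self-contained sketch; the file name and its contents are made up for illustration. It records the byte offset where each line starts, then jumps straight to a stored offset instead of re-scanning the file.

```perl
use strict;
use warnings;

# Write a throwaway demo file (hypothetical data).
my $file = 'demo.txt';
open my $out, '>', $file or die $!;
print $out "alpha one\nbravo two\ncharlie three\n";
close $out;

# Index: word => byte offset of the start of its line.
my %start_of;
open my $fh, '<', $file or die $!;
my $pos = 0;
while ( my $line = <$fh> ) {
    $start_of{$_} //= $pos for split ' ', $line;
    $pos = tell $fh;    # start of the next line
}

# Retrieval: seek straight to the line containing "charlie".
seek $fh, $start_of{charlie}, 0;
my $hit = <$fh>;
print $hit;             # prints "charlie three"
close $fh;
unlink $file;
```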

        I have implemented your function and added a few things from the code above, including Cody Pendent's /\s+/ suggestion. I should have thought of these myself, since I have done similar things in the past.

        The resulting Storable file is only a few thousand K larger than my original text index file. Now all I need to do is implement a searcher. I may rewrite this with an object structure for ease of maintenance at some point, but I like the way things are headed. I still need to put the stop list in a hash, but I have some obligations I need to take care of today. I have appended the code with my changes below.

        I still have one more question, however. Why do I need so many chdir calls? It seems unable to deal with "\..\index" or anything of that sort. Thanks again!

        use strict;
        use warnings;
        use utf8;
        use Storable;

        chdir "texts" or die "Couldn't chdir to texts: $!";
        my @files = glob "*.txt";
        my %file_idx = map {; $_ => index_file( $_, 5 ) } @files;
        chdir ".." or die $!;
        chdir "index" or die "Couldn't chdir to index: $!";
        store \%file_idx, "text.idx";

        # The index has this shape:
        # { 'foobar.txt' => { word => [ 1, 3, 5, 6 ], another => [ 5, 7, 2 ] },
        #   'barfoo.txt' => { ... } }

        sub index_file {
            my $filename         = shift;
            my $lines_of_context = @_ && $_[0] > 0 ? shift() : 1;

            open my $fh, "<", $filename or die "Couldn't open $filename: $!";

            my @offsets;
            my %index;
            my $pos = 0;    # byte offset where the current line starts
            while ( my $line = <$fh> ) {
                push @offsets, $pos;
                $pos = tell $fh;

                my $offset = @offsets < $lines_of_context
                    ? $offsets[0]
                    : shift @offsets;

                for my $word ( split /\s+/, $line ) {
                    $word = lc $word;
                    $word =~ s/,$|\.$|\[|\]|\(|\)|;|:|!//g;
                    next if inStopList($word);
                    next if $word =~ /p\.(\d)+/;    # skip page references
                    next if $word =~ /\-{5,}/;      # skip rule lines
                    push @{ $index{$word} }, $offset;
                }
            }
            close $fh or warn "Couldn't close $filename: $!";

            return \%index;
        }

        sub inStopList {
            my $word = shift;
            my @stopList = ( "the", "a", "an", "of", "and", "on", "in", "by",
                "with", "at", "he", "after", "into", "their", "is", "that",
                "they", "for", "to", "it", "them", "which" );
            for my $stopWord (@stopList) {
                return $word if $word eq $stopWord;
            }
            return;
        }
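        A sketch of the still-missing searcher could look like the following. It assumes the { filename => { word => [offsets] } } layout stored above; the demo file and index here are made up so the function can be exercised on its own rather than against the real text.idx.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Look a keyword up in a { filename => { word => [offsets] } } index
# and return the matching lines, seeking instead of re-reading files.
sub search_index {
    my ( $file_idx, $keyword ) = @_;
    $keyword = lc $keyword;    # the indexer lowercases words
    my @hits;
    for my $file ( sort keys %$file_idx ) {
        my $offsets = $file_idx->{$file}{$keyword} or next;
        open my $fh, '<', $file or die "Couldn't open $file: $!";
        for my $offset (@$offsets) {
            seek $fh, $offset, 0;
            my $line = <$fh>;
            chomp $line;
            push @hits, "$file:$offset: $line";
        }
        close $fh;
    }
    return @hits;
}

# Hypothetical demo data, stored and retrieved the same way as text.idx.
open my $out, '>', 'annals.txt' or die $!;
print $out "the king died\na battle was fought\n";
close $out;
my %file_idx = ( 'annals.txt' => { battle => [14] } );
store \%file_idx, 'demo.idx';

my @hits = search_index( retrieve('demo.idx'), 'Battle' );
print "$_\n" for @hits;    # prints "annals.txt:14: a battle was fought"

unlink 'annals.txt', 'demo.idx';
```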