in reply to Re: Re: Simple Text Indexing
in thread Simple Text Indexing

Why not create a dictionary of words, each with a list of offsets? Store the byte offset of the start of each line, so that when you want to retrieve a match you can seek straight to that location and print the line. Another idea is to store the offset of the n-th previous line so you can print some context.

I've written a function for you that accepts a filename and an optional number of lines of context (the default is 1). You'll probably want to save the index somewhere using Storable so it's convenient to reuse later.

my @files    = glob "*.txt";
my %file_idx = map {; $_ => index_file( $_, 5 ) } @files;

=pod

{
  'foobar.txt' => {
                    word    => [ 1, 3, 5, 6 ],
                    another => [ 5, 7, 2, ]
                  },
  'barfoo.txt' => { ....... }
}

=cut

sub index_file {
    my $filename         = shift;
    my $lines_of_context = $_[0] > 0 ? shift() : 1;

    open my $fh, "<", $filename or die "Couldn't open $filename: $!";

    my @offsets;                  # start offsets of the most recent lines
    my %index;                    # word => [ byte offsets to seek to ]
    my $line_start = tell $fh;    # byte offset where the next line starts
    while ( my $line = <$fh> ) {
        # Record where this line began, not where it ended.
        push @offsets, $line_start;
        $line_start = tell $fh;

        # Until enough lines have been read, context starts at the first line.
        my $offset = scalar( @offsets ) < $lines_of_context
                     ? $offsets[0]
                     : shift @offsets;

        for my $word ( split ' ', $line ) {
            push @{ $index{$word} }, $offset;
        }
    }

    close $fh or warn "Couldn't close $filename: $!";

    return \ %index;
}
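
For the retrieval side, seeking to a stored offset and printing from there might look roughly like the sketch below. The helper name lines_at and its argument order are made up for illustration; it assumes the offsets in the index are byte positions of line starts in the same, unmodified file.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Hypothetical helper (not part of the post above): return up to $count
# lines of $filename, starting at byte offset $offset.
sub lines_at {
    my ( $filename, $offset, $count ) = @_;

    open my $fh, "<", $filename or die "Couldn't open $filename: $!";
    seek $fh, $offset, SEEK_SET or die "Couldn't seek in $filename: $!";

    my @lines;
    while ( @lines < $count and defined( my $line = <$fh> ) ) {
        push @lines, $line;
    }

    close $fh or warn "Couldn't close $filename: $!";
    return @lines;
}
```

With an index built using 5 lines of context, calling lines_at( $file, $offset, 5 ) for each stored offset would print the matching line together with the lines leading up to it.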

Re: Re: Re: Re: Simple Text Indexing
by cyocum (Curate) on Nov 30, 2003 at 12:02 UTC

    I have implemented your function and added a few things from the code above including Cody Pendent's /\s+/ suggestion. I should have thought of these myself since I have done similar stuff in the past.

    The resulting Storable file is only a few thousand K larger than my original text index file. Now all I need to do is implement a searcher. I may rewrite this into an object structure for ease of maintenance at some point, but I like the way things are headed. I still need to move the stop list into a hash, but I have some obligations I need to take care of today. I have appended the code with my changes below.

    I still have one more question however. Why do I need so many chdir functions? It seems unable to deal with "\..\index" or any of that sort of stuff. Thanks again!

    use strict;
    use warnings;
    use utf8;
    use Storable;

    chdir "texts";

    my @files    = glob "*.txt";
    my %file_idx = map {; $_ => index_file( $_, 5 ) } @files;

    chdir "\..";
    chdir "index";

    store \%file_idx, "text.idx";

    =pod

    {
      'foobar.txt' => {
                        word    => [ 1, 3, 5, 6 ],
                        another => [ 5, 7, 2, ]
                      },
      'barfoo.txt' => { ....... }
    }

    =cut

    sub index_file {
        my $filename         = shift;
        my $lines_of_context = $_[0] > 0 ? shift() : 1;

        open my $fh, "<", $filename or die "Couldn't open $filename: $!";

        my @offsets;                  # start offsets of the most recent lines
        my %index;                    # word => [ byte offsets to seek to ]
        my $line_start = tell $fh;    # byte offset where the next line starts
        while ( my $line = <$fh> ) {
            # Record where this line began, not where it ended.
            push @offsets, $line_start;
            $line_start = tell $fh;

            my $offset = scalar( @offsets ) < $lines_of_context
                         ? $offsets[0]
                         : shift @offsets;

            for my $word ( split /\s+/, $line ) {
                $word = lc $word;

                # Strip trailing commas and periods, plus brackets and punctuation.
                $word =~ s/,$|\.$|\[|\]|\(|\)|;|:|!//g;

                if(&inStopList($word)) {
                    next;
                }elsif($word =~ /p\.(\d)+/) {
                    next;    # skip page references such as "p.12"
                }elsif($word =~ /\-{5,}?/) {
                    next;    # skip rules made of dashes
                }

                push @{ $index{$word} }, $offset;
            }
        }

        close $fh or warn "Couldn't close $filename: $!";

        return \ %index;
    }

    sub inStopList {
        my $word = shift;

        my @stopList = ("the", "a", "an", "of", "and", "on", "in", "by",
                        "with", "at", "he", "after", "into", "their", "is",
                        "that", "they", "for", "to", "it", "them", "which");

        foreach my $stopWord (@stopList) {
            if($word eq $stopWord) {
                return $word;
            } else {
                next;
            }
        }

        return;    # not a stop word
    }

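
For the searcher mentioned above, one possible starting point is sketched here. The name search_index is hypothetical; the sketch assumes the stored file holds the filename => { word => [ offsets ] } structure produced by the script above, and that words were lowercased at indexing time.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical helper: return [ filename, offset ] pairs for $word,
# reading the index back from a file written with Storable's store().
sub search_index {
    my ( $index_file, $word ) = @_;

    # retrieve() returns a reference to the stored hash.
    my $file_idx = retrieve($index_file);

    my @hits;
    for my $filename ( sort keys %$file_idx ) {
        my $offsets = $file_idx->{$filename}{ lc $word } or next;

        # The same offset is stored once per occurrence, so drop repeats.
        my %seen;
        push @hits, map { [ $filename, $_ ] } grep { !$seen{$_}++ } @$offsets;
    }

    return @hits;
}
```

Each returned pair can then be handed to whatever routine seeks into the file and prints the surrounding lines.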
      Why do I need so many chdir functions? It seems unable to deal with "\..\index" or any of that sort of stuff.
      $x = "\..\index"; actually sets $x to "..index" (with a warning about the "\i" on some perl versions).

      Two things to fix: first, try using forward slash or doubling your backslash; second, '\..' is not meaningful, since the root directory should have no parent.
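
A small sketch of the quoting options, for comparison. The directory and file names are the ones from the script above; File::Spec ships with Perl, and using it here is a suggestion rather than something the thread settled on.

```perl
use strict;
use warnings;
use File::Spec;

# Single quotes keep backslashes literally.
my $single = '\..\index';

# Inside double quotes a literal backslash has to be doubled.
my $doubled = "..\\index";

# Forward slashes need no escaping.
my $forward = "../index";

# File::Spec builds the path from its parts using the local separator.
my $portable = File::Spec->catfile( File::Spec->updir, 'index', 'text.idx' );

print "$_\n" for $single, $doubled, $forward, $portable;
```

The last form avoids hand-written separators entirely, at the cost of being a little more verbose.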

        Ah yes of course! It has been so long since I programmed that I had forgotten these things. Thanks!

      Comments are inline.

      use strict;
      use warnings;
      use Storable;
      use vars qw($CONTEXT_LINES $TEXTS %STOPWORDS $INDEX);

      $CONTEXT_LINES = 3;
      @STOPWORDS{
          # Prefer reading these from a file.
          qw( the a an of and on in by with at he after into
              their is that they for to it them which)
      } = ();

      # prefer texts/*.txt over "chdir 'text'; glob '*.txt'"
      my @files    = glob "texts/*.txt";
      my %file_idx = map {; $_ => index_file( $_, $CONTEXT_LINES ) } @files;

      store \%file_idx, "index/text.idx";

      =pod

      {
        'text.idx' => {
                        word    => [ 1, 3, 5, 6 ],
                        another => [ 5, 7, 2, ]
                      },
        'barfoo.txt' => { ....... }
      }

      =cut

      sub index_file {
          my $filename         = shift;
          my $lines_of_context = $_[0] > 0 ? shift() : 1;

          open my $fh, "<", $filename or die "Couldn't open $filename: $!";

          my @offsets;
          my %index;
          my $line_start = tell $fh;    # byte offset where the next line starts
          while ( my $line = <$fh> ) {
              # Record where this line began, not where it ended.
              push @offsets, $line_start;
              $line_start = tell $fh;

              my $offset = scalar( @offsets ) < $lines_of_context
                           ? $offsets[0]
                           : shift @offsets;

              # Prefer ' ' over /\s+/ here. See perlfunc about this.
              for my $word ( split ' ', $line ) {
                  $word = lc $word;

                  # Prefer character classes to alternation when possible
                  $word =~ s/[,.]$|[\][();:!]//g;

                  next if exists $STOPWORDS{$word}

                      # Prefer (\d+) to (\d)+ (unless that is *really* what you mean)
                      or $word =~ /p\.(\d+)/

                      # Remove the '?': it only makes the quantifier non-greedy,
                      # which does not change whether this match succeeds.
                      or $word =~ /-{5,}/;

                  push @{ $index{$word} }, $offset;
              }
          }

          close $fh or warn "Couldn't close $filename: $!";

          return \ %index;
      }
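
To make two of the inline comments concrete, here is a short sketch of how split ' ' differs from split /\s+/ on a line with leading whitespace, and of the hash-based stop word test. The sample line and the three-word stop list are invented for the example.

```perl
use strict;
use warnings;

my $line = "  The quick  fox\n";

# split ' ' skips leading whitespace, so no empty field appears.
my @by_space = split ' ', $line;

# split /\s+/ treats the leading whitespace as a separator,
# which yields an empty string as the first field.
my @by_regex = split /\s+/, $line;

# A hash slice gives constant-time stop word checks with exists.
my %STOPWORDS;
@STOPWORDS{ qw( the a an ) } = ();

my @kept = grep { !exists $STOPWORDS{ lc $_ } } @by_space;

print scalar(@by_space), " fields with ' '\n";
print scalar(@by_regex), " fields with /\\s+/\n";
print "kept: @kept\n";
```

With /\s+/ the empty first field would end up in the index as a word unless it is filtered out separately, which is one reason to prefer the ' ' form here.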