cyocum has asked for the wisdom of the Perl Monks concerning the following question:
Hello fellow Perlmonks,
I have a medium-sized text file that I would like to index by word and line number, so that I can find where a given word occurs in the file. I have already written some code (appended below), but I was wondering whether there are any good plain-text indexers in Perl that also store where in a file a particular word is. I have looked at Apache Lucene. The only problem I have with it is that it does not store where in a file a particular word is, only that the word is in the file. I could use the find feature of whatever text editor I am using, but I would like to be able to search for a term across several files.
I have also looked at two nodes, here and here.
Any ideas?
UPDATE: I have updated the code a bit since I was not stripping the unwanted characters before I did some other things to the text.
use strict;
use warnings;
use utf8;
use IO::File;

# the file to index
my $inFile = "c:\\temp\\texts\\T100001A.txt";
# the file to store the index information
my $indexFile = "c:\\temp\\index\\t100001a.index";

my $inFh  = IO::File->new($inFile, "r")    or die "Cannot open $inFile: $!";
my $outFh = IO::File->new($indexFile, "w") or die "Cannot open $indexFile: $!";

my $lineNum = 0;
my %index;

while (my $line = <$inFh>) {
    $lineNum++;
    chomp $line;

    my @words = split /\s/, $line;

    # strip unwanted punctuation and lower-case each word in place
    foreach my $word (@words) {
        $word =~ s/,$|\.$|\[|\]|\(|\)|;|:|!//g;
        $word = lc $word;
    }

    @words = grep { !inStopList($_) } @words;
    @words = grep { removeNullEntries($_) } @words;

    # autovivification creates the array reference on first use
    foreach my $word (@words) {
        push @{ $index{$word} }, $lineNum;
    }
}

print "done indexing\n";

# one line per word: word=line,line,...
foreach my $key (keys %index) {
    print $outFh $key, "=", join(',', @{ $index{$key} }), "\n";
}
$outFh->close;

# returns true if the word should be left out of the index
sub inStopList {
    my $word = shift;
    my @stopList = ("the", "a", "an", "of", "and", "on", "in", "by",
                    "with", "at", "he", "after", "into", "their", "is",
                    "that", "they", "for", "to", "it", "them", "which");

    return 1 if $word =~ /p\.(\d)+/;    # tokens such as p.12
    return 1 if $word =~ /-{5,}/;       # runs of five or more hyphens
    foreach my $stopWord (@stopList) {
        return 1 if $word eq $stopWord;
    }
    return 0;
}

# returns true for non-empty words
sub removeNullEntries {
    my $word = shift;
    return $word ? 1 : 0;
}
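Since the goal is to search for a term over several files, one possible extension is to key the index by word, then by file, then by line number. The following is only a rough sketch under stated assumptions: index_lines and lookup are hypothetical helper names, the punctuation stripping is deliberately crude, no stop list is applied, and the demo feeds in-memory lines labelled with made-up file names instead of reading real files.

```perl
use strict;
use warnings;

# Hypothetical helper: add every word in @$lines to %$index, recording
# the source label (e.g. a file name) and the 1-based line number.
sub index_lines {
    my ($index, $label, $lines) = @_;
    my $lineNum = 0;
    for my $line (@$lines) {
        $lineNum++;
        for my $word (split ' ', lc $line) {
            $word =~ s/[^\w'-]//g;    # crude punctuation strip (assumption)
            next unless length $word;
            push @{ $index->{$word}{$label} }, $lineNum;
        }
    }
    return $index;
}

# Hypothetical helper: return "label:line" strings for a word,
# ordered by label; returns an empty list if the word is not indexed.
sub lookup {
    my ($index, $word) = @_;
    my $hits = $index->{ lc $word } or return;
    return map {
        my $label = $_;
        map { join(':', $label, $_) } @{ $hits->{$label} }
    } sort keys %$hits;
}

# Demo with in-memory "files"; the labels are made up for illustration.
my %index;
index_lines(\%index, 'a.txt', [ 'The king rode north.', 'No king returned.' ]);
index_lines(\%index, 'b.txt', [ 'A king, a queen.' ]);
print join(', ', lookup(\%index, 'king')), "\n";
# prints: a.txt:1, a.txt:2, b.txt:1
```

To use this with real files, each file's lines could be read into an array and passed in with the file name as the label. A word that appears twice on one line is recorded twice, as in the code above.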
Re: Simple Text Indexing
by jeffa (Bishop) on Nov 29, 2003 at 17:40 UTC
by cyocum (Curate) on Nov 29, 2003 at 18:16 UTC
by diotalevi (Canon) on Nov 29, 2003 at 18:46 UTC
by cyocum (Curate) on Nov 30, 2003 at 12:02 UTC
by ysth (Canon) on Nov 30, 2003 at 12:10 UTC
by diotalevi (Canon) on Dec 08, 2003 at 21:05 UTC

Re: Simple Text Indexing
by Cody Pendant (Prior) on Nov 29, 2003 at 22:01 UTC

Re: Simple Text Indexing
by broquaint (Abbot) on Nov 29, 2003 at 23:23 UTC

Re: Simple Text Indexing
by cyocum (Curate) on Nov 30, 2003 at 01:33 UTC

Re: Simple Text Indexing
by bl0rf (Pilgrim) on Dec 01, 2003 at 01:00 UTC
by cyocum (Curate) on Dec 01, 2003 at 09:10 UTC

Re: Simple Text Indexing
by cyocum (Curate) on Jan 05, 2004 at 13:26 UTC

CLucene module for perl
by dpavlin (Friar) on Dec 01, 2003 at 19:43 UTC
by cyocum (Curate) on Dec 02, 2003 at 22:18 UTC
by quinkan (Monk) on Jan 31, 2004 at 15:58 UTC

Re: Simple Text Indexing
by cyocum (Curate) on Mar 10, 2004 at 12:51 UTC