Thanks for the articles, which I've skimmed and will read fully later. What I'm trying to do is create a concordance: the user searches a text and gets every occurrence of a word, each with some sample text, so they can work out whether that's the section they are looking for and where it is in the text.

It's part of a personal project to create some useful textual analysis tools. It also seemed like a good way to extend my nascent knowledge of Perl into something practical while learning. I'll need to think about those two words.

Re^3: Finding word either side of a word match
by moritz (Cardinal) on Mar 03, 2008 at 14:51 UTC
    If you want to display context, then there's a better solution: for each word, store its position in the file (in bytes) in the DB. When you want to show the context, you just seek to that position (or, say, $position - 20) and read the next few bytes.

    That way you have to keep the indexed files at hand, but you avoid storing every word thrice in the DB.
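A minimal sketch of the seek-and-read step described above, assuming the byte offset has already been stored. The helper name `context_at` and its `$before`/`$after` parameters are hypothetical, chosen only for illustration; `seek` and `read` are the standard Perl built-ins.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: return the bytes from $before bytes ahead of a
# stored offset up to $after bytes past it.
sub context_at {
    my ($file, $position, $before, $after) = @_;

    # Open in raw mode so that offsets are counted in bytes.
    open(my $handle, '<:raw', $file) or die "Can't read '$file': $!";

    my $start = $position - $before;
    $start = 0 if $start < 0;    # don't seek before the start of the file

    seek($handle, $start, SEEK_SET) or die "Can't seek in '$file': $!";

    my $length = ($position - $start) + $after;
    my $got = read($handle, my $buffer, $length);
    die "Can't read from '$file': $!" unless defined $got;

    close $handle;
    return $buffer;
}
```

Something like `print context_at($file, $position, 20, 40), "\n";` would then show 20 bytes before the word and 40 bytes from its start onwards. With a multi-byte encoding the window may begin or end in the middle of a character, so the result might need trimming before display.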

      That's a far more elegant solution :)
      What would be the best way of finding the position in bytes? It's not something that I've come across yet.
        You can slurp the whole file into memory like this:

        open (my $handle, '<', $file) or die "Can't read '$file': $!";
        # Undefine the input record separator locally so that one read returns the whole file
        my $contents = do { local $/; <$handle> };

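        And then, when you match against that string with the /g modifier, you can query pos $contents to get the position of the match. Strictly, pos returns the offset just past the end of the match; $-[0] gives the offset where it starts. As long as the string holds the raw, undecoded contents of the file, these offsets are the same as positions in bytes. (Note that you will run into trouble with multi-byte encodings this way.)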
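A small sketch of querying match offsets on a slurped string. The sample text in `$contents` and the `@hits` array are made up for illustration; `pos` and the `@-` array are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the slurped file contents.
my $contents = "the cat sat on the mat";

my @hits;
while ($contents =~ m/\bthe\b/g) {
    my $start = $-[0];            # offset where the match begins
    my $end   = pos $contents;    # offset just past the end of the match
    push @hits, [$start, $end];
}

printf "match from %d to %d\n", @$_ for @hits;
```

The start offset is probably the one worth storing, since that is what a later seek needs.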

        Another way is to read the file line by line, and track the number of characters that have been consumed so far:

        my $pos = 0;
        while (<$handle>){
            my $line_len = length $_; # do that before chomping
            chomp;
            while (m/(\w+)/g){
                my $word = $1;
                # $-[0] is where the word starts within the line;
                # pos would give the offset just past its end
                my $word_pos = $pos + $-[0];
            }
            $pos += $line_len;
        }

        See pos.