in reply to Cleaning up text for indexing in DB

If it's just for indexing, I'd first try it with...

    $_ = join('', <INFILE>);
    s/\s+/ /g;        # clean all whitespace
    s/<[^>]*>//g;     # clean all HTML-like tags
    s/[^a-z]/ /gi;    # remove all but letters
    grep ++$count{$_} && undef, split;
This will give you a hash (%count) of all words, with each word's number of occurrences as its value.
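
For anyone who wants to try the idea without a file handle, here is a minimal, self-contained sketch of the same pipeline. The helper name count_words and the sample string are made up for illustration. Two deliberate deviations from the snippet above: tags are replaced by a space instead of being deleted, so that words separated only by a tag do not run together, and words are lower-cased before counting.

```perl
use strict;
use warnings;

# count_words is a hypothetical helper name, not part of the original post.
sub count_words {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;    # HTML-like tags become a space (the post deletes them outright)
    $text =~ s/[^a-z]/ /gi;    # everything except ASCII letters becomes a space
    my %count;
    $count{ lc $_ }++ for split ' ', $text;   # lower-casing is an added assumption
    return \%count;
}

# Made-up sample input standing in for the contents of INFILE.
my $count = count_words("<p>The cat saw\nthe <b>other</b> cat.</p>");
printf "%-6s %d\n", $_, $count->{$_} for sort keys %$count;
```

The explicit whitespace-collapsing step is left out here because split ' ' already treats any run of whitespace as a single separator.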

Re: Re: Cleaning up text for indexing in DB
by halley (Prior) on Jul 16, 2003 at 13:55 UTC
    "This won't help key-word scanning in a résumé."
    qw(This won t help key word scanning in a r sum)
    Apostrophes are part of a word but quotes are not. Hyphens are part of a word but dashes are not. The distinctions in a typed text are subtle. And the definition of a letter is not as simple as m/[a-z]/i.

    --
    [ e d @ h a l l e y . c c ]

      I decided when working on something similar that a word for me could contain a-z, single-quotes, and hyphens, then had to code around words in single quotes, so it wasn't as simple as /[a-z'-]/.

      I think I ended up with

      my @words = /(\w[\w'-]*\w|\w+)/g;
      or similar.
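
      A quick sketch of how that pattern behaves on a made-up sentence with an apostrophe, a hyphenated word, a word in single quotes, and a free-standing dash. The helper name extract_words and the sample text are only for illustration. Note that \w also matches digits and the underscore, so this is looser than "letters only".

```perl
use strict;
use warnings;

# extract_words is a hypothetical helper name wrapping the pattern above.
sub extract_words {
    my ($text) = @_;
    # A word starts and ends with a word character; apostrophes and
    # hyphens are only allowed in between. Single characters fall
    # through to the \w+ alternative.
    return $text =~ /(\w[\w'-]*\w|\w+)/g;
}

my $line  = q{"This won't help key-word scanning," he said, 'honestly' - really.};
my @words = extract_words($line);
print join('|', @words), "\n";
```

      Because a match has to begin and end with a word character, the quotes around 'honestly' and the lone dash are dropped, while won't and key-word stay intact.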

      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
      M-J D
      I know. Especially because I'm German and our letter-line would have to become (at least) s/[^a-zäöüßÄÖÜ]/ /gi because of our umlauts.

      Nevertheless, it's a first shot, and it seemed to help TVSET solve his problem.
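
      One possible way to avoid listing every accented letter by hand is the Unicode property \p{L} ("any letter"). The sketch below assumes the text is already a decoded character string; here that comes from use utf8 on a literal, while real input would typically be read through an :encoding(UTF-8) layer. The helper name letters_only and the sample text are made up.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal non-ASCII characters

# letters_only is a hypothetical helper name.
sub letters_only {
    my ($text) = @_;
    $text =~ s/\P{L}+/ /g;      # every run of non-letters becomes one space
    return split ' ', $text;    # split on whitespace, ignoring leading blanks
}

my @words = letters_only("Größe und résumé, 2003!");

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @words), "\n";
```

      This still splits on apostrophes and hyphens, so it only addresses the "what is a letter" half of the problem.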

Re: Re: Cleaning up text for indexing in DB
by TVSET (Chaplain) on Jul 16, 2003 at 16:26 UTC
    Thanks a lot. That is very close to what I wanted. I'll need to play with the "Remove all but letters" line, but overall it's very near. :)

    Leonid Mamtchenkov aka TVSET