in reply to Re: Cleaning up text for indexing in DB
in thread Cleaning up text for indexing in DB

"This won't help key-word scanning in a résumé."
qw(This won t help key word scanning in a r sum)
Apostrophes are part of a word but quotes are not. Hyphens are part of a word but dashes are not. The distinctions in a typed text are subtle. And the definition of a letter is not as simple as m/[a-z]/i.
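
The list above can be reproduced with a minimal sketch. It assumes the cleanup under discussion is a plain "replace everything outside a-z with a space" substitution; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                                  # the string literal below is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');

my $text = "This won't help key-word scanning in a résumé.";

# Assumed naive cleanup: every character outside a-z (either case) becomes a space.
(my $clean = $text) =~ s/[^a-z]/ /gi;

# Splitting on whitespace shows what an indexer would end up storing.
my @words = split ' ', $clean;
print join(' ', @words), "\n";
```

The apostrophe, the hyphen, and both accented letters each split a word, so none of "won't", "key-word", or "résumé" survives as a searchable term.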

--
[ e d @ h a l l e y . c c ]

Re: Re: Re: Cleaning up text for indexing in DB
by Cody Pendant (Prior) on Jul 17, 2003 at 10:09 UTC
    When working on something similar, I decided that a word could contain a-z, single quotes, and hyphens. I then had to code around words wrapped in single quotes, so it wasn't as simple as /[a-z'-]/.

    I think I ended up with

    my @words = /(\w[\w'-]*\w|\w+)/g;
    or similar.
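
    As a rough check of how that pattern behaves, here is a small sketch. The sample sentence is invented for illustration; note that \w also admits digits and underscores, which may or may not be wanted.

```perl
use strict;
use warnings;

# Hypothetical sample text: an apostrophe inside a word, a hyphenated
# word, a word wrapped in single quotes, and two one-letter words.
my $text = "Don't drop a 'quoted' word or a key-word pair.";

# A word must start and end with a word character; apostrophes and
# hyphens are only allowed in between. The \w+ alternative catches
# one-letter words such as "a".
my @words = $text =~ /(\w[\w'-]*\w|\w+)/g;

print join('|', @words), "\n";
```

    Because a match has to begin and end on a word character, the quotes around 'quoted' are left out while the apostrophe in "Don't" and the hyphen in "key-word" are kept.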

    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
Re: Re: Re: Cleaning up text for indexing in DB
by Skeeve (Parson) on Jul 17, 2003 at 05:48 UTC
    I know, especially because I'm German: our letter class would have to become (at least) s/[^a-zäöüßÄÖÜ]/ /i because of our umlauts.

    Nevertheless, it's a first shot, and it seemed to help TVSET solve his problem.
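
    One possible way to avoid listing each language's letters by hand is the Unicode property \p{L} ("any letter"). This is a sketch of that alternative, not the substitution used in the thread; it assumes the text has already been decoded into Perl characters (here via use utf8 on a literal), and the sample sentence is invented.

```perl
use strict;
use warnings;
use utf8;                                  # the string literal below is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');

my $text = "Grüße aus München: Äpfel, Öl und Übermut!";

# Replace every run of non-letters with a single space. \p{L} matches
# letters from any script, so umlauts and ß need no special listing.
(my $clean = $text) =~ s/[^\p{L}]+/ /g;

my @words = split ' ', $clean;
print join('|', @words), "\n";
```

    Apostrophes and hyphens are still treated as separators here, so this only addresses the letter definition, not the word-boundary subtleties raised earlier in the thread.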