Re: Cleaning up text for indexing in DB

If it's just for indexing, I'd first try it with...

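$_ = join('', <INFILE>);   # slurp the whole file into $_
s/\s+/ /g;                 # collapse each run of whitespace into one space
s/<[^>]*>//g;              # strip all HTML-like tags
s/[^a-z]/ /gi;             # Remove all but letters (each other character becomes a space)
$count{$_}++ for split;    # count every word in %count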

This will give you a hash, %count, whose keys are the words and whose values are the number of times each word occurs.
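For anyone who wants to try the idea without a file handle, here is a minimal self-contained sketch of the same pipeline. The sub name `count_words` and the sample string are made up for illustration. Substituting a space for each tag (instead of deleting it) is my own assumption, so that something like `foo<br>bar` does not fuse into one word.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): returns a hash ref
# mapping each word to the number of times it occurs in $text.
sub count_words {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;              # assumption: tags become a space
    $text =~ s/[^a-z]+/ /gi;             # every run of non-letters becomes one space
    my %count;
    $count{$_}++ for split ' ', $text;   # split ' ' ignores leading whitespace
    return \%count;
}

my $count = count_words('<p>The cat saw the <b>other</b> cat.</p>');
printf "%-6s %d\n", $_, $count->{$_} for sort keys %$count;
```

Case is preserved here, so "The" and "the" are counted separately; lowercase the text with `lc` first if the index should be case-insensitive.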

Re: Re: Cleaning up text for indexing in DB by halley (Prior) on Jul 16, 2003 at 13:55 UTC
"This won't help key-word scanning in a résumé." `qw(This won t help key word scanning in a r sum)` [download] Apostrophes are part of a word but quotes are not. Hyphens are part of a word but dashes are not. The distinctions in a typed text are subtle. And the definition of a letter is not as simple as `m/[a-z]/i`. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l] [select]
Re: Re: Re: Cleaning up text for indexing in DB by Cody Pendant (Prior) on Jul 17, 2003 at 10:09 UTC
I decided when working on something similar that a word for me could contain a-z, single-quotes, and hyphens, then had to code around words in single quotes, so it wasn't as simple as `/[a-z'-]/`. I think I ended up with `my @words = /(\w[\w'-]*\w|\w+)/g;` or similar. “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
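To see what a pattern of that shape does with quoted words, here is a small sketch. It assumes the pattern was along the lines of `\w[\w'-]*\w|\w+` (the post says "or similar", so treat the exact form as a guess), and the sample sentence is invented.

```perl
use strict;
use warnings;

my $text = "It's a 'well-known' trick, isn't it?";

# A word must start and end with a word character; apostrophes and
# hyphens are allowed only in between, so enclosing quotes are dropped.
# The \w+ alternative picks up one-character words such as "a".
my @words = $text =~ /(\w[\w'-]*\w|\w+)/g;

print join('|', @words), "\n";   # prints: It's|a|well-known|trick|isn't|it
```

Two caveats: `\w` also matches digits and underscores, and a plural possessive such as "dogs'" loses its trailing apostrophe, which may or may not matter for indexing.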
Re: Re: Re: Cleaning up text for indexing in DB by Skeeve (Parson) on Jul 17, 2003 at 05:48 UTC
I know. Especially because I'm German, and the letters line would have to become (at least) `s/[^a-zäöüßÄÖÜ]/ /gi` because of our Umlaute. Nevertheless, it's a first shot and it seemed to help TVSET in solving his problem.
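Rather than listing accented letters by hand, one option (an alternative, not what the post above used) is the Unicode property `\p{L}`, which matches any letter. A rough sketch, assuming the script itself is saved as UTF-8 and the text has already been decoded to characters; the sample string is invented:

```perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';    # encode output as UTF-8

my $text = 'Größe: 42, Übermut & Straße!';

# Replace every run of characters that are not Unicode letters with a space.
(my $clean = $text) =~ s/\P{L}+/ /g;
$clean =~ s/^ | $//g;                  # trim a leading or trailing space

print "$clean\n";                      # prints: Größe Übermut Straße
```

When the text comes from a file, it needs decoding on the way in, for example by opening the handle with an `:encoding(UTF-8)` layer; otherwise `\p{L}` sees raw bytes instead of letters.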
Re: Re: Cleaning up text for indexing in DB by TVSET (Chaplain) on Jul 16, 2003 at 16:26 UTC
Thanks a lot. That is very close to what I wanted. I'll need to play with the "Remove all but letters" line, but overall it's very near. :) Leonid Mamtchenkov aka TVSET