Just a couple additional points that you might find useful:
Keep your stop words in a separate text file, and read from that file to build up the regex as kennethk suggested above (this way, the stop-word list can be maintained separately from the program code):
open( STOPWORDS, '<', 'stopword.list' ) or die "Cannot open stopword.list: $!";
my @stopwords = <STOPWORDS>;
chomp @stopwords;
my $stopregex = join '|', map qr/\b\Q$_\E\b/, @stopwords;
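One thing to watch with a separate list file: a blank line becomes an empty stop word, and a line with trailing spaces becomes a pattern that will almost never match. Here is a minimal sketch of a more defensive read. It assumes an in-memory string standing in for stopword.list, and the sample words and text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of 'stopword.list'
my $list = "dan\n\n  yang  \ndi\n";

open( my $fh, '<', \$list ) or die "Cannot open in-memory list: $!";
my @stopwords;
while ( my $line = <$fh> ) {
    $line =~ s/^\s+|\s+$//g;     # trim whitespace, including the newline
    next unless length $line;    # skip blank lines
    push @stopwords, $line;
}
close $fh;

# Same construction as above: one \b...\b pattern per word, joined with '|'
my $stopregex = join '|', map qr/\b\Q$_\E\b/, @stopwords;

my $text = "ikan dan nasi yang enak di dapur";
$text =~ s/$stopregex//g;
print "[$text]\n";    # [ikan  nasi  enak  dapur]
```

The doubled spaces left behind do not matter if you split on whitespace afterwards.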
You have a lot of substitutions being done on a word-by-word basis (after splitting each line on whitespace); while the "optimization" value is probably not significant (unless you get into a very large collection of input text), the coding would be much simpler using tr/// on the lines, then splitting to get the tokens:
my %freq;
while (<FILE>) {
    s/<.+?>/ /g;    # replace tags with spaces
    tr/A-Z0-9?!.,:;()*"`'-/a-z /s;    # convert upper- to lower-case, and
                                      # also convert digits, punct to space
    # NOTE: check your output to see whether any other punctuation or
    # non-word characters are getting through, and add those to the tr///
    # as needed; also: hyphens might need to be treated differently from
    # other punctuation (keep as-is, or delete, instead of converting to space?)
    s/$stopregex//g;    # remove any/all stop words
    # at this point, line should contain only word tokens, but
    # use grep, just in case:
    for my $token ( grep /[a-z]/, split ) {    # only count tokens with letters
        $freq{$token}++;
    }
}
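To see what each step does, here is the same loop run over one sample line instead of a file. This is only a sketch: the stop words and the input line are invented for the example, and the array stands in for the FILE handle.

```perl
use strict;
use warnings;

# Hypothetical stop words and input line, just for illustration
my $stopregex = join '|', map qr/\b\Q$_\E\b/, qw(pada dan juga lain);
my @lines = ('<p>Harga BBM naik pada 2012, dan harga-harga lain juga naik.</p>');

my %freq;
for (@lines) {
    s/<.+?>/ /g;                      # tags become spaces
    tr/A-Z0-9?!.,:;()*"`'-/a-z /s;    # lower-case; digits and punctuation become spaces
    s/$stopregex//g;                  # drop stop words
    for my $token ( grep /[a-z]/, split ) {
        $freq{$token}++;
    }
}

print "$_ $freq{$_}\n" for sort keys %freq;
# bbm 1
# harga 3
# naik 2
```

Note that the hyphen in "harga-harga" is turned into a space here, so the two halves are counted separately. That is the hyphen question raised in the NOTE comment.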
Last thing: I don't know if you intended it, but one of the quote characters in the OP code (that is, one of the marks being removed by s///) was apparently a non-ASCII character (U+201D, "right double quotation mark"). If you really are putting utf8 characters in your code, you may need to include use utf8; and if your data is utf8 text, you may need to set utf8 mode in the open statement: open( FILE, '<:utf8', $filename )
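A small sketch of both pieces together, under some assumptions: the sample text is invented, an in-memory byte string stands in for the data file, and I use the stricter '<:encoding(UTF-8)' layer here rather than '<:utf8' (the two differ mainly in that the former validates the input).

```perl
use strict;
use warnings;
use utf8;    # the source code below contains literal curly quotes
use Encode qw(encode);

# Hypothetical sample data: UTF-8 bytes as they would sit in a file
my $bytes = encode( 'UTF-8', "kata “penting” ini\n" );

open( my $fh, '<:encoding(UTF-8)', \$bytes ) or die "Cannot open: $!";
my $line = <$fh>;
close $fh;
chomp $line;

$line =~ tr/“”//d;    # delete both curly quotes, each as a single character

binmode STDOUT, ':encoding(UTF-8)';
print "[$line]\n";    # [kata penting ini]
```

Without use utf8 and a decoding layer on the handle, each curly quote is seen as three separate bytes, and a tr/// or character class would operate on those bytes individually.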
On line 22, you have a vertical bar instead of a slash before 'b'.
BTW, why do you use a double vertical bar? To Perl it means 'or nothing or': the empty alternative between the two bars matches the empty string.
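A quick way to see the effect, as a sketch with two made-up alternatives (not the OP's actual pattern): with the doubled bar, the pattern succeeds even on text that contains neither word.

```perl
use strict;
use warnings;

my $text = "zzz";    # contains neither 'dan' nor 'di'

# '||' puts an empty alternative between the two bars
my $double = ( $text =~ /dan||di/ ) ? 1 : 0;
my $single = ( $text =~ /dan|di/ )  ? 1 : 0;

print "double bar matched: $double\n";    # 1 (the empty alternative matches)
print "single bar matched: $single\n";    # 0
```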
choroba has pointed out bugs in your regular expression (see Metacharacters in perlre). To avoid this sort of mistake, rather than typing all that in, you should consider building a regular expression from a list of words, like perhaps:
# stop-word removal
my @words = qw(
untuk
dari
di
yang
dan
ini
itu
atau
pada
ke
adalah
setelah
selalu
daripada
dengan
dalam
akan
juga
tidak
karena
tersebut
ada
bisa
sebagai
sudah
saat
oleh
harus
menjadi
secara
last
modified
lebih
hanya
para
telah
seperti
sementara
kepada
namun
sangat
lalu
belum
bagi
tak
kalau
bahwa
tetapi
dapat
antara
banyak
kembali
saja
atas
hingga
melalui
terjadi
tapi
sampai
tentang
sama
agar
memang
lagi
selama
mencapai
terus
yakni
the
terhadap
ketika
merupakan
sehingga
sebuah
jika
bukan
jadi
sejumlah
sejak
perlu
mulai
jelas
pun
masih
mengatakan
menurut
sekitar
lain
melakukan
baru
beberapa
hal
);
my $regex = join '|', map qr/\b\Q$_\E\b/, @words;
$kata =~ s/$regex//g;
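The \b anchors are what keep short stop words like "di" from eating pieces of longer words. A minimal sketch with two entries from the list and an invented sample string ($bounded, $unbounded and $loose are names made up for the comparison):

```perl
use strict;
use warnings;

my @words = qw(di dan);    # two entries from the list above
my $text  = "dia dan pada di kota";

# With word boundaries, as in the code above
my $regex = join '|', map qr/\b\Q$_\E\b/, @words;
( my $bounded = $text ) =~ s/$regex//g;

# Same words without \b, for comparison only
my $loose = join '|', map { quotemeta($_) } @words;
( my $unbounded = $text ) =~ s/$loose//g;

print "[$bounded]\n";      # [dia  pada  kota]
print "[$unbounded]\n";    # [a  pada  kota]
```

Without the boundaries, "dia" loses its first two letters.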
Other changes you might consider include:
- strict and warnings are good. See Use strict warnings and diagnostics or die.
- A more natural way of expressing $#ARGV + 1 != 1 might be @ARGV != 1
- Your second $kata =~ tr/[A-Z]/[a-z]/; is unnecessary, since you already lower-cased everything when building %freq.
- You have a whole bunch of substitutions for removing characters. Looking at them, I wonder if you really mean what you have written. For example, do you really want to remove the three-character sequence "`”, or do you mean to remove any occurrence of these three characters? (The escape before " is unnecessary.) I think you would probably get the result you actually want by replacing $kata =~ s/\d+//g;, $kata =~ s/[!.,()*]|\"`”//g; and $kata =~ s/-+//g; with $kata =~ s/[\d!.,()*"`”\-+]//g;
Update: Corrected oversight in replacement RE in 4. Thanks choroba.
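To illustrate the sequence-versus-class point from the last item, here is a sketch on an invented string ($seq and $class are names made up for the comparison). It assumes use utf8 so that the curly quote counts as one character:

```perl
use strict;
use warnings;
use utf8;    # the curly quote in the patterns is a single character

my $kata = q{a"b`c”d "`” e};

# Alternation as written in the OP: the three quotes only go as one sequence
( my $seq = $kata ) =~ s/[!.,()*]|"`”//g;

# Character class: each of the listed characters is removed wherever it occurs
( my $class = $kata ) =~ s/[\d!.,()*"`”\-+]//g;

binmode STDOUT, ':encoding(UTF-8)';
print "[$seq]\n";      # [a"b`c”d  e]
print "[$class]\n";    # [abcd  e]
```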
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Thanks to choroba & kennethk
Great :D It works in a blink.
I've got an output.dat with the right words written in it, exactly the same as in the input text.
Well, I'm ashamed of my coding ability, because my code is still messy, but thanks for helping me :)
By the way, the comment in my code is written in Indonesian, the language of my country.
I realize how cool programming languages are: even people who speak different languages can think alike through program code :D
First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
I try to, it just seems I couldn't make my messy code understandable, hehe.