read whole file in a directory

ask91 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to create word list from input text file by graff (Chancellor) on Mar 29, 2012 at 05:25 UTC
Just a couple of additional points that you might find useful:

Keep your stop words in a separate text file, and read from that file to build up the regex as kennethk suggested above (this way, the stop-word list can be maintained separately from the program code):

    open( STOPWORDS, '<', 'stopword.list' ) or die "stopword.list: $!";
    my @stopwords = <STOPWORDS>;
    chomp @stopwords;
    my $stopregex = join '|', map qr/\b\Q$_\E\b/, @stopwords;

You have a lot of substitutions being done on a word-by-word basis (after splitting each line on whitespace). The "optimization" value is probably not significant (unless you get into a very large collection of input text), but the coding would be much simpler using `tr///` on the lines, then splitting to get the tokens:

    my %freq;
    while (<FILE>) {
        s/<.+?>/ /g;                      # replace tags with spaces
        tr/A-Z0-9?!.,:;()*"`'-/a-z /s;    # convert upper- to lower-case, and
                                          # also convert digits, punct to space
        # NOTE: check your output to see whether any other punctuation or
        # non-word characters are getting through, and add those to the tr///
        # as needed; also: hyphens might need to be treated differently from
        # other punctuation (keep as-is, or delete, instead of converting to space?)
        s/$stopregex//g;                  # remove any/all stop words
        # at this point, line should contain only word tokens, but
        # use grep, just in case:
        for my $token ( grep /[a-z]/, split ) {    # only count tokens with letters
            $freq{$token}++;
        }
    }

Last thing: I don't know if you intended it, but one of the quote characters in the OP code (that is, one of the marks being removed by `s///`) was apparently a non-ASCII character (U+201D, "right double quotation mark"). If you really are putting utf8 characters in your code, you may need to include `use utf8;`. If your data is utf8 text, you may need to set utf8 mode in the open statement: `open( FILE, '<:utf8', $filename )`
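The steps graff describes (strip tags, fold case and punctuation with `tr///`, drop stop words, count what is left) can be sketched end to end as follows. This is only a minimal illustration, not the OP's program: the three stop words and the sample text are made up, and an in-memory filehandle stands in for both `stopword.list` and the real input file.

```perl
use strict;
use warnings;

# Hypothetical stop words; graff's version reads them from 'stopword.list'.
my @stopwords   = qw( dan di yang );
my $alternation = join '|', map { quotemeta } @stopwords;
my $stopregex   = qr/\b(?:$alternation)\b/;

# Hypothetical input text, read through an in-memory filehandle.
my $input = "<p>Buku dan pena di meja.</p>\nBuku yang baru, BUKU 2012!\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my %freq;
while ( my $line = <$fh> ) {
    $line =~ s/<.+?>/ /g;                      # replace tags with spaces
    $line =~ tr/A-Z0-9?!.,:;()*"`'-/a-z /s;    # lower-case; digits and punctuation to space
    $line =~ s/$stopregex//g;                  # remove stop words
    $freq{$_}++ for grep { /[a-z]/ } split ' ', $line;
}
close $fh;

printf "%-6s %d\n", $_, $freq{$_} for sort keys %freq;
```

The `\b` anchors matter here: without them, a short stop word such as "di" would also be cut out of the middle of longer words.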
Re: how to create word list from input text file by choroba (Cardinal) on Mar 28, 2012 at 14:54 UTC
On line 22, you have a vertical bar instead of a backslash before 'b'. BTW, why do you use a double vertical bar? It means 'or nothing or' to Perl.
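To see what 'or nothing or' means in practice, here is a small sketch comparing an alternation joined with `|` against one joined with `||`. The two patterns are hypothetical and use just two of the stop words; the point is that the empty alternative between the two bars matches the empty string.

```perl
use strict;
use warnings;

# Hypothetical patterns: the same two stop words, joined with '|' and with '||'.
my $single = qr/\A(?:dari|yang)\z/;
my $double = qr/\A(?:dari||yang)\z/;    # note the empty middle alternative

for my $candidate ( 'dari', 'yang', '' ) {
    printf "%-6s single=%s double=%s\n",
        "'$candidate'",
        ( $candidate =~ $single ? 'match' : 'no' ),
        ( $candidate =~ $double ? 'match' : 'no' );
}
```

Only the `||` version accepts the empty string, which is why an unanchored `s/...||.../.../g` can find a "match" at every position in the text.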
Re: how to create word list from input text file by kennethk (Abbot) on Mar 28, 2012 at 15:44 UTC
choroba has pointed out bugs in your regular expression (see Metacharacters in perlre). To avoid this sort of mistake, rather than typing all that in, you should consider building a regular expression from a list of words, like perhaps:

    # stop-word removal
    my @words = qw(
        untuk dari di yang dan ini itu atau pada ke adalah setelah selalu
        daripada dengan dalam akan juga tidak karena tersebut ada bisa sebagai
        sudah saat oleh harus menjadi secara last modified lebih hanya para
        telah seperti sementara kepada namun sangat lalu belum bagi tak kalau
        bahwa tetapi dapat antara banyak kembali saja atas hingga melalui
        terjadi tapi sampai tentang sama agar memang lagi selama mencapai terus
        yakni the terhadap ketika merupakan sehingga sebuah jika bukan jadi
        sejumlah sejak perlu mulai jelas pun masih mengatakan menurut sekitar
        lain melakukan baru beberapa hal
    );
    my $regex = join '|', map qr/\b\Q$_\E\b/, @words;
    $kata =~ s/$regex//g;

Other changes you might consider include:

1. strict and warnings are good. See Use strict warnings and diagnostics or die.
2. A more natural way of expressing `$#ARGV + 1 != 1` might be `@ARGV != 1`.
3. Your second `$kata =~ tr/[A-Z]/[a-z]/;` is unnecessary, since you already lower-cased everything when building `%freq`.
4. You have a whole bunch of substitutions for removing characters. Looking at them, I wonder if you really mean what you have written. For example, do you really want to remove the three character sequence "`”, or do you mean remove any occurrence of these three characters? (The escape before `"` is unnecessary.) I think you would probably get your actual desired result by replacing `$kata =~ s/\d+//g;`, $kata =~ s/[!.,()]|\"`”//g; and `$kata =~ s/-+//g;` with $kata =~ s/[\d!.,()"`”\-+]//g;

Update: Corrected oversight in replacement RE in 4. Thanks choroba.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
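The difference raised in point 4, between removing a character sequence and removing any one of a set of characters, can be checked with a short sketch. The sample string is made up, and the patterns are simplified ASCII-only variants of the ones quoted above (the curly quote is left out so that no `use utf8;` is needed).

```perl
use strict;
use warnings;

# Hypothetical sample line with quotes, punctuation and digits.
my $text = q{"Halo", dunia (2012)!};

# Alternation: one of !.,() OR the literal two-character sequence "`
( my $alt_result = $text ) =~ s/[!.,()]|"`//g;

# Character class: any single digit or listed punctuation mark
( my $class_result = $text ) =~ s/[\d!.,()"`\-+]//g;

print "alternation: [$alt_result]\n";
print "class:       [$class_result]\n";
```

With the alternation, the double quotes survive because they are never followed by a backtick; the character class removes each listed character wherever it occurs.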
Re^2: how to create word list from input text file by ask91 (Initiate) on Mar 28, 2012 at 17:11 UTC
Thanks to choroba & kennethk, great.. :D It works: in a blink I've got an output.dat with the right words written in it, exactly the same as in the input text. Well.. I'm ashamed of my ability to build a program, because it's still messy, but thanks for helping me :) The comments in my code are written in Indonesian, by the way (the language of my country). I realize how cool programming languages are: even people who speak different languages can think together through program code :D

    First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

I try to.. it just seems I could not get it to understand my messy code.. hehe