PerlMonks  

How much can this text processing be optimized?

by YAFZ (Pilgrim)
on May 16, 2005 at 13:36 UTC ( [id://457448] )

YAFZ has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have written some simple Perl code to do text processing, but I'm sure it can be optimized. The task is to process a text file, delete the garbage from it, count the words (using a hash), and then print the list of words alphabetically with their frequencies in a second column. By garbage I mean strings that contain numbers, like: 345asfg435t, 345gf, er345, etc.

So here is the code (some characters in the regex patterns are Turkish letters):
#!/usr/bin/perl

while (<STDIN>) {
    s/(\d+)?([a-zA-ZıüğşçöİŞÖĞÜ]+\d+)+|(\d+[a-zA-ZıüğşçöİŞÖĞÜ]+)//og;
    $MyString = $MyString . $_;
}

while ($MyString =~ /([a-zA-ZıüğşçöİŞÖĞÜ]+)/og) {
    $temp = $1;
    $temp =~ tr/[A-Z]/[a-z]/;
    $MyWordCount{$temp}++;
}

print "Word\t\t\tFrequency\n";
print "======\t\t\t===========\n";

# sorting alphabetically
foreach $key (sort keys %MyWordCount) {
    if (length($key) > 7) {
        print "$key\t\t$MyWordCount{$key}\n";
    }
    else {
        print "$key\t\t\t$MyWordCount{$key}\n";
    }
}


This code is run on Win2K ActiveState perl version v5.8 in such a fashion:

perl csharpversusperl.pl < input.txt > output.txt

I'm comparing it with C# programs (written for an assignment); the C# programs also use command-line redirection, not some system file-processing methods. The current situation is that this Perl code takes 10 seconds to process a 750K file, but one of the C# programs (not very optimized, though) takes about 3-4 seconds.

I'd like to know whether this code could be made much faster, and whether I'm doing something wrong. Please enlighten me.

Replies are listed 'Best First'.
Re: How much can this text processing be optimized?
by ides (Deacon) on May 16, 2005 at 13:50 UTC

    Right now you're pulling the entire file into a scalar while removing the junk, and only then lowercasing and counting the words. Do this instead (just pseudocode, since I can't really read your regexes):

    while (<STDIN>) {
        s/remove junk strings//og;
        my @words = split(/\s+/, $_);   # split into words
        foreach my $w (@words) {
            $MyWordCount{lc($w)}++;
        }
    }

    I would imagine that is faster than what you are currently doing. I say this because this way you aren't having to accumulate everything into one big scalar.
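    Filling in the placeholder regex, a concrete version of this sketch might look like the following. This is my own simplification, assuming a "junk" token is any whitespace-separated token that contains a digit (345asfg435t, 345gf, er345, ...):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %MyWordCount;
while (<STDIN>) {
    foreach my $w ( split /\s+/, $_ ) {
        next if $w eq '';     # leading whitespace yields an empty field
        next if $w =~ /\d/;   # skip garbage tokens containing digits
        $MyWordCount{ lc $w }++;
    }
}
printf "%s\t%d\n", $_, $MyWordCount{$_} for sort keys %MyWordCount;
```

    One caveat: splitting on whitespace keeps punctuation attached to words ("words!" and "words" count separately); splitting on /\W+/ instead, as holli's reply does, avoids that.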

    Frank Wiles <frank@wiles.org>
    http://www.wiles.org

      Much better now! I was thinking along similar lines about that putting-everything-into-a-huge-scalar-at-once issue, and trying your solution gave about a 2 sec. runtime. The C# code does one thing more, separating each word into syllables, but that's another story. Thanks for the hint.
Re: How much can this text processing be optimized?
by holli (Abbot) on May 16, 2005 at 13:59 UTC
    How does this perform?
    use strict;
    use warnings;

    my %h;
    while (<DATA>) {
        $h{$_}++ for map { lc } grep { /^[a-zA-Z]+$/ } split /\W+/;
    }
    for (sort keys %h) {
        printf "%10s %05d\n", $_, $h{$_};
    }
    __DATA__
    This line contains garbag3, and words!
    This line does not.
    Note: You will need to add your special characters to the character class used in the grep block, and alter my script to read from STDIN.
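    A minimal adaptation along those lines might look like this. It is my own sketch, assuming the same Turkish letters as in the original post and that the input is decoded correctly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: adds the Turkish letters to the character class and
# reads from STDIN instead of DATA. Depending on how the input file
# is encoded, you may also need a suitable encoding layer (binmode),
# since split /\W+/ and lc treat non-ASCII bytes differently under
# different encodings.
my %h;
while (<STDIN>) {
    $h{$_}++ for map { lc }
                 grep { /^[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+$/ }
                 split /\W+/;
}
printf "%10s %05d\n", $_, $h{$_} for sort keys %h;
```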

    Update:

    Output:
           and 00001
      contains 00001
          does 00001
          line 00002
           not 00001
          this 00002
         words 00001
    Update:

    Running this on a 1.5 MB file containing the __DATA__ section repeated takes <1s on my system. (1.4 GHz AMD, Perl 5.8, Win XP)


    holli, /regexed monk/
      It took about 1.1 sec. with my command-line-redirection version (using your adapted code) and reminded me of Common Lisp. Now I'm a happier programmer. Thanks for reminding me of the powerful map! ;-)
Re: How much can this text processing be optimized?
by thundergnat (Deacon) on May 16, 2005 at 14:31 UTC

    Well several things could be improved.

    There is no need to remove the words that you don't want from the string, just include the words that you do want.

    If you are going to use precompiled regexes, I think you will get better performance by moving the regex outside of the loop.

    As was previously mentioned, accumulating the file into a scalar and then acting on the scalar is going to give you a large performance hit.

    If I were going to do something like this (and I have in the past), I would probably write it like this:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $word = qr/\b[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+\b/;

    my %mywordcount;
    while (<STDIN>) {
        my $line = $_;
        while ($line =~ /($word)/g) {
            $mywordcount{(lc $1)}++;
        }
    }

    print "Word\t\t\tFrequency\n";
    print "======\t\t\t===========\n";

    # sorting alphabetically
    print "$_\t\t", (length($_) > 7) ? '' : "\t", $mywordcount{$_}, "\n"
        for sort keys %mywordcount;

    Update: Changed to STDIN fh, I tested with default and forgot to change before posting.

    FWIW, on my system, this takes about 2 seconds to process a 1.6 MB file.

    Also, it isn't specified in the original post, but if you want to allow for words with an internal apostrophe (don't, you'll, you're, etc.), change the line

    while ($line =~ /($word)/g){

    to

    while ($line =~ /($word('$word)?)/g){

    Update 2: Sigh. I realized that my regex wouldn't work correctly with words containing non-ASCII characters either. (\b doesn't work for multi-byte characters.)

    Should be:

    my $word = qr/(?<!\p{Alpha})[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+(?!\p{Alpha})/;
Re: How much can this text processing be optimized?
by mrborisguy (Hermit) on May 16, 2005 at 13:49 UTC
    if the only criterion for garbage is having a number in it, your first while loop could be:
    while (<STDIN>) {
        $MyString = $MyString . $_ unless ( /\d/ );
    }
    I think that would help somewhat. (Untested code.)
Re: How much can this text processing be optimized?
by Roger (Parson) on May 16, 2005 at 13:48 UTC
    Well, your code can be optimized, but in pure Perl it will be very hard to get a performance boost of 3x over what it was. If it is execution speed you are looking for, why not try the Inline::C module and write your function in C? It will certainly give you the performance boost you are looking for.
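    For the record, the Inline::C mechanism looks roughly like this. It is a toy sketch of my own (assuming Inline::C and a C compiler are installed) that counts runs of ASCII letters, not the full Turkish-aware counter:

```perl
use strict;
use warnings;
use Inline C => <<'END_C';
/* Toy stand-in for the hot loop: count runs of ASCII letters.
   A real version would also need the Turkish characters handled. */
int count_words(char *s) {
    int count = 0, in_word = 0;
    for (; *s; s++) {
        int is_alpha = (*s >= 'a' && *s <= 'z') || (*s >= 'A' && *s <= 'Z');
        if (is_alpha && !in_word) { count++; in_word = 1; }
        if (!is_alpha)            { in_word = 0; }
    }
    return count;
}
END_C

print count_words("two words, one 345gf token\n"), "\n";   # prints 5
```

    The first run pays a one-time compilation cost; after that the compiled function is cached, so the per-run overhead is small.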

Node Type: perlquestion [id://457448]
Approved by Roger