Well, several things could be improved.

There is no need to remove the words that you don't want from the string; just match the words that you do want.

If you are going to use precompiled regexes, I think you will get better performance by moving the regex outside of the loop.
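A minimal sketch of those two points together, assuming a simplified ASCII-only pattern and a hypothetical helper named `count_words` (neither is from the original post): the pattern is compiled once with `qr//` before the loop, and the loop extracts the wanted words instead of deleting the unwanted ones.

```perl
use strict;
use warnings;

# Compile the pattern once, before any loop that uses it.
# (Simplified ASCII-only pattern, for illustration only.)
my $word = qr/[a-zA-Z]+/;

# Hypothetical helper: count the words found in a list of lines.
sub count_words {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # Extract the words we want rather than stripping what we don't.
        $count{ lc $1 }++ while $line =~ /($word)/g;
    }
    return \%count;
}

my $count = count_words( "The cat, the dog.", "A cat!" );
print "$_ => $count->{$_}\n" for sort keys %$count;
# prints:
# a => 1
# cat => 2
# dog => 1
# the => 2
```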

As was previously mentioned, accumulating the file into a scalar and then acting on the scalar is going to give you a large performance hit.

If I were going to do something like this (and I have in the past), I would probably write it like this:

#! /usr/bin/perl
use warnings;
use strict;

my $word = qr/\b[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+\b/;
my %mywordcount;

while (<STDIN>) {
    my $line = $_;
    while ($line =~ /($word)/g){
        $mywordcount{(lc $1)}++;
    }
}

print "Word\t\t\tFrequency\n";
print "======\t\t\t===========\n";

#sorting alphabetically
print "$_\t\t", (length($_) > 7) ? '' : "\t", $mywordcount{$_}, "\n"
    for sort keys %mywordcount;

Update: Changed to the STDIN filehandle. I tested with the default and forgot to change it before posting.

FWIW, on my system, this takes about 2 seconds to process a 1.6 MB file.

Also, it isn't specified in the original post, but if you want to allow for words with an internal apostrophe (don't, you'll, you're, etc.), change the line

while ($line =~ /($word)/g){

to

while ($line =~ /($word('$word)?)/g){
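As a quick check of the apostrophe variant, here is a small sketch. It assumes a simplified ASCII-only `$word` and a made-up sample line, and it uses a non-capturing group `(?:...)` in place of the inner capturing parentheses, which does not change what lands in `$1`.

```perl
use strict;
use warnings;

my $word = qr/[a-zA-Z]+/;    # simplified ASCII-only stand-in for $word
my $line = "Don't panic, you'll see they're fine.";

my @found;
# A word, optionally followed by an apostrophe and another word.
while ( $line =~ /($word(?:'$word)?)/g ) {
    push @found, lc $1;
}
print join( ", ", @found ), "\n";
# prints: don't, panic, you'll, see, they're, fine
```

A leading or trailing apostrophe (as in quoted text) is not part of the match, since the apostrophe is only accepted between two runs of letters.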

Update 2: Sigh. I realized that my regex wouldn't work correctly with words containing non-ASCII characters either. (\b doesn't work for multi-byte characters.)

Should be:

my $word = qr/(?<!\p{Alpha})[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+(?!\p{Alpha})/;
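A small sketch of the lookaround version in use. It assumes the text has already been decoded into Perl characters: here `use utf8` covers the literals in the source, and for real input you would likely want something like `binmode STDIN, ':encoding(UTF-8)'`, depending on how your file is encoded. The sample sentence is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                               # the literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode output on the way out

my $word = qr/(?<!\p{Alpha})[a-zA-Z\x{131}ü\x{11F}\x{15F}çö\x{130}\x{15E}Ö\x{11E}Ü]+(?!\p{Alpha})/;

my $line = "Işık çok güzel, değil mi?";    # made-up sample text
my @found;
# Collect every run of allowed letters that is not adjacent to another letter.
push @found, $1 while $line =~ /($word)/g;

print join( " ", @found ), "\n";
```

Note that the character class as posted does not include an uppercase Ç, so a word beginning with that letter would not be matched whole.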

In reply to Re: How much can this text processing be optimized? by thundergnat
in thread How much can this text processing be optimized? by YAFZ
