in reply to Re: Vowel search
in thread Vowel search

if ($file =~ /[aeiou]{2}/i) { ... }

Another small point: case-insensitive matching (enabled by the /i regex modifier, which I see you snuck in there) imposes a run-time penalty which will become noticeable, at a wild guess, for files of more than several thousand lines. Maybe use
    $file =~ m{ [AaEeIiOoUu]{2} }xms
to avoid this overhead. Of course, another approach would be to common-case all lines before matching...
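A minimal sketch of the common-casing approach, for illustration only. The helper name has_vowel_pair and the sample lines are made up here and are not from the OP's code:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains two adjacent vowels.
# The line is lower-cased once, so the pattern needs neither /i nor
# doubled-up character classes.
sub has_vowel_pair {
    my ($line) = @_;
    return lc($line) =~ m{ [aeiou]{2} }x ? 1 : 0;
}

# Made-up sample input.
my @lines = ('Aid bears out', 'EArly', 'rhythm', 'Try this');
my @hits  = grep { has_vowel_pair($_) } @lines;

print "$_\n" for @hits;
```

Note that lc() makes a copy of every line, so this trades regex-level case folding for string copying; whether that is a win is something to measure rather than assume.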

Re^3: Vowel search
by Jim (Curate) on Jun 11, 2014 at 22:04 UTC

    Well, if we're micro-optimizing, then either we want to do this…

    [AaEeIiOoUu][AaEeIiOoUu]

    …instead of this…

    [AaEeIiOoUu]{2}

    …or we want to rewrite the program in C.

      There doesn't seem to be a significant difference between the [AaEeIiOoUu]{2} and [AaEeIiOoUu][AaEeIiOoUu] variations under Strawberry 5.14.4, but I was a bit surprised that there's so little improvement over the /i version.

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Benchmark qw(cmpthese);
       ;;
       print 'Perl version: ', $];
       ;;
       my $s = 'Aid bears out ';
       $s = $s x 10_000_000;
       print 'length: ', length $s;
       ;;
       cmpthese(-1, {
         '/i'       => sub { $s =~ m{ (?i) [aeiou]{2} }xmsg },
         '[Aa]{2}'  => sub { $s =~ m{ [AaEeIiOoUu]{2} }xmsg },
         '[Aa][Aa]' => sub { $s =~ m{ [AaEeIiOoUu][AaEeIiOoUu] }xmsg },
         });
      "
      Perl version: 5.014004
      length: 140000000
                    Rate       /i  [Aa]{2} [Aa][Aa]
      /i       3276565/s       --      -8%      -9%
      [Aa]{2}  3558515/s       9%       --      -1%
      [Aa][Aa] 3600879/s      10%       1%       --

      The results are closer to what you suggest under ActiveState 5.8.9, but with /i still surprisingly high.

      c:\@Work\Perl>perl -wMstrict -le
      "(source code as above)
      "
      Perl version: 5.008009
      length: 140000000
                    Rate  [Aa]{2}       /i [Aa][Aa]
      [Aa]{2}  3276565/s       --      -6%     -16%
      /i       3480139/s       6%       --     -11%
      [Aa][Aa] 3918166/s      20%      13%       --

      Still, as you say, it's a bit of a micro-optimization.

        It is too bad that people posting "benchmark" code so rarely take the care to validate that the numbers that they are posting actually have any meaning behind them. For example, when you get a "Rate" like "3918166/s", then you should probably just throw away your benchmark results as having no practical meaning.

        In this particular case, this was a hint that your $s = $s x 10_000_000; line was completely pointless because of other aspects of your code. Fixing that problem (and reducing the string size significantly so it will finish before I need to go to bed), gives:

        Perl version: v5.12.0
        Length of string: 140000
                   Rate  [Aa]{2}       lc [Aa][Aa]       /i
        [Aa]{2}  50.7/s       --      -0%      -1%      -3%
        lc       50.8/s       0%       --      -1%      -3%
        [Aa][Aa] 51.5/s       1%       1%       --      -2%
        /i       52.5/s       4%       3%       2%       --

        which shows how the performance differences are minimal (and the actual difference you'll see in the performance of a real script will be even smaller than that).

        So the "which will become noticeable" assertion was wrong.
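        The corrected benchmark code was not posted, so what follows is only a sketch of one way such a fix might look. It rests on an assumption about the cause: a //g match that is not in list context finds just one match per call and resumes from pos($s) on the next call, so the length of the string barely matters. Forcing list context makes each iteration scan the whole string. The %tests and %counts names are illustrative.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# 14 characters x 10_000 = 140_000 characters, the reduced size quoted above.
my $s = 'Aid bears out ' x 10_000;

# "my $n = () = ..." puts the //g match in list context, so every call
# scans the entire string and $n receives the number of matches.
my %tests = (
    '/i'       => sub { my $n = () = $s =~ m{ [aeiou]{2} }xmsgi; $n },
    '[Aa]{2}'  => sub { my $n = () = $s =~ m{ [AaEeIiOoUu]{2} }xmsg; $n },
    '[Aa][Aa]' => sub { my $n = () = $s =~ m{ [AaEeIiOoUu][AaEeIiOoUu] }xmsg; $n },
    'lc'       => sub { my $n = () = lc($s) =~ m{ [aeiou][aeiou] }xmsg; $n },
);

# Sanity check before timing: every variant should report the same count.
my %counts = map { $_ => $tests{$_}->() } keys %tests;
print "$_: $counts{$_} matches\n" for sort keys %counts;

cmpthese(-1, \%tests);
```

        Checking that all variants agree on the match count before timing them is a cheap way to confirm that the benchmark is measuring the work it claims to measure.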

        Update: Hear, hear! Learning++, (Good attitude)++.

        - tye        

        Of course, another approach would be to common-case all lines before matching...

        Adding this other approach to your benchmark test…

        C:\>type 634253.pl
        #!perl
        use strict;
        use warnings;
        use Benchmark qw( cmpthese );
        use English qw( -no_match_vars );

        local $OUTPUT_RECORD_SEPARATOR = "\n";

        print 'Perl version: ', $PERL_VERSION;

        my $s = 'Aid bears out ' x 10_000_000;

        print 'Length of string: ', length $s;

        cmpthese(-1, {
            'lc'       => sub { lc($s) =~ m{ [aeiou][aeiou] }xmsg },
            '[Aa][Aa]' => sub { $s =~ m{ [AaEeIiOoUu][AaEeIiOoUu] }xmsg },
            '[Aa]{2}'  => sub { $s =~ m{ [AaEeIiOoUu]{2} }xmsg },
            '/i'       => sub { $s =~ m{ (?i) [aeiou]{2} }xmsg },
        });

        exit 0;

        C:\>perl 634253.pl
        Perl version: v5.16.2
        Length of string: 140000000
                      Rate        lc  [Aa][Aa]        /i   [Aa]{2}
        lc          5.20/s        --     -100%     -100%     -100%
        [Aa][Aa] 2915419/s 56073120%        --      -23%      -25%
        /i       3764119/s 72396452%       29%        --       -3%
        [Aa]{2}  3877869/s 74584244%       33%        3%        --

        C:\>perl 634253.pl
        Perl version: v5.16.2
        Length of string: 140000000
                      Rate        lc   [Aa]{2}        /i  [Aa][Aa]
        lc          4.94/s        --     -100%     -100%     -100%
        [Aa]{2}  2814401/s 57019656%        --       -8%      -25%
        /i       3044909/s 61689750%        8%        --      -19%
        [Aa][Aa] 3762832/s 76234868%       34%       24%        --

        C:\>
Re^3: Vowel search
by kennethk (Abbot) on Jun 12, 2014 at 15:24 UTC
    I did add that in post, as it occurred to me that was an oversight the OP would likely make if a bread crumb were not left. Of course, the updated code does not include it, so I was unfortunately not obvious enough. If you're concerned with the match performance, it's probably more reasonable to use
    $file =~ m{ [AaEeIiOoUu][aeiou] }x
    since the simplified English the OP is likely attacking doesn't support two leading capital characters.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      [effect of /i modifier on] match performance ...

      I thought there might actually be an effect, but tye's benchmark clearly shows otherwise, at least for "recent" Perl versions. (It occurred to me there might be a detectable difference if the benchmark were run against a large array of relatively short strings rather than against one really long one, but I haven't put this to the test yet.)
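      For anyone who wants to try the array-of-short-strings idea, a rough sketch follows. The sample words and the %tests name are invented for illustration, and no timings are claimed here:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Made-up data: 100_000 short strings rather than one very long one.
my @words = ('Aid bears out', 'rhythm', 'Try this', 'bOOk');
my @lines = (@words) x 25_000;

# grep in scalar context returns how many lines matched.
my %tests = (
    '/i'      => sub { my $n = grep { m{ [aeiou]{2} }xi } @lines; $n },
    '[Aa]{2}' => sub { my $n = grep { m{ [AaEeIiOoUu]{2} }x } @lines; $n },
    'lc'      => sub { my $n = grep { lc($_) =~ m{ [aeiou]{2} }x } @lines; $n },
);

# Each variant should agree on the count before the timings mean anything.
print "$_: ", $tests{$_}->(), " matching lines\n" for sort keys %tests;

cmpthese(-1, \%tests);
```

      Each iteration here pays per-string overhead (starting a new match, and a copy in the lc case), which is the cost a single long string would hide.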

      ... two leading capital characters.

      I hadn't thought of that aspect of the problem. Not just simplified English, but is there any modern English that supports two capital initials? The only example I can think of off the top of my head is a diphthong, e.g., Æ, but my understanding is that a diphthong is really a single character and I have no idea how it would fit into a "vowel" categorization. If the ligation is broken apart as in "Aesop", the diphthong becomes two quite ordinary vowels and would not, as you point out, both be capitalized. Isn't language (or at least orthography) wonderful?