in reply to Vowel search

Please read How do I post a question effectively?. In particular, the code you've posted does not compile; if the compilation is your error, please make sure to include the error message in your post. The compilation error is related to a number of omitted curly brackets.

If I make assumptions about where to stick the curly brackets based upon your indentation, you have made an odd choice for reading your lines: you will actually scan lines multiple times. If you want to take in all the data at the start (wholly unnecessary here), it could be done much more cleanly as:

my @lines = <FH>;

Second, your choice to name your variable containing your current line $file is odd at best, and conflicts with the file name variable, which could easily lead to confusion.
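
For comparison, here is a minimal sketch of reading one line at a time, so that each line is scanned exactly once and the loop variable has a name that cannot be mistaken for the file name. The in-memory sample data and the names `$data`, `$fh`, `$line` and `$count` are assumptions made for illustration; they are not taken from the OP's code.

```perl
use strict;
use warnings;

# Hypothetical in-memory sample standing in for the OP's file.
my $data = "first line\nsecond line\nthird line\n";
open my $fh, '<', \$data or die "Cannot open sample data: $!";

my $count = 0;
while ( my $line = <$fh> ) {    # each line is read exactly once
    chomp $line;
    $count++;
    print "$count: $line\n";
}
close $fh;
```

With a real file, the `open` call would take the file name instead of a reference to a string; the loop itself stays the same.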

Lastly, your use of regular expressions is needlessly computationally intensive, and doesn't actually require that the vowels be adjacent. A read-through of perlretut would likely be enlightening. You probably mean something closer to

if ($file =~ /[aeiou]{2}/i) { print $file; }
Of course, this doesn't handle the conditional nature of y as a vowel. I assume you have plans to write a machine learning script to train against a dictionary so it can develop heuristics for resolving the ambiguity. Vowel should provide a sufficiently thorough background.
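
To illustrate the adjacency point, the sketch below contrasts a pattern that accepts two vowels anywhere in the string with one that requires them side by side. The `$anywhere` pattern is a hypothetical stand-in, since the OP's original regex is not reproduced here, and the word list is made up.

```perl
use strict;
use warnings;

# Hypothetical "two vowels somewhere" pattern versus the adjacent-pair pattern.
my $anywhere = qr/[aeiou].*[aeiou]/i;    # vowels need not touch
my $adjacent = qr/[aeiou]{2}/i;          # vowels must be side by side

for my $word (qw(banana bread Aesop rhythm)) {
    printf "%-7s anywhere=%d adjacent=%d\n",
        $word,
        ( $word =~ $anywhere ? 1 : 0 ),
        ( $word =~ $adjacent ? 1 : 0 );
}
```

Here "banana" satisfies only the first pattern, which is the kind of false positive the adjacent form avoids.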

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: Vowel search
by AnomalousMonk (Archbishop) on Jun 11, 2014 at 21:52 UTC
    if ($file =~ /[aeiou]{2}/i) { ... }

    Another small point: case-insensitive matching (enabled by the /i regex modifier, which I see you snuck in there) imposes a run-time penalty which will become noticeable, at a wild guess, for files of more than several thousand lines. Maybe use
        $file =~ m{ [AaEeIiOoUu]{2} }xms
    to avoid this overhead. Of course, another approach would be to common-case all lines before matching...
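
The common-casing idea could be sketched roughly as follows. The helper name `has_vowel_pair` is hypothetical, and whether this is actually faster than /i is not something this sketch measures.

```perl
use strict;
use warnings;

# True if the line has two adjacent vowels, ignoring case.
# The line is lowercased once, so the pattern needs neither /i
# nor doubled-up character classes.
sub has_vowel_pair {
    my ($line) = @_;
    return lc($line) =~ /[aeiou]{2}/ ? 1 : 0;
}

print has_vowel_pair('AEsop'), "\n";
print has_vowel_pair('Sky'),   "\n";
```

Note that `lc` returns a copy, so the original line is still available unchanged for printing.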

      Well, if we're micro-optimizing, then either we want to do this…

      [AaEeIiOoUu][AaEeIiOoUu]

      …instead of this…

      [AaEeIiOoUu]{2}

      …or we want to rewrite the program in C.

        There doesn't seem to be a significant difference between the  [AaEeIiOoUu]{2} and  [AaEeIiOoUu][AaEeIiOoUu] variations under Strawberry 5.14.4, but I was a bit surprised that there's so little improvement over the  /i version.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Benchmark qw(cmpthese);
         ;;
         print 'Perl version: ', $];
         ;;
         my $s = 'Aid bears out ';
         $s = $s x 10_000_000;
         print 'length: ', length $s;
         ;;
         cmpthese(-1, {
           '/i'       => sub { $s =~ m{ (?i) [aeiou]{2} }xmsg },
           '[Aa]{2}'  => sub { $s =~ m{ [AaEeIiOoUu]{2} }xmsg },
           '[Aa][Aa]' => sub { $s =~ m{ [AaEeIiOoUu][AaEeIiOoUu] }xmsg },
           });
        "
        Perl version: 5.014004
        length: 140000000
                      Rate       /i  [Aa]{2} [Aa][Aa]
        /i       3276565/s       --      -8%      -9%
        [Aa]{2}  3558515/s       9%       --      -1%
        [Aa][Aa] 3600879/s      10%       1%       --

        The results are closer to what you suggest under ActiveState 5.8.9, but with  /i still surprisingly high.

        c:\@Work\Perl>perl -wMstrict -le
        "(source code as above)
        "
        Perl version: 5.008009
        length: 140000000
                      Rate  [Aa]{2}       /i [Aa][Aa]
        [Aa]{2}  3276565/s       --      -6%     -16%
        /i       3480139/s       6%       --     -11%
        [Aa][Aa] 3918166/s      20%      13%       --

        Still, as you say, it's a bit of a micro-optimization.

      I did add that in post, as it occurred to me that was an oversight the OP would likely make if a bread crumb were not left. Of course, the updated code does not include it, so I was unfortunately not obvious enough. If you're concerned with the match performance, it's probably more reasonable to use
      $file =~ m{ [AaEeIiOoUu][aeiou] }x
      since the simplified English the OP is likely attacking doesn't support two leading capital characters.
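
A small sketch of what that pattern accepts and rejects, using a made-up word list. The trade-off is that text in all capitals no longer matches, which may or may not matter for the OP's input.

```perl
use strict;
use warnings;

# First vowel may be either case; second must be lowercase.
my $pair = qr{ [AaEeIiOoUu][aeiou] }x;

for my $word (qw(Aid bears AID sky)) {
    printf "%-5s %s\n", $word, ( $word =~ $pair ? 'match' : 'no match' );
}
```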

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        [effect of /i modifier on] match performance ...

        I thought there might actually be an effect, but tye's benchmark clearly shows otherwise, at least for "recent" Perl versions. (It occurred to me there might be a detectable difference if the benchmark were run against a large array of relatively short strings rather than against one really long one, but I haven't put this to the test yet.)
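
One way the untested idea above might be tried is sketched below: the same three pattern variants, run against a large array of short strings instead of one long string. The data set, the helper `count_matches` and the hash names are all assumptions for illustration, and no timing results are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data set: many short lines rather than one long string.
my @lines = ( 'Aid bears out', 'rhythm', 'Sky', 'bread' ) x 25_000;

# Count the lines containing two adjacent vowels under the given pattern.
sub count_matches {
    my ($re) = @_;
    return scalar grep { $_ =~ $re } @lines;
}

my %pattern = (
    '/i'       => qr/[aeiou]{2}/i,
    '[Aa]{2}'  => qr/[AaEeIiOoUu]{2}/,
    '[Aa][Aa]' => qr/[AaEeIiOoUu][AaEeIiOoUu]/,
);

# Sanity check: every variant should report the same count.
print "$_: ", count_matches( $pattern{$_} ), "\n" for sort keys %pattern;

# Build one closure per variant, then run each for at least one CPU second.
my %variant;
for my $name ( keys %pattern ) {
    my $re = $pattern{$name};
    $variant{$name} = sub { count_matches($re) };
}
cmpthese( -1, \%variant );
```

Matching through a `qr//` object adds a little overhead of its own, so the absolute rates are not comparable with the one-liner benchmarks above; only the relative ordering within a single run is meaningful.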

        ... two leading capital characters.

        I hadn't thought of that aspect of the problem. Not just simplified English, but is there any modern English that supports two capital initials? The only example I can think of off the top of my head is a diphthong, e.g., Æ, but my understanding is that a diphthong is really a single character and I have no idea how it would fit into a "vowel" categorization. If the ligation is broken apart as in "Aesop", the diphthong becomes two quite ordinary vowels and would not, as you point out, both be capitalized. Isn't language (or at least orthography) wonderful?

Re^2: Vowel search
by Jim (Curate) on Jun 11, 2014 at 21:47 UTC
    Of course, this doesn't handle the conditional nature of y as a vowel.

    In how many ordinary words is y one of a pair of two consecutive vowels?

    I assume you have plans to write a machine learning script to train against a dictionary so it can develop heuristics for resolving the ambiguity.

    I assume the novice Perl programmer with the PerlMonks username Noob@Perl (noob == newbie == neophyte) has no such plans.

      Jim:

      Hey, guy, today a bit of playing with my grey matter suggests they may be fairly common.... ;^D

      roboticus@sparky:~$ grep -i -E 'y[aeiou]|[aeiou]y' /usr/share/dict/american-english | wc -l
      3244
      roboticus@sparky:~$ wc -l /usr/share/dict/american-english
      99171 /usr/share/dict/american-english

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        Very few of those 3,244 words have a y in them that is one of a pair of vowels. You've included all the consonant y's and silent y's in your count.

      How many words in English are ordinary? And how does one define a vowel? In the word thigh, the vowels are i, g, and h. That's certainly more than two consecutive vowel characters, though it's only one vowel sound.

      You'll have to excuse a poor attempt at humor, attempting to illustrate how poorly constrained the spec is in actuality, and trying to highlight a distinct lack of effort on what strongly resembles a homework assignment.


      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        I realize you were tacitly chastising the OP for what you thought was a poor post. I don't agree that it was all that bad. I think the OP is earnest and has demonstrated a genuine interest in learning Perl. He or she just seems daunted by the basics. And after all, the OP did let us know that he or she is a Perl novice by his or her choice of PerlMonks username.

        I was tacitly chastising you right back for the glaring omission in your self-described "poor attempt at humor." You picked on the infrequent case of y's that are one of a pair of consecutive vowels, but you completely missed the case of all vowels with diacritical marks. What about them? What about the possibility of input text in different character encodings, both "legacy" and Unicode? What about Unicode combining characters and Unicode normalization forms? There's much more to say about the definition of "vowel" as any code point that matches the trivial regular expression pattern [AaEeIiOoUu] than what you wrote tauntingly about it in your reply to the OP.
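
As a rough sketch of one of those concerns, the code below decomposes the text with NFD from the core Unicode::Normalize module, then treats a base letter a/e/i/o/u followed by any number of combining marks as a single vowel. This is only one possible definition, chosen for illustration: it assumes the input has already been decoded into Perl character strings, and it ignores y, ligatures such as Æ, and non-Latin scripts. The helper name `has_vowel_pair` is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# True if the string has two adjacent vowels, where a vowel is a base
# letter a/e/i/o/u optionally followed by combining marks.
sub has_vowel_pair {
    my ($text) = @_;
    my $decomposed = NFD($text);    # precomposed letters become base + marks
    return $decomposed =~ /(?:[aeiou]\p{Mn}*){2}/i ? 1 : 0;
}

# "Vi\x{1EC7}t" is "Viet" with U+1EC7 (e with circumflex and dot below).
print has_vowel_pair("Vi\x{1EC7}t"), "\n";
# "caf\x{E9}" is "cafe" with U+00E9 (e with acute accent): no adjacent vowels.
print has_vowel_pair("caf\x{E9}"), "\n";
```

Input that is already decomposed (for example "i" followed by U+0308) is handled the same way, because NFD leaves it unchanged.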

Re^2: Vowel search
by Noob@Perl (Novice) on Jun 11, 2014 at 22:52 UTC

    thanks for the references, I shall begin reading this too along with what I already have. :)