Re^2: Vowel search
by AnomalousMonk (Archbishop) on Jun 11, 2014 at 21:52 UTC
if ($file =~ /[aeiou]{2}/i) { ... }
Another small point: case-insensitive matching (enabled by the /i regex modifier, which I see you snuck in there) imposes a run-time penalty that will become noticeable, at a wild guess, for files of more than several thousand lines. Maybe use
$file =~ m{ [AaEeIiOoUu]{2} }xms
to avoid this overhead. Of course, another approach would be to common-case all lines before matching...
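For what it's worth, the common-casing idea might look something like the sketch below. The sample lines in @lines are made up for illustration (a real script would read them from the file), and whether this is actually any faster than /i is untested here.

```perl
use strict;
use warnings;

# Hypothetical sample lines; in a real script these would come from the file.
my @lines = ('Aid bears out', 'XYZ', 'rhythm', 'EERIE sound');

# Lower-case each line once, then match against a plain lower-case class,
# so the regex needs neither /i nor a doubled-up character class.
my @hits = grep { lc($_) =~ /[aeiou]{2}/ } @lines;

print "$_\n" for @hits;
```

Note that lc() makes a copy of each line, so any saving in the regex engine has to outweigh that extra work.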
Well, if we're micro-optimizing, then either we want to do this…
[AaEeIiOoUu][AaEeIiOoUu]
…instead of this…
[AaEeIiOoUu]{2}
…or we want to rewrite the program in C.
c:\@Work\Perl\monks>perl -wMstrict -le
"use Benchmark qw(cmpthese);
;;
print 'Perl version: ', $];
;;
my $s = 'Aid bears out ';
$s = $s x 10_000_000;
print 'length: ', length $s;
;;
cmpthese(-1, {
'/i' => sub { $s =~ m{ (?i) [aeiou]{2} }xmsg },
'[Aa]{2}' => sub { $s =~ m{ [AaEeIiOoUu]{2} }xmsg },
'[Aa][Aa]' => sub { $s =~ m{ [AaEeIiOoUu][AaEeIiOoUu] }xmsg },
});
"
Perl version: 5.014004
length: 140000000
             Rate       /i  [Aa]{2} [Aa][Aa]
/i       3276565/s       --      -8%      -9%
[Aa]{2}  3558515/s       9%       --      -1%
[Aa][Aa] 3600879/s      10%       1%       --
The results are closer to what you suggest under ActiveState 5.8.9, but with /i still surprisingly high.
c:\@Work\Perl>perl -wMstrict -le
"(source code as above)
"
Perl version: 5.008009
length: 140000000
             Rate  [Aa]{2}       /i [Aa][Aa]
[Aa]{2}  3276565/s       --      -6%     -16%
/i       3480139/s       6%       --     -11%
[Aa][Aa] 3918166/s      20%      13%       --
Still, as you say, it's a bit of a micro-optimization.
I did add that in post, as it occurred to me that was an oversight the OP would likely make if a bread crumb were not left. Of course, the updated code does not include it, so I was unfortunately not obvious enough. If you're concerned with the match performance, it's probably more reasonable to use
$file =~ m{ [AaEeIiOoUu][aeiou] }x
since the simplified English the OP is likely attacking doesn't support two leading capital characters.
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
[effect of /i modifier on] match performance ...
I thought there might actually be an effect, but tye's benchmark clearly shows otherwise, at least for "recent" Perl versions. (It occurred to me there might be a detectable difference if the benchmark were run against a large array of relatively short strings rather than against one really long one, but I haven't put this to the test yet.)
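A rough sketch of how that untested variant might be set up, for anyone who wants to try it. The three sample lines and the repeat count are arbitrary assumptions standing in for the lines of a real file, and no results are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: a mix of short lines, two that contain a vowel pair
# and one that does not, repeated to stand in for a large file.
my @lines = ('Aid bears out', 'rhythm and hymns', 'THE QUICK BROWN FOX') x 50_000;

# Each sub counts the matching lines, so every string is examined afresh
# on every iteration (no /g, so there is no pos() state to worry about).
cmpthese(-1, {
    '/i'       => sub { my $n = grep { /[aeiou]{2}/i } @lines },
    '[Aa]{2}'  => sub { my $n = grep { /[AaEeIiOoUu]{2}/ } @lines },
    '[Aa][Aa]' => sub { my $n = grep { /[AaEeIiOoUu][AaEeIiOoUu]/ } @lines },
});
```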
... two leading capital characters.
I hadn't thought of that aspect of the problem. Not just simplified English, but is there any modern English that supports two capital initials? The only example I can think of off the top of my head is a diphthong, e.g., Æ, but my understanding is that a diphthong is really a single character and I have no idea how it would fit into a "vowel" categorization. If the ligation is broken apart as in "Aesop", the diphthong becomes two quite ordinary vowels and would not, as you point out, both be capitalized. Isn't language (or at least orthography) wonderful?
Re^2: Vowel search
by Jim (Curate) on Jun 11, 2014 at 21:47 UTC
Of course, this doesn't handle the conditional nature of y as a vowel.
In how many ordinary words is y one of a pair of two consecutive vowels?
I assume you have plans to write a machine learning script to train against a dictionary so it can develop heuristics for resolving the ambiguity.
I assume the novice Perl programmer with the PerlMonks username Noob@Perl (noob == newbie == neophyte) has no such plans.
roboticus@sparky:~$ grep -i -E 'y[aeiou]|[aeiou]y' /usr/share/dict/american-english | wc -l
3244
roboticus@sparky:~$ wc -l /usr/share/dict/american-english
99171 /usr/share/dict/american-english
...roboticus
When your only tool is a hammer, all problems look like your thumb.
How many words in English are ordinary? And how does one define a vowel? In the word thigh, the vowels are i, g, and h. That's certainly more than two consecutive vowel characters, though it's only one vowel sound.
You'll have to excuse a poor attempt at humor, attempting to illustrate how poorly constrained the spec is in actuality, and trying to highlight a distinct lack of effort on what strongly resembles a homework assignment.
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
I realize you were tacitly chastising the OP for what you thought was a poor post. I don't agree that it was all that bad. I think the OP is earnest and has demonstrated a genuine interest in learning Perl. He or she just seems daunted by the basics. And after all, the OP did let us know that he or she is a Perl novice by his or her choice of PerlMonks username.
I was tacitly chastising you right back for the glaring omission in your self-described "poor attempt at humor." You picked on the infrequent case of y's that are one of a pair of consecutive vowels, but you completely missed the case of all vowels with diacritical marks. What about them? What about the possibility of input text in different character encodings, both "legacy" and Unicode? What about Unicode combining characters and Unicode normalization forms? There's much more to say about the definition of "vowel" as any code point that matches the trivial regular expression pattern [AaEeIiOoUu] than what you wrote tauntingly about it in your reply to the OP.
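To make the diacritics point concrete, here is a minimal sketch of one way to go about it: decompose the text with NFD from Unicode::Normalize, then treat a vowel as a base letter followed by any combining marks. The helper name has_vowel_pair and the sample words are made up for illustration. The sketch assumes the input has already been decoded into Perl character strings, and it does nothing about ligatures such as Æ or about y.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# A base vowel optionally followed by combining marks (accent, diaeresis, ...).
my $vowel = qr/[aeiou]\p{M}*/i;

# Hypothetical helper: true if the text contains two consecutive vowels
# once accented letters are decomposed into base letter + combining marks.
sub has_vowel_pair {
    my ($text) = @_;
    return NFD($text) =~ /$vowel$vowel/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';

# \x{EF} is i with diaeresis, \x{F6} is o with diaeresis, \x{E9} is e with acute.
for my $word ("na\x{EF}ve", "co\x{F6}perate", "caf\x{E9}", 'rhythm') {
    printf "%s: %s\n", $word, has_vowel_pair($word) ? 'pair' : 'no pair';
}
```

Matching against [AaEeIiOoUu] alone would miss the first two words, since the precomposed accented letters are not in that class.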
Re^2: Vowel search
by Noob@Perl (Novice) on Jun 11, 2014 at 22:52 UTC
thanks for the references, I shall begin reading this too along with what I already have. :)