Re: Regex Parsing Style

Replies are listed 'Best First'.
Re^2: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 00:35 UTC
Thank you, ikegami. I think the ordering of alternatives is off. The third and fourth alternatives can never be reached after the first and second alternatives. Or am I missing something?	[reply]
Re^3: Regex Parsing Style by aquarium (Curate) on Nov 26, 2010 at 03:12 UTC
that's a heuristic you need to work out. if the alternatives are not exclusive, then the leftest match will always match first. if it was me, i'd run each regex separately and additively collect flags for any matches. then at the end of this parsing you can decide exactly what you want to happen based on combinations of flags..possibly in a switch construct. the hardest line to type correctly is: stty erase ^H	[reply]
Re^4: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 05:13 UTC
that's a heuristic you need to work out. Actually, there was a bug, which ikegami quickly fixed. In the original version of his lexing code, the first two alternative patterns matched every possible valid, non-empty string, making the remaining two alternative patterns unreachable. if the alternatives are not exclusive, then the leftest match will always match first. I explained the options are mutually exclusive in my original post. It's important that each alternative pattern matches one and only one class of token.	[reply]
Re^5: Regex Parsing Style by aquarium (Curate) on Nov 28, 2010 at 23:44 UTC
Re^3: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 00:39 UTC
Bug. Fixed. Thanks.	[reply]
Re^4: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 06:39 UTC
This is still a paired-down version of my actual script, but it more accurately represents what I'm really doing: counting characters. use strict; use warnings; my %CONTROL_CODE = ( '\t' => 0x09, '\n' => 0x0a, '\f' => 0x0c, '\r' => 0x0d, ); my %character_count_by; while (<>) { chomp; pos = 0; TOKEN: while (1) { # Literal character if (m/\G ([^\\]) /gcx) { $character_count_by{ord $1}++; next TOKEN; } # Universal Character Name if (m/\G \\u([0-9a-f]{4}) /gcx) { $character_count_by{hex $1}++; next TOKEN; } # Literal character escape sequence if (m/\G \\(["^\\]) /gcx) { $character_count_by{ord $1}++; next TOKEN; } # Control code escape sequence if (m/\G (\\[tnfr]) /gcx) { $character_count_by{$CONTROL_CODE{$1}}++; next TOKEN; } # End of string if (m/\G \z /gcx) { last TOKEN; } # Invalid character die "Invalid character on line $. of file $ARGV\n"; } } for my $code (sort { $a <=> $b } keys %character_count_by) { printf "U+%04x\t%d\n", $code, $character_count_by{$code}; } [download] UPDATE: Changed `\Z` to `\z` and updated error message of event that can never happen.	[reply] [d/l] [select]
Re^5: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 06:50 UTC
Re^6: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 15:57 UTC
Some notes below your chosen depth have not been shown here