Re^2: Regex Parsing Style

Re^3: Regex Parsing Style by aquarium (Curate) on Nov 26, 2010 at 03:12 UTC
that's a heuristic you need to work out. if the alternatives are not exclusive, then the leftmost match will always match first. if it was me, i'd run each regex separately and additively collect flags for any matches. then at the end of this parsing you can decide exactly what you want to happen based on combinations of flags, possibly in a switch construct.

the hardest line to type correctly is: stty erase ^H
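A minimal sketch of the "collect flags, then decide" idea described above. The pattern names, the patterns themselves, and the `collect_flags` and `classify` helpers are hypothetical, chosen only for illustration; they are not taken from anyone's actual script.

```perl
use strict;
use warnings;

# Hypothetical, deliberately non-exclusive patterns (illustration only).
my %PATTERN = (
    has_escape => qr/\\/,
    has_digit  => qr/[0-9]/,
);

# Run every pattern independently and record which ones matched.
sub collect_flags {
    my ($input) = @_;
    my %flag;
    for my $name (keys %PATTERN) {
        $flag{$name} = 1 if $input =~ $PATTERN{$name};
    }
    return \%flag;
}

# Decide what to do only after all flags are known.
sub classify {
    my ($input) = @_;
    my $flag = collect_flags($input);
    return 'ambiguous' if $flag->{has_escape} && $flag->{has_digit};
    return 'escaped'   if $flag->{has_escape};
    return 'numeric'   if $flag->{has_digit};
    return 'plain';
}

# Single-quoted strings, so '\\' is one literal backslash.
print "$_ => ", classify($_), "\n" for 'abc', 'a1', 'a\\n', '\\u0041';
```

The trade-off is that every pattern scans the whole input, so this suits classification of a record more than position-by-position lexing.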
Re^4: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 05:13 UTC
"that's a heuristic you need to work out."

Actually, there was a bug, which ikegami quickly fixed. In the original version of his lexing code, the first two alternative patterns matched every possible valid, non-empty string, making the remaining two alternative patterns unreachable.

"if the alternatives are not exclusive, then the leftmost match will always match first."

I explained that the options are mutually exclusive in my original post. It's important that each alternative pattern matches one and only one class of token.
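To illustrate the kind of bug described above (this is not ikegami's original code, which isn't shown here), the following sketch compares two hypothetical alternations. In `$shadowed` the first alternative matches any character, so the second is unreachable; in `$exclusive` the alternatives cannot both match at the same position. The helper `token_classes` is an assumption made for the example.

```perl
use strict;
use warnings;

# Buggy ordering: '.' also matches a backslash, so 'control' never wins.
my $shadowed  = qr/\G (?: (?<literal> . )     | (?<control> \\[tnfr] ) )/x;

# Exclusive alternatives: a literal may not be a backslash.
my $exclusive = qr/\G (?: (?<literal> [^\\] ) | (?<control> \\[tnfr] ) )/x;

# Lex $text with $re and return the class of each token, comma-separated.
sub token_classes {
    my ($re, $text) = @_;
    my @classes;
    pos($text) = 0;
    while ($text =~ /$re/gc) {
        push @classes, defined $+{literal} ? 'literal' : 'control';
    }
    return join ',', @classes;
}

my $input = 'a\tb';   # single quotes: a, backslash, t, b
print token_classes($shadowed,  $input), "\n";   # literal,literal,literal,literal
print token_classes($exclusive, $input), "\n";   # literal,control,literal
```

Named captures require Perl 5.10 or later.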
Re^5: Regex Parsing Style by aquarium (Curate) on Nov 28, 2010 at 23:44 UTC
i believe you that the options are mutually exclusive, but i never program as such. it's difficult to guarantee that non-trivial regexes will indeed match exclusively on all input data. hence i would either pre-run all regexes (or do other such programming) to eliminate non-exclusivity, OR allow all regexes to match against the input and make normal logic decisions plus sane decisions on possible anomalies. that's the kind of defensive programming i'd do if time allows. it's always a balancing act in handling program input, but i think a little scepticism in programming itself is a good thing. hence my advice, even though it's not in line with the spec. take it or leave it as you please.
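One way to act on that scepticism, sketched under assumptions: try every token pattern at each offset of some sample input and report any offset where more than one matches. The helpers `matches_at` and `check_exclusive` are hypothetical, and the pattern lists are illustrative. A clean result on sample data is evidence of exclusivity, not proof.

```perl
use strict;
use warnings;

# Return the names of all patterns that match at $offset in $text.
# $patterns is an array ref of [ name => qr/\G .../ ] pairs.
sub matches_at {
    my ($patterns, $text, $offset) = @_;
    my @hits;
    for my $pair (@$patterns) {
        my ($name, $re) = @$pair;
        pos($text) = $offset;          # \G anchors at this offset
        push @hits, $name if $text =~ /$re/gc;
    }
    return @hits;
}

# Report every offset where two or more patterns match.
sub check_exclusive {
    my ($patterns, $text) = @_;
    my @problems;
    for my $offset (0 .. length($text) - 1) {
        my @hits = matches_at($patterns, $text, $offset);
        push @problems, "offset $offset: " . join('+', @hits) if @hits > 1;
    }
    return @problems;
}

# Deliberately overlapping set, to show what a report looks like.
my @overlapping = (
    [ literal => qr/\G ./x ],
    [ control => qr/\G \\[tnfr]/x ],
);

print "$_\n" for check_exclusive(\@overlapping, 'a\t');
```

Note that this checks every offset, including ones that fall in the middle of a multi-character token, so it is stricter than what a lexer actually needs.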
Re^3: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 00:39 UTC
Bug. Fixed. Thanks.
Re^4: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 06:39 UTC
This is still a pared-down version of my actual script, but it more accurately represents what I'm really doing: counting characters.

    use strict;
    use warnings;

    # Maps each control code escape sequence (backslash plus letter) to its code point
    my %CONTROL_CODE = (
        '\t' => 0x09,
        '\n' => 0x0a,
        '\f' => 0x0c,
        '\r' => 0x0d,
    );

    my %character_count_by;

    while (<>) {
        chomp;
        pos($_) = 0;

        TOKEN:
        while (1) {
            # Literal character
            if (m/\G ([^\\]) /gcx) {
                $character_count_by{ord $1}++;
                next TOKEN;
            }

            # Universal Character Name
            if (m/\G \\u([0-9a-f]{4}) /gcx) {
                $character_count_by{hex $1}++;
                next TOKEN;
            }

            # Literal character escape sequence
            if (m/\G \\(["^\\]) /gcx) {
                $character_count_by{ord $1}++;
                next TOKEN;
            }

            # Control code escape sequence
            if (m/\G (\\[tnfr]) /gcx) {
                $character_count_by{$CONTROL_CODE{$1}}++;
                next TOKEN;
            }

            # End of string
            if (m/\G \z /gcx) {
                last TOKEN;
            }

            # Invalid character
            die "Invalid character on line $. of file $ARGV\n";
        }
    }

    for my $code (sort { $a <=> $b } keys %character_count_by) {
        printf "U+%04x\t%d\n", $code, $character_count_by{$code};
    }

UPDATE: Changed `\Z` to `\z` and updated the error message for an event that can never happen.
Re^5: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 06:50 UTC
`\Z` should be `\z` (my fault), and you shouldn't have kept the `chr`.

`\\(["^\\])` looks buggy. Are you sure those are the only symbols that can be escaped? If it's not buggy, you'll need to adjust the error message, since no case will handle `\#`, for example.
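For readers unsure why the `\Z` versus `\z` distinction matters in an end-of-string test: `\Z` also matches just before a trailing newline, while `\z` matches only at the very end. A small sketch, where the helper `at_end` is hypothetical and the input is deliberately left un-chomped:

```perl
use strict;
use warnings;

# Return 1 if $anchor matches at $offset in $text, else 0.
sub at_end {
    my ($text, $offset, $anchor) = @_;
    pos($text) = $offset;
    return $text =~ /$anchor/gc ? 1 : 0;
}

my $upper_Z = qr/\G\Z/;
my $lower_z = qr/\G\z/;

my $line = "abc\n";   # trailing newline still present

print at_end($line, 3, $upper_Z), "\n";   # 1: \Z matches before the final newline
print at_end($line, 3, $lower_z), "\n";   # 0: \z requires the true end of string
print at_end($line, 4, $upper_Z), "\n";   # 1
print at_end($line, 4, $lower_z), "\n";   # 1
```

In a lexer loop, a `\Z` end test could therefore stop one character early and silently skip a newline that `chomp` did not remove.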
Re^6: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 15:57 UTC
Re^7: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 17:15 UTC
Some notes below your chosen depth have not been shown here