Re^4: Regex Parsing Style

This is still a pared-down version of my actual script, but it more accurately represents what I'm really doing: counting characters.

use strict;
use warnings;

my %CONTROL_CODE = (
    '\t' => 0x09,
    '\n' => 0x0a,
    '\f' => 0x0c,
    '\r' => 0x0d,
);

my %character_count_by;

while (<>) {
    chomp;

    pos = 0;

    TOKEN:
    while (1) {
        # Literal character
        if (m/\G ([^\\]) /gcx) {
            $character_count_by{ord $1}++;
            next TOKEN;
        }

        # Universal Character Name
        if (m/\G \\u([0-9a-f]{4}) /gcx) {
            $character_count_by{hex $1}++;
            next TOKEN;
        }

        # Literal character escape sequence
        if (m/\G \\(["^\\]) /gcx) {
            $character_count_by{ord $1}++;
            next TOKEN;
        }

        # Control code escape sequence
        if (m/\G (\\[tnfr]) /gcx) {
            $character_count_by{$CONTROL_CODE{$1}}++;
            next TOKEN;
        }

        # End of string
        if (m/\G \z /gcx) {
            last TOKEN;
        }

        # Invalid character
        die "Invalid character on line $. of file $ARGV\n";
    }
}

for my $code (sort { $a <=> $b } keys %character_count_by) {
    printf "U+%04x\t%d\n", $code, $character_count_by{$code};
}

UPDATE: Changed \Z to \z and updated error message of event that can never happen.

Re^5: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 06:50 UTC
`/\Z/` should be `/\z/` (my fault), and you shouldn't have kept the `chr`. `/\\(["^\\])/` looks buggy. Are you sure those are the only symbols that can be escaped? If it's not buggy, you'll need to adjust the error message since no case will handle `\#`, for example.
Re^6: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 15:57 UTC
I removed `chr` very soon after posting. Also, I was wondering why you had used `\Z` instead of `\z`. Frankly, though I know they're considered the modern anchors to use, I still find `$` more immediately recognizable and less confusing than `\Z` and `\z`. The set of graphic characters that must be escaped is exactly `{ '"', '\', '^' }`. The caret is the oddball. I think it's a carryover from another, different context in which control codes can be specified as two characters; e.g., `^Z`. Though such control code sequences never occur in the text I'm lexing (they're represented instead as UCNs; e.g., `\u001a`), all literal carets in the text are nonetheless escaped (needlessly). Thanks again.
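For anyone following the anchor discussion, here is a minimal sketch of how the three differ (not from the original posts; the helper name `anchor_matches` and the sample strings are made up for illustration). Without `/m`, `$` and `\Z` both match before a trailing newline, while `\z` matches only at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: report which end-of-string
# anchors match directly after "abc" in the given string.
sub anchor_matches {
    my ($s) = @_;
    return +{
        '$'  => ($s =~ /abc$/  ? 1 : 0),
        '\Z' => ($s =~ /abc\Z/ ? 1 : 0),
        '\z' => ($s =~ /abc\z/ ? 1 : 0),
    };
}

# A string that still carries its trailing newline (i.e. not chomped).
my $with_newline = anchor_matches("abc\n");
for my $anchor ('$', '\Z', '\z') {
    printf "%-2s %s\n", $anchor,
        $with_newline->{$anchor} ? 'matches' : 'does not match';
}
```

On a chomped line with no remaining newline, all three behave the same, so the choice mostly matters when a newline can survive into the string being lexed.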
Re^7: Regex Parsing Style by ikegami (Patriarch) on Nov 26, 2010 at 17:15 UTC
> The set of graphic characters that must be escaped is exactly `{ '"', '\', '^' }`.

It's funny that you emphasised "must" because that's exactly the word that makes that sentence irrelevant. At issue is what set can be escaped. Either way, a fix is needed. The set must be expanded, or an error message needs to be added.
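One way the "add an error message" option might look, as a rough sketch rather than the poster's actual fix: the branches are the same as in the script above, but the fall-through reports the offending escape sequence. The sub name `count_line`, the message wording and the offset reporting are all assumptions made for illustration. It relies on the fact that, once every other branch has failed, the character at the current position can only be a backslash.

```perl
use strict;
use warnings;

my %CONTROL_CODE = ('\t' => 0x09, '\n' => 0x0a, '\f' => 0x0c, '\r' => 0x0d);

# Hypothetical helper (not in the original script): tally the code points
# of one line into the hash ref $count, dying with a specific message when
# a backslash starts a sequence that none of the branches recognizes.
sub count_line {
    my ($line, $count) = @_;
    pos($line) = 0;

    while (1) {
        if    ($line =~ m/\G ([^\\]) /gcx)          { $count->{ord $1}++ }
        elsif ($line =~ m/\G \\u([0-9a-f]{4}) /gcx) { $count->{hex $1}++ }
        elsif ($line =~ m/\G \\(["^\\]) /gcx)       { $count->{ord $1}++ }
        elsif ($line =~ m/\G (\\[tnfr]) /gcx)       { $count->{$CONTROL_CODE{$1}}++ }
        elsif ($line =~ m/\G \z /gcx)               { last }
        else {
            # /c keeps pos() intact after a failed match, so it still
            # points at the backslash that could not be lexed.
            my $offset = pos($line);
            my $bad    = substr($line, $offset, 2);
            die "Unrecognized escape sequence '$bad' at offset $offset\n";
        }
    }
    return $count;
}

# Usage: single quotes keep the backslash sequences literal.
my $count = count_line('a\tb', {});
printf "U+%04x\t%d\n", $_, $count->{$_} for sort { $a <=> $b } keys %$count;
```

Expanding the set instead would mean widening the character class in the third branch, which only makes sense if the input format really allows other characters to be escaped.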
Re^8: Regex Parsing Style by Jim (Curate) on Nov 26, 2010 at 17:53 UTC