in reply to Re^2: Regex Parsing Style
in thread Regex Parsing Style

Bug. Fixed. Thanks.

Re^4: Regex Parsing Style
by Jim (Curate) on Nov 26, 2010 at 06:39 UTC

    This is still a pared-down version of my actual script, but it more accurately represents what I'm really doing: counting characters.

    use strict;
    use warnings;

    my %CONTROL_CODE = (
        '\t' => 0x09,
        '\n' => 0x0a,
        '\f' => 0x0c,
        '\r' => 0x0d,
    );

    my %character_count_by;

    while (<>) {
        chomp;
        pos = 0;

        TOKEN:
        while (1) {
            # Literal character
            if (m/\G ([^\\]) /gcx) {
                $character_count_by{ord $1}++;
                next TOKEN;
            }

            # Universal Character Name
            if (m/\G \\u([0-9a-f]{4}) /gcx) {
                $character_count_by{hex $1}++;
                next TOKEN;
            }

            # Literal character escape sequence
            if (m/\G \\(["^\\]) /gcx) {
                $character_count_by{ord $1}++;
                next TOKEN;
            }

            # Control code escape sequence
            if (m/\G (\\[tnfr]) /gcx) {
                $character_count_by{$CONTROL_CODE{$1}}++;
                next TOKEN;
            }

            # End of string
            if (m/\G \z /gcx) {
                last TOKEN;
            }

            # Invalid character
            die "Invalid character on line $. of file $ARGV\n";
        }
    }

    for my $code (sort { $a <=> $b } keys %character_count_by) {
        printf "U+%04x\t%d\n", $code, $character_count_by{$code};
    }

    UPDATE: Changed \Z to \z and updated the error message for an event that can never happen.

      /\Z/ should be /\z/ (my fault), and you shouldn't have kept the "chr".

      /\\(["^\\])/ looks buggy. Are you sure those are the only symbols that can be escaped? If it's not buggy, you'll need to adjust the error message since no case will handle '\#', for example.
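A minimal sketch of the point about '\#', assuming the same four token patterns as the script above. The helper name first_token_kind and the sample strings are hypothetical, and \A stands in for \G because each sample is tested only at its start. It reports which branch, if any, would accept the first token of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: name the first token of $text using the same
# character classes as the script above, or return 'invalid' if no
# branch applies. \A replaces \G because we only look at position 0.
sub first_token_kind {
    my ($text) = @_;
    return 'literal'        if $text =~ m/\A [^\\] /x;
    return 'ucn'            if $text =~ m/\A \\u[0-9a-f]{4} /x;
    return 'literal escape' if $text =~ m/\A \\["^\\] /x;
    return 'control escape' if $text =~ m/\A \\[tnfr] /x;
    return 'end'            if $text =~ m/\A \z /x;
    return 'invalid';
}

# Single-quoted strings keep the backslash, so '\#' is two characters.
for my $sample ('a', '\u001a', '\^', '\t', '\#') {
    printf "%-6s => %s\n", $sample, first_token_kind($sample);
}
```

Under these assumptions, '\#' falls through every branch, so the final die in the script is reached by any backslash sequence outside the listed sets, not only by a single bad character.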

        I removed chr very soon after posting. Also, I was wondering why you had used \Z instead of \z. Frankly, though I know they're considered the modern anchors to use, I still find $ more immediately recognizable and less confusing than \Z and \z.
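For readers weighing the three anchors, here is a small sketch of how they differ on a string that still ends in a newline. The sample string and the helper name matches are hypothetical; without the /m modifier, $ and \Z both match before a final newline, while \z matches only at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical sample: a line that still has its trailing newline.
my $line = "abc\n";

# Return 1 or 0 so the results print cleanly.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

my %result = (
    '$'  => matches($line, qr/abc$/),   # end of string, or just before a final newline
    '\Z' => matches($line, qr/abc\Z/),  # same positions as $ when /m is not in effect
    '\z' => matches($line, qr/abc\z/),  # only the very end of the string
);

printf "%-2s %d\n", $_, $result{$_} for sort keys %result;
```

In the script above the line is chomped first, so the distinction matters only if a newline could remain at the end of the string after chomp.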

        The set of graphic characters that must be escaped is exactly { '"', '\', '^' }. The caret is the oddball. I think it's a carryover from another, different context in which control codes can be specified as two characters; e.g., ^Z. Though such control code sequences never occur in the text I'm lexing (they're represented instead as UCNs; e.g., \u001a), all literal carets in the text are nonetheless escaped (needlessly).

        Thanks again.