These two subs emulate the 'show all characters' feature of most good text editors. Tabs are changed to ' -> ' and spaces to middle dots '·' (chr 183 in Latin-1; it is not an ASCII char). The funny backwards P symbol '¶' (the pilcrow, chr 182 in Latin-1) is used to mark newlines. For convenience the code refers to these chars by their octal escapes: chr 0266 = '¶' and chr 0267 = '·'.

In the unlikely event that these special chars are included in the text to be processed, we hex encode them using the standard URL encoding convention of a % followed by two hex digits (%B6 and %B7). We can then remove the encoding to regenerate our original text.

The second 4 lines of the data are the output of the show sub when run on the first 4 lines. Together they provide the test case for the encoding.

Efficiency hacks courtesy of bbfu and Hofmator.

sub show {
    my @data = @_;
    for (@data) {
        s/\266/%B6/g;
        s/\267/%B7/g;
        tr/ /\267/;
        s/\t/ -> /g;
        s/\n/\266\n/g;
    }
    return wantarray ? @data : join'',@data;
}

sub hide {
    my @data = @_;
    for (@data) {
        s/ -> /\t/g;
        tr/\266//d;
        tr/\267/ /;
        s/%B6/\266/g;
        s/%B7/\267/g;
    }
    return wantarray ? @data : join'',@data;
}

@data = (<DATA>)[0..7];
print "Original data, including specials\n\n";
print @data;
print "\n\nShow invisible chars in data\n\n";
print show(@data);
print "\n\nHide and show data - should not change\n\n";
print hide(show(@data));

__DATA__
	tab
    4 spaces, trailing tab	
	    tab and 4 spaces
		
 -> tab¶
····4·spaces,·trailing·tab -> ¶
 -> ····tab·and·4·spaces¶
 ->  -> ¶
¶

Re: Show All Characters in Text
by Hofmator (Curate) on Aug 08, 2001 at 15:13 UTC

    Just a few performance remarks

    • alternating on characters in a regex is not the way to go, use a character class.
    • for single character 1:1 replacements I'd use tr///
    • evaluation of the substituted values can be avoided
    • drop the useless /m modifier (which changes the behaviour of ^ and $, which you are not using anyway)
    Having said that, here is my version of the substitutions in show - for a more general (and quicker) approach I use a hash:
    BEGIN {
        my %mapping = (
            "\t"   => ' -> ',
            " "    => chr(0267),
            "\n"   => chr(0266)."\n",
            "\266" => '%B6',
            "\267" => '%B7',
        );
        my $pattern = qr/([@{[join '', keys %mapping]}])/;

        sub show {
            my @data = @_;
            s/$pattern/$mapping{$1}/g for (@data);
            return wantarray ? @data : join'',@data;
        }
    }
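The post above only gives the hash-based show. As a rough sketch of what a matching inverse might look like, the code below reverses the mapping by hand. The names `%unmapping`, `$unpattern` and `hide_by_table` are made up for illustration and are not from the thread. Since the tokens to undo are no longer single characters, this sketch falls back on an alternation rather than a character class.

```perl
use strict;
use warnings;

# Hypothetical inverse table: each visible token maps back to what it stood for.
my %unmapping = (
    ' -> ' => "\t",
    "\xB7" => ' ',
    "\xB6" => '',        # the pilcrow is only a marker; the newline itself is kept
    '%B6'  => "\xB6",
    '%B7'  => "\xB7",
);

# Longest tokens first, each one quoted so it is matched literally.
my $tokens = join '|',
    map  { quotemeta }
    sort { length $b <=> length $a } keys %unmapping;
my $unpattern = qr/($tokens)/;

sub hide_by_table {
    my @data = @_;
    # A single pass: replacement text is never rescanned, so a decoded
    # "\xB6" is not mistaken for a newline marker and deleted.
    s/$unpattern/$unmapping{$1}/g for @data;
    return wantarray ? @data : join '', @data;
}

my $shown = " -> a\xB7b%B6\xB6\n";
print hide_by_table($shown) eq "\ta b\xB6\n" ? "ok\n" : "not ok\n";
```

Doing all the replacements in one pass means the order of the individual steps no longer matters, which is the main attraction over a sequence of separate substitutions.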

    -- Hofmator

      All good points. Here is a little Benchmark that makes your point:

      use Benchmark;

      $iterations = 1000000;

      $name1 = "Alternation";
      $code1 = ' $_="ABBA"; s/(A|B)/X/g; ';

      $name2 = "Class";
      $code2 = ' $_="ABBA"; s/[AB]/X/g; ';

      $name3 = "Substitution";
      $code3 = ' $_="ABBA"; s/A/X/g; ';

      $name4 = "Transliterate";
      $code4 = ' $_="ABBA"; tr/A/X/; ';

      timethese($iterations, {
          $name1 => $code1,
          $name2 => $code2,
          $name3 => $code3,
          $name4 => $code4,
      } );

      __END__
      C:\>perl test.pl
      Benchmark: timing 1000000 iterations of Alternation, Class, Substitution, Transliterate...
      Alternation: 36 wallclock secs (35.92 usr + 0.00 sys = 35.92 CPU) @ 27839.64/s (n=1000000)
      Class: 23 wallclock secs (23.12 usr + 0.00 sys = 23.12 CPU) @ 43252.60/s (n=1000000)
      Substitution: 20 wallclock secs (20.81 usr + 0.00 sys = 20.81 CPU) @ 48053.82/s (n=1000000)
      Transliterate: 8 wallclock secs ( 8.07 usr + 0.00 sys = 8.07 CPU) @ 123915.74/s (n=1000000)
      C:\>

      But if we are going to get into a little recreational optimisation.....

      # mine.pl
      sub show {
          my @data = @_;
          for (@data) {
              s/\266/%B6/g;
              s/\267/%B7/g;
              tr/ /\267/;
              s/\t/ -> /g;
              s/\n/\266\n/g;
          }
          return wantarray ? @data : join'',@data;
      }

      @data = (<DATA>)[0..7];
      my $start = time;
      for (0..100000) {
          my @new = show(@data)
      }
      printf "Mine takes %d seconds", (time - $start);

      __DATA__
      	tab
          4 spaces, trailing tab	
      	    tab and 4 spaces
      		
       -> tab¶
      ····4·spaces,·trailing·tab -> ¶
       -> ····tab·and·4·spaces¶
       ->  -> ¶
      ¶

      # yours.pl
      my %mapping = (
          "\t"   => ' -> ',
          " "    => chr(0267),
          "\n"   => chr(0266)."\n",
          "\266" => '%B6',
          "\267" => '%B7',
      );
      my $pattern = qr/([@{[join '', keys %mapping]}])/;

      sub show {
          my @data = @_;
          s/$pattern/$mapping{$1}/g for (@data);
          return wantarray ? @data : join'',@data;
      }

      @data = (<DATA>)[0..7];
      my $start = time;
      for (0..100000) {
          my $new = show(@data)
      }
      printf "Yours takes %d seconds", (time - $start);

      __DATA__
      	tab
          4 spaces, trailing tab	
      	    tab and 4 spaces
      		
       -> tab¶
      ····4·spaces,·trailing·tab -> ¶
       -> ····tab·and·4·spaces¶
       ->  -> ¶
      ¶

      You can't use Benchmark fairly on yours, as the hash set up is a once-only cost, so I just use time and run a loop on the respective subs. If you run this code you will notice that my code takes half as long as yours, as there is no hash lookup involved. Had to salvage a little face :-)

      TIMTOWTDI!

      C:\>perl mine.pl
      Mine takes 59 seconds
      C:\>perl yours.pl
      Yours takes 119 seconds
      C:\>
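For what it's worth, one possible way to keep the one-off hash set up out of the timed code is to hand Benchmark code references instead of strings, so the hash and pattern are built once before timing starts. The following is only a sketch under that assumption. The names `show_seq` and `show_map` are invented here to let both versions live in one file, the test data is a guess at the first four data lines, and no timings are claimed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sequential version, as in mine.pl (renamed show_seq for this sketch).
sub show_seq {
    my @data = @_;
    for (@data) {
        s/\266/%B6/g;
        s/\267/%B7/g;
        tr/ /\267/;
        s/\t/ -> /g;
        s/\n/\266\n/g;
    }
    return wantarray ? @data : join '', @data;
}

# Hash-based version, as in yours.pl (renamed show_map for this sketch).
# The hash and the pattern are built once, outside the timed code.
my %mapping = (
    "\t"   => ' -> ',
    " "    => chr(0267),
    "\n"   => chr(0266) . "\n",
    "\266" => '%B6',
    "\267" => '%B7',
);
my $pattern = qr/([@{[ join '', keys %mapping ]}])/;

sub show_map {
    my @data = @_;
    s/$pattern/$mapping{$1}/g for @data;
    return wantarray ? @data : join '', @data;
}

# Assumed test data: tabs and runs of spaces, like the original test case.
my @data = (
    "\ttab\n",
    "    4 spaces, trailing tab\t\n",
    "\t    tab and 4 spaces\n",
    "\t\t\n",
);

# Make sure both subs agree before comparing their speed.
die "the two subs disagree\n" unless show_seq(@data) eq show_map(@data);

# A negative count asks Benchmark to run each entry for at least that many CPU seconds.
cmpthese(-1, {
    sequential => sub { my @new = show_seq(@data) },
    hash_based => sub { my @new = show_map(@data) },
});
```

Both entries call their sub in list context here, so the two are timed doing the same work.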

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

(bbfu) Re: Show All Characters in Text
by bbfu (Curate) on Aug 08, 2001 at 04:51 UTC

    What if the text contains URL encoded values; or text that "looks like" URL encoded values? I'm thinking that's more likely than the text containing the special characters, though I'm not certain.

    Sorry, don't have any good suggestions for getting around it. If this were primarily for displaying to a terminal, you could perhaps flag the real characters by doubling them with a backspace (^H) in between. *shrug*

    bbfu
    Seasons don't fear The Reaper.
    Nor do the wind, the sun, and the rain.
    We can be like they are.

      Yeah, good point. It occurs to me that you only need to decode %B6 and %B7, as these are the encodings of the two special chars. With this tweak in place the only failure cases are when the literal strings %B6 or %B7 already appear in the text you want to encode/decode. This is a pretty narrow failure zone and much better than the full URL decoding that I had there which, as you point out, would URL decode all %XX cases.

      This was meant more as a bit of fun anyway, encoding the specials was just to avoid looking really slack :-)

      You will always have some failure cases unless you reserve a flag char. 0xb6 and 0xb7 are not your usual run of the mill chars so you don't really have to encode them for most practical purposes and you don't even have to use the hide sub - just save the original string.
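      To illustrate the flag char idea: one way to reserve a flag char without giving up any input is to treat '%' itself as the flag and encode it too (as %25), which is what URL encoding does anyway. This is a sketch only. The subs `show_safe` and `hide_safe` are hypothetical variants, not the code posted above, and they assume byte strings as in the original.

```perl
use strict;
use warnings;

# Hypothetical variant of show: '%' is escaped as well, so text that
# already contains "%B6" or "%B7" can be told apart from encoded specials.
sub show_safe {
    my @data = @_;
    for (@data) {
        s/%/%25/g;        # escape the flag char first
        s/\xB6/%B6/g;     # then any literal pilcrow
        s/\xB7/%B7/g;     # and any literal middle dot
        tr/ /\xB7/;       # spaces become middle dots
        s/\t/ -> /g;      # tabs become ' -> '
        s/\n/\xB6\n/g;    # mark each newline with a pilcrow
    }
    return wantarray ? @data : join '', @data;
}

# Hypothetical variant of hide that undoes show_safe.
sub hide_safe {
    my @data = @_;
    for (@data) {
        s/ -> /\t/g;
        tr/\xB6//d;
        tr/\xB7/ /;
        # One pass only, so "%25B6" decodes to "%B6" and is not decoded again.
        s/%(25|B6|B7)/chr hex $1/ge;
    }
    return wantarray ? @data : join '', @data;
}

my $text = "50%B6 off\t\xB6 and \xB7 -> done\n";
my $back = hide_safe(show_safe($text));
print $back eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

      The price is that every '%' in the input shows up as %25 in the displayed text.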

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print