in reply to Find what characters never appear

For every character in the file, set (or increment) $seen[ord($ch)]. When you're through the file, the unset elements of the array @seen (indices 0..255) are the bytes that didn't occur...
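
A minimal sketch of that counting approach, assuming the file should be treated as raw bytes. The helper name unseen_bytes, the 64 KiB block size, and the in-memory demo input are illustrative choices, not part of the original suggestion.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the byte values (0..255) that never
# occur in the data read from the given filehandle.
sub unseen_bytes {
    my ($fh) = @_;
    binmode $fh;                  # treat the input as raw bytes
    my @seen = (0) x 256;         # one counter per possible byte value
    local $/ = \65536;            # read fixed-size 64 KiB records, not lines
    while (my $block = <$fh>) {
        $seen[$_]++ for unpack 'C*', $block;
    }
    return grep { !$seen[$_] } 0 .. 255;
}

# Demo on an in-memory "file"; a real run would open the big file instead.
my $data = "hello world\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @missing = unseen_bytes($fh);
close $fh;
printf "%d byte values never appear\n", scalar @missing;
```

Reading in fixed-size blocks rather than lines keeps memory use bounded even when the file contains no newlines.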

Re^2: Find what characters never appear
by Narveson (Chaplain) on Sep 04, 2009 at 22:20 UTC

    If we've seen $chr once, can we somehow avoid repeating the assignment to $seen[ord($chr)] during the rest of the read?

    Can we avoid even testing $seen[ord($chr)]?

    I'd like to make a regex that matches any of our dwindling array of unseen characters, and update this regex every time I update $seen. Has anybody done this?

      If you want to avoid potential issues w/ regex metacharacters, you can use a set of hash keys to track what's been seen and rebuild the regex once for each character:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my %char_hash = ();
      $char_hash{ chr($_) } = undef foreach (33 .. 127);
      my $chars = join "", keys %char_hash;
      my $regex = "([\Q$chars\E])";

      while (<DATA>) {
          while (/$regex/g) {
              delete $char_hash{$1};
              $chars = join "", keys %char_hash;
              $regex = "([\Q$chars\E])";
          }
      }

      my @good_array = keys %char_hash;
      print @good_array;

      __DATA__
      !"#$%&'()*+,-./01234567
      89:;<=>?@ABCDE
      FGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnop
      qrstuvwxyz{|}~

      though I feel like there must be a simpler way of implementing this approach.

        This ran in just a few minutes against my big 2GB file.

        All I had to do was change the printable range to 33..126, change <DATA> to <>, and for my own curiosity, add print "$1 seen on line $.\n"; after delete $char_hash{$1};

      Maybe something like this (demo with reduced charset):

      #!/usr/bin/perl
      my $s = "fccccaaaaeaaaddaaaaabbcccaaacaaabbaaaa";
      my $set = "[abcdefg]";
      while ($s =~ /($set)/g) {
          my $ch = $1;
          $set =~ s/$ch//;  # remove $ch from search set
          printf "found %s at %d -> regex now: %s\n", $ch, pos($s), $set;
      }
      __END__
      found f at 1 -> regex now: [abcdeg]
      found c at 2 -> regex now: [abdeg]
      found a at 6 -> regex now: [bdeg]
      found e at 10 -> regex now: [bdg]
      found d at 14 -> regex now: [bg]
      found b at 21 -> regex now: [g]

      Update: kennethk noted that you would run into complications with regex metacharacters with this simple approach (when using the full ASCII set) — which is of course correct...
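
      One way to sidestep the metacharacter problem, sketched here under assumptions: keep the unseen characters in a hash and rebuild the character class with quotemeta after each hit, rather than editing the class string with s///. The sub name never_seen and the demo strings are hypothetical.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: return those of @candidates that never occur in $text.
      sub never_seen {
          my ($text, @candidates) = @_;
          my %unseen = map { $_ => 1 } @candidates;
          while (%unseen) {
              # Escape every character so ], ^, - and \ are taken literally.
              my $chars = join '', map { quotemeta } keys %unseen;
              # With /g in scalar context the search resumes at pos($text).
              last unless $text =~ /([$chars])/g;
              delete $unseen{$1};       # never look for this character again
          }
          return sort keys %unseen;
      }

      # Demo with characters that would break the unescaped character class.
      my @left = never_seen('a-b]c^d\\', 'a' .. 'd', 'x' .. 'z', '-', ']', '^', '\\');
      print "never seen: @left\n";
      ```

      The loop stops as soon as the shrinking class no longer matches anywhere in the rest of the text, so the regex is rebuilt at most once per candidate character.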