Re: Find what characters never appear

Re^2: Find what characters never appear by Narveson (Chaplain) on Sep 04, 2009 at 22:20 UTC
If we've seen `$chr` once, can we somehow avoid repeating the assignment to `$seen[ord($chr)]` during the rest of the read? Can we avoid even testing `$seen[ord($chr)]`? I'd like to make a regex that matches any of our dwindling array of unseen characters, and update this regex every time I update `$seen`. Has anybody done this?
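For context, the per-character bookkeeping this question refers to might look roughly like the sketch below. The parent post is not quoted in this thread, so the sub name `unseen_chars`, the 33..126 range and the in-memory sample input are assumptions made for illustration, not the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the printable ASCII characters (33..126)
# that never occur in the data read from $fh.
sub unseen_chars {
    my ($fh) = @_;
    my @seen;
    while ( my $line = <$fh> ) {
        # Every character costs an ord() and an assignment,
        # even if it has already been seen many times.
        $seen[ ord($_) ] = 1 for split //, $line;
    }
    return grep { !$seen[ ord($_) ] } map { chr($_) } 33 .. 126;
}

# Sample input held in memory (an assumption; real use would read a file).
my $text = "Hello, World!\n";
open my $fh, '<', \$text or die "in-memory open failed: $!";
my @missing = unseen_chars($fh);
print scalar(@missing), " characters never appear\n";
```

The cost being asked about is the inner `for` loop: it does the same work for the millionth `e` as for the first one.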
Re^3: Find what characters never appear by kennethk (Abbot) on Sep 04, 2009 at 23:21 UTC
If you want to avoid potential issues with regex metacharacters, you can use a set of hash keys to track which characters are still unseen and rebuild the regex each time a new character is found:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Start with every candidate character as a hash key.
    my %char_hash = ();
    $char_hash{ chr($_) } = undef foreach (33 .. 127);

    # \Q...\E escapes metacharacters inside the character class.
    my $chars = join "", keys %char_hash;
    my $regex = "([\Q$chars\E])";

    while (<DATA>) {
        while (/$regex/g) {
            # Drop the character just found and rebuild the class.
            delete $char_hash{$1};
            $chars = join "", keys %char_hash;
            $regex = "([\Q$chars\E])";
        }
    }

    # Whatever is left never appeared in the input.
    my @good_array = keys %char_hash;
    print @good_array;

    __DATA__
    !"#$%&'()*+,-./01234567
    89:;<=>?@ABCDE
    FGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnop
    qrstuvwxyz{\|}~

though I feel like there must be a simpler way of implementing this approach.
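One possibly simpler shape for the same idea is sketched below. It is not kennethk's code: the helper names `class_for` and `never_seen` are hypothetical, the range is assumed to be 33..126, and the early exit once every character has been found is an addition (an empty class `[]` would not compile, so the loop has to stop at that point anyway).

```perl
use strict;
use warnings;

# Hypothetical helper: build a capturing character class from the
# characters still unseen, or return undef once nothing is left.
sub class_for {
    my ($unseen) = @_;
    return undef unless %$unseen;
    my $chars = join '', map { quotemeta($_) } keys %$unseen;
    return qr/([$chars])/;
}

# Hypothetical helper: return the characters in 33..126 absent from $fh.
sub never_seen {
    my ($fh) = @_;
    my %unseen;
    $unseen{ chr($_) } = 1 for 33 .. 126;
    my $re = class_for( \%unseen );
    LINE: while ( my $line = <$fh> ) {
        while ( $line =~ /$re/g ) {
            delete $unseen{$1};
            # Stop reading as soon as every character has turned up.
            $re = class_for( \%unseen ) or last LINE;
        }
    }
    return sort keys %unseen;
}

# Sample input held in memory (an assumption; real use would read a file).
my $text = "The quick brown fox jumps over the lazy dog.\n";
open my $fh, '<', \$text or die "in-memory open failed: $!";
my @absent = never_seen($fh);
print "never seen: @absent\n";
```

Because the class only ever shrinks, the regex is rebuilt at most once per candidate character, however large the input is.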
Re^4: Find what characters never appear by Narveson (Chaplain) on Sep 05, 2009 at 13:35 UTC
This ran in just a few minutes against my big 2GB file. All I had to do was change the printable range to 33..126, change <DATA> to <>, and for my own curiosity, add `print "$1 seen on line $.\n";` after `delete $char_hash{$1};`
Re^3: Find what characters never appear by almut (Canon) on Sep 04, 2009 at 23:01 UTC
Maybe something like this (demo with reduced charset):

    #!/usr/bin/perl

    my $s = "fccccaaaaeaaaddaaaaabbcccaaacaaabbaaaa";
    my $set = "[abcdefg]";

    while ($s =~ /($set)/g) {
        my $ch = $1;
        $set =~ s/$ch//;  # remove $ch from search set
        printf "found %s at %d -> regex now: %s\n", $ch, pos($s), $set;
    }

    __END__
    found f at 1 -> regex now: [abcdeg]
    found c at 2 -> regex now: [abdeg]
    found a at 6 -> regex now: [bdeg]
    found e at 10 -> regex now: [bdg]
    found d at 14 -> regex now: [bg]
    found b at 21 -> regex now: [g]

Update: kennethk noted that you would run into complications with regex metacharacters with this simple approach (when using the full ASCII set), which is of course correct...
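A hedged sketch of how the same loop could cope with metacharacters: keep the remaining characters in a plain string and run it through `quotemeta` each time the class is built. The sample string and the variable names `$remaining` and `$class` are made up for this illustration and are not from almut's post.

```perl
use strict;
use warnings;

my $s         = "a+b.c]a^a-b\\c";
my $remaining = 'abc+.]^-\\*';    # characters still to be found

while ( length $remaining ) {
    # quotemeta neutralises ], ^, - and \ inside the class.
    my $class = '[' . quotemeta($remaining) . ']';
    last unless $s =~ /($class)/g;
    my $ch = $1;
    # Remove $ch literally; \Q...\E keeps it from acting as a pattern.
    $remaining =~ s/\Q$ch\E//;
    printf "found %s at %d, %d left\n", $ch, pos($s), length $remaining;
}

print "never found: $remaining\n";
```

The match position belongs to `$s` rather than to the pattern, so rebuilding the class on every pass does not restart the scan.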