G'day htmanning,
Some of your regex modifiers don't make sense to me; for instance, /i in s/&//gi (an ampersand has no case to ignore), /e in s/\n//eg (the replacement is empty, so there's no code to evaluate), and so on. Sample input and expected output would have made your intent clearer.
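To illustrate what those two modifiers actually do, here's a minimal sketch. The sample strings are made up for the demonstration; they're not from your data.

```perl
use strict;
use warnings;

# /i makes letters match case-insensitively; '&' has no case,
# so adding /i changes nothing here.
my $sample = 'Fish & Chips & Peas';
(my $with_i    = $sample) =~ s/&//gi;
(my $without_i = $sample) =~ s/&//g;
print "Same result with and without /i\n" if $with_i eq $without_i;

# /e evaluates the replacement as Perl code; it's only useful
# when the replacement has to be computed.
(my $doubled = 'qty: 21') =~ s/(\d+)/$1 * 2/e;
print "$doubled\n";    # prints "qty: 42"
```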
It's also unclear what you mean by "the binary code for these bullets" and why you'd need them. Does [PDF] "Unicode code chart: General Punctuation -- Range: 2000–206F" help you at all?
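If by "binary code" you mean the code points of the bullet characters, something like the following sketch would list them. The helper name code_points is my own invention, and I'm assuming the source file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the source contains literal non-ASCII characters

# Hypothetical helper: return the Unicode code point of each
# character in a string, formatted as U+XXXX.
sub code_points {
    my ($string) = @_;
    return map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# The three bullet-like characters from my sample input.
print join(' ', code_points('•∙◦')), "\n";
```

You can then look those values up in the relevant code chart.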
I put together this sample input:
$ cat pm_11148011_char_del.txt
A&
BÆ
Cæ
D
E
FìÌ
GíÍ
H•I∙J◦
K L
21°C 0.5µl 4,000Å
It doesn't show up there, but the fourth line is "D<carriage return><newline>" and the fifth is just "E<newline>".
I would probably write something like the following (pm_11148011_char_del.pl) to do what I think you want:
#!/usr/bin/env perl
use 5.014;
use warnings;
use autodie;
use utf8;
use open IO => qw{:encoding(UTF-8) :std};
my $in_file = 'pm_11148011_char_del.txt';
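# Non-ASCII characters to replace rather than delete.
my %replace = (
    'í' => "'",
    'Í' => "'",
);
# Slurp the whole file into one string (autodie handles open errors).
my $text = do {
    local $/;
    open my $fh, '<', $in_file;
    <$fh>;
};
# Delete the unwanted characters, then pass every remaining character
# outside printable ASCII through %replace, keeping it if it has no entry.
my $new_text = $text =~ y/&ÆæìÌ\r\n•∙◦//dr
                     =~ s<([^ -~])><$replace{$1} // $1>egr;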
say 'Original text:';
say $text;
say 'New text:';
say $new_text;
If your Perl version is less than v5.14, you'll need to make a few adjustments. I'll assume you'll know what's required; ask if that's not the case.
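As a rough guide to those adjustments: the /r modifier needs v5.14, while say and the // operator need v5.10. Here's a sketch of the same character handling without any of them. The helper name strip_chars and the short test string are mine, purely for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: same deletions and replacements as the script
# above, written without /r, say, or //.
sub strip_chars {
    my ($text) = @_;
    my %replace = ('í' => "'", 'Í' => "'");

    # No /r: copy first, then modify the copy in place.
    (my $new_text = $text) =~ tr/&ÆæìÌ\r\n•∙◦//d;

    # No //: test with exists and the ternary operator instead.
    $new_text =~ s<([^ -~])><exists $replace{$1} ? $replace{$1} : $1>eg;

    return $new_text;
}

# No say: use print with an explicit newline.
print strip_chars("A&\nBÆ\nGíÍ\n21°C\n"), "\n";
```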
Output from test run:
$ ./pm_11148011_char_del.pl
Original text:
A&
BÆ
Cæ
D
E
FìÌ
GíÍ
H•I∙J◦
K L
21°C 0.5µl 4,000Å

New text:
ABCDEFG''HIJK L21°C 0.5µl 4,000Å
I've used <pre> tags throughout so that you can see the characters instead of entities (that look like &#nnnn;).
Edit: Within <pre> tags, I needed some changes; e.g. < to &lt;, [ to &#91;, and so on. I hope I've caught them all; do let me know if I missed any and I'll fix them up also.
— Ken
In reply to Re: Stripping bad characters in rss
by kcott
in thread Stripping bad characters in rss
by htmanning