G'day htmanning,

Some of your regex modifiers don't make sense to me: for instance, /i in s/&//gi (an ampersand has no case to ignore) and /e in s/\n//eg (an empty replacement leaves nothing to evaluate). Sample input and output might have made your intent clearer.
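To show what I mean, here's a minimal sketch. The sample string is made up for the demonstration; it isn't taken from your feed. /i only matters for characters that have case, and /e evaluates the replacement as code, so neither should change the result here:

```perl
use 5.014;
use warnings;

# Hypothetical sample string, for illustration only.
my $sample = "Fish & Chips\nTea & Toast\n";

# Same substitution with and without each questionable modifier;
# /r returns the modified copy and leaves $sample untouched.
my $amp_with_i    = $sample =~ s/&//gir;
my $amp_without_i = $sample =~ s/&//gr;
my $nl_with_e     = $sample =~ s/\n//egr;
my $nl_without_e  = $sample =~ s/\n//gr;

say '/i made no difference' if $amp_with_i eq $amp_without_i;
say '/e made no difference' if $nl_with_e  eq $nl_without_e;
```

Both messages should be printed, which suggests the extra modifiers can simply be dropped.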

It's also unclear what you mean by "the binary code for these bullets" and why you'd need them. Does [PDF] "Unicode code chart: General Punctuation -- Range: 2000–206F" help you at all?
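If by "binary code" you mean the code points or the encoded bytes, Perl can report both. Here's a minimal sketch, assuming the characters of interest are the three bullet-like ones in the sample input further down (BULLET, BULLET OPERATOR and WHITE BULLET). describe() is a name I made up for this illustration:

```perl
use 5.014;
use warnings;
use Encode qw{encode};

# describe() is a hypothetical helper: it returns the code point of a
# single character followed by the hex of its UTF-8 encoded bytes.
sub describe {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return sprintf 'U+%04X = %s', ord $char,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

say describe($_) for "\x{2022}", "\x{2219}", "\x{25E6}";
# U+2022 = E2 80 A2
# U+2219 = E2 88 99
# U+25E6 = E2 97 A6
```

Of those, only U+2022 falls in the General Punctuation range mentioned above; U+2219 is in Mathematical Operators and U+25E6 is in Geometric Shapes.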

I put together this sample input:

$ cat pm_11148011_char_del.txt
A&
BÆ
Cæ
D
E
FìÌ
GíÍ
H•I∙J◦
K L
21°C 0.5µl 4,000Å

It doesn't show up there, but the fourth line is "D<carriage return><newline>" and the fifth is just "E<newline>".
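If you want to check your own data for invisible characters like these, one option is to print the control characters in a visible form. A minimal sketch follows; show_controls() is a hypothetical helper, not part of the script below:

```perl
use 5.014;
use warnings;

# show_controls() is a hypothetical helper: it replaces each ASCII control
# character (including CR and LF) with a visible <U+XXXX> marker.
sub show_controls {
    my ($str) = @_;
    return $str =~ s/([\x00-\x1F\x7F])/sprintf '<U+%04X>', ord $1/egr;
}

say show_controls("D\r\n");    # D<U+000D><U+000A>
say show_controls("E\n");      # E<U+000A>
```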

I would probably write something like the following (pm_11148011_char_del.pl) to do what I think you want:

#!/usr/bin/env perl

use 5.014;
use warnings;
use autodie;
use utf8;
use open IO => qw{:encoding(UTF-8) :std};

my $in_file = 'pm_11148011_char_del.txt';
my %replace = (
    'í' => "'",
    'Í' => "'",
);

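# Slurp the whole file: with $/ undefined, <$fh> reads everything at once.
my $text = do {
    local $/;
    open my $fh, '<', $in_file;
    <$fh>;
};

# y///dr deletes the unwanted characters and returns the result; s///egr then
# looks at every remaining character outside printable ASCII, swaps it if it
# has an entry in %replace, and otherwise leaves it as it is.
my $new_text = $text =~ y/&ÆæìÌ\r\n•∙◦//dr
    =~ s<([^ -~])><$replace{$1} // $1>egr;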

say 'Original text:';
say $text;
say 'New text:';
say $new_text;

If your Perl is older than v5.14, you'll need to make a few adjustments (the /r modifier on y/// and s///, for one, was added in v5.14). I'll assume you'll know what's required; ask if that's not the case.
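For what it's worth, here's a rough sketch of how the core of it might look on an older Perl, assuming v5.10 or later (for say and the // operator). I've wrapped it in a sub, strip_chars(), purely for illustration; that name and the test string are my own. Without /r, the usual idiom is to copy the string first and then modify the copy in place:

```perl
use 5.010;
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

my %replace = (
    'í' => "'",
    'Í' => "'",
);

# strip_chars() is a hypothetical wrapper around the two operations.
sub strip_chars {
    my ($text) = @_;

    # Copy first, then delete in place: stands in for y///dr.
    (my $new_text = $text) =~ y/&ÆæìÌ\r\n•∙◦//d;

    # Same substitution as in the main script, minus /r.
    $new_text =~ s<([^ -~])><$replace{$1} // $1>eg;

    return $new_text;
}

say strip_chars("A&\nGíÍ\nH•I\n21°C\n");    # AG''HI21°C
```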

Output from test run:

$ ./pm_11148011_char_del.pl
Original text:
A&
BÆ
Cæ
D
E
FìÌ
GíÍ
H•I∙J◦
K L
21°C 0.5µl 4,000Å


New text:
ABCDEFG''HIJK L21°C 0.5µl 4,000Å

I've used <pre> tags throughout so that you can see the characters instead of entities (that look like &#nnnn;).

Edit: Within <pre> tags, I needed some changes; e.g. < to &lt;, [ to &#91;, and so on. I hope I've caught them all; do let me know if I missed any and I'll fix them up also.

— Ken


In reply to Re: Stripping bad characters in rss by kcott
in thread Stripping bad characters in rss by htmanning
