htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm building an RSS feed with Perl. Despite being asked to paste only plain text, people often copy and paste from Word or other word processors, and I end up with bad characters. I have been replacing a lot of characters, but inevitably there is one I don't catch. There seem to be some bullets that I can't catch, and they make the feed choke. Interestingly, if I print the feed to a flat file, it can be opened in a browser even with the bad characters, but if I print it dynamically from Perl, the browser shows an error. I guess my first question is: why would that be?

Here are a few of the characters I'm switching out. How do I find the binary code for these bullets that are pasted into the form so I can escape them?

$text =~ s/&//gi;
$text =~ s/Æ//gi;
$text =~ s/ì//gi;
$text =~ s/î//gi;
$text =~ s/\n//eg;
$text =~ s/\r//eg;
$text =~ s/í/\'/gi;
$text =~ s/-/ /gi;
$text =~ s/\x9s/-/gi;
$text =~ s/\x96/-/gi;
$text =~ s/\x95/\<li\>/gi;
$text =~ s/’/'/gi;
$text =~ s/\?/?/gi;
$text =~ s/0xa0/ /gi;
These are the bullets when copied out of the text, but using this code doesn't work:
$text =~ s/·//g; $text =~ s/o//g;

Replies are listed 'Best First'.
Re: Stripping bad characters in rss
by Corion (Patriarch) on Nov 06, 2022 at 19:53 UTC

    See HTML::Escape - most likely you don't want to escape stuff yourself.

    For your second question, you will need to think about the encoding that the text is in (as submitted to your program; see the charset parameter of the Content-Type header) and the encoding your source code is in (did you place use utf8; at the top of your script?).
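
    To make the encoding point concrete, here is a minimal sketch, assuming the form data arrives as Windows-1252 bytes (an assumption; the thread does not say which encoding the form actually submits). It uses the core Encode module to turn raw bytes into characters and then lists the code points of anything outside printable ASCII. The sample bytes are hypothetical.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical raw form input: Windows-1252 bytes for a bullet (0x95),
    # a no-break space (0xA0), and an en dash (0x96).
    my $bytes = "\x95 item\xA0one \x96 done";

    # Decode bytes into Perl characters; cp1252 is an assumed source encoding.
    my $text = decode('cp1252', $bytes);

    # Collect the code point of every character outside printable ASCII.
    my @found = map  { sprintf 'U+%04X', ord($_) }
                grep { ord($_) > 0x7E }
                split //, $text;
    print "@found\n";
    ```

    Under that assumption, the bytes \x95 and \x96 from the original substitutions turn out to be the bullet and the en dash, which is why matching them only works while the text is still undecoded bytes.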

Re: Stripping bad characters in rss
by kcott (Archbishop) on Nov 07, 2022 at 00:10 UTC

    G'day htmanning,

    Some of your regex modifiers aren't making any sense to me; for instance, /i in s/&//gi, /e in s/\n//eg, and so on. Perhaps with sample input and output that might have been clearer.

    It's also unclear what you mean by "the binary code for these bullets" and why you'd need them. Does [PDF] "Unicode code chart: General Punctuation -- Range: 2000–206F" help you at all?
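
    If "the binary code" means the code point of each pasted character, a minimal sketch like the following may help identify them. It assumes the text has already been decoded into Perl characters; the sample string is hypothetical and is written with escapes so the source stays plain ASCII. charnames::viacode comes from the core charnames module.

    ```perl
    use strict;
    use warnings;
    use charnames ();    # core module; viacode() maps a code point to its name

    # Hypothetical pasted text containing three bullet-like characters.
    my $text = "H\x{2022}I\x{2219}J\x{25E6}";

    # Report code point and Unicode name for each character outside printable ASCII.
    my @report;
    for my $char (split //, $text) {
        next if $char =~ /[ -~]/;
        my $cp = ord $char;
        push @report, sprintf 'U+%04X %s', $cp, charnames::viacode($cp) // 'unknown';
    }
    print "$_\n" for @report;
    ```

    Once the code point is known, it can be looked up in the Unicode charts or written directly in a regex as \x{...}.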

    I put together this sample input:

    $ cat pm_11148011_char_del.txt
    A&
    BÆ
    Cæ
    D
    E
    FìÌ
    GíÍ
    H•I∙J◦
    K L
    21°C 0.5µl 4,000Å
    

    It doesn't show up there, but the fourth line is "D<carriage return><newline>" and the fifth is just "E<newline>".

    I would probably write something like the following (pm_11148011_char_del.pl) to do what I think you want:

    #!/usr/bin/env perl
    
    use 5.014;
    use warnings;
    use autodie;
    use utf8;
    use open IO => qw{:encoding(UTF-8) :std};
    
    my $in_file = 'pm_11148011_char_del.txt';
    my %replace = (
        'í' => "'",
        'Í' => "'",
    );
    
    my $text = do {
        local $/; open my $fh, '<', $in_file; <$fh>;
    };
    
    my $new_text = $text =~ y/&ÆæìÌ\r\n•∙◦//dr
        =~ s<([^ -~])><$replace{$1} // $1>egr;
    
    say 'Original text:';
    say $text;
    say 'New text:';
    say $new_text;
    

    If your Perl version is less than v5.14, you'll need to make a few adjustments. I'll assume you'll know what's required; ask if that's not the case.

    Output from test run:

    $ ./pm_11148011_char_del.pl
    Original text:
    A&
    BÆ
    Cæ
    D
    E
    FìÌ
    GíÍ
    H•I∙J◦
    K L
    21°C 0.5µl 4,000Å
    
    
    New text:
    ABCDEFG''HIJK L21°C 0.5µl 4,000Å
    

    I've used <pre> tags throughout so that you can see the characters instead of entities (that look like &#nnnn;).

    Edit: Within <pre> tags, I needed some changes; e.g. < to &lt;, [ to &#91;, and so on. I hope I've caught them all; do let me know if I missed any and I'll fix them up also.

    — Ken

Re: Stripping bad characters in rss
by haukex (Archbishop) on Nov 07, 2022 at 09:28 UTC
    $text =~ s/Æ//gi; $text =~ s/ì//gi;

    This hints to me that you might be having encoding issues rather than people actually entering these characters. Since you haven't shown the code that actually produces $text, I can't make any suggestions as to how you can fix it, but you should be fixing this at the source.

    $text =~ s/&//gi; ... $text =~ s/\x95/\<li\>/gi;

    Are you processing the raw RSS (XML) directly? I already said it two weeks ago: Do not use regular expressions to process XML/HTML. In this case the correct solution would be to either use an XML module and only apply the regexes to the content of text nodes, or to process the text before putting it in the XML.
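
    As a rough illustration of the second option (process the text first, then put it in the XML), here is a minimal core-only sketch. The helper name xml_escape_text and the sample description are hypothetical; in a real feed an XML module would normally take care of the escaping when it creates the text node.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: escape the five XML special characters in plain text.
    sub xml_escape_text {
        my ($text) = @_;
        my %entity = (
            '&' => '&amp;',  '<' => '&lt;', '>' => '&gt;',
            '"' => '&quot;', "'" => '&apos;',
        );
        $text =~ s/([&<>"'])/$entity{$1}/g;
        return $text;
    }

    # Clean up the user text first, escape it, and only then add the markup.
    my $description = 'Fish & chips <today>';
    my $item = '<description>' . xml_escape_text($description) . '</description>';
    print "$item\n";
    ```

    With this ordering the ampersand is kept as &amp;amp; in the feed instead of being deleted, and no regex ever runs over the XML tags themselves.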

    $text =~ s/’/'/gi;

    I might suggest Text::Unidecode for this kind of replacement.

    $text =~ s/\x9s/-/gi;

    Use strict and warnings.

    These are the bullets when copied out of the text, but using this code doesn't work:

    Did you save your Perl file as UTF-8 and declare use utf8;? SSCCE.
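
    One way to sidestep the source-encoding question entirely is to write the characters as code point escapes. This is only a sketch: it assumes $text has already been decoded into characters, and the sample input and the choice of U+2022 (bullet) and U+00B7 (middle dot) are assumptions about what was pasted.

    ```perl
    use strict;
    use warnings;

    # Hypothetical decoded input containing a bullet and a middle dot.
    my $text = "\x{2022} first \x{00B7} second";

    # \x{...} names the code points directly, so the match does not depend
    # on how the source file is saved or on whether "use utf8" is in effect.
    (my $clean = $text) =~ s/[\x{2022}\x{00B7}]\s*//g;
    print "$clean\n";
    ```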

Re: Stripping bad characters in rss
by Anonymous Monk on Nov 07, 2022 at 04:40 UTC
    Rather than trying to come up with a blacklist, which could keep shifting, why not use e.g. https://www.asciitable.com/ to create some regexes that only accept character ranges that are valid for you (e.g. 0-9, a-z, A-Z and a reasonable set of punctuation characters) and reject all others?
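
    A minimal sketch of that whitelist idea, assuming printable ASCII (space through tilde) is the accepted range; the sample input is hypothetical:

    ```perl
    use strict;
    use warnings;

    # Hypothetical input with a bullet, a curly apostrophe, and a control character.
    my $text = "Caf\x{2022} menu\x{2019}s \x07special!";

    # Delete everything outside the assumed whitelist of printable ASCII.
    # /c complements the listed range and /d deletes whatever matches.
    (my $clean = $text) =~ tr/\x20-\x7E//cd;
    print "$clean\n";
    ```

    The trade-off is that legitimate non-ASCII characters (accented letters, the degree sign, and so on) are dropped as well, so the accepted range may need widening for real content.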