GrammyPuter has asked for the wisdom of the Perl Monks concerning the following question:

I have an epub file which somehow has garbage in it. I want to replace the garbage with appropriate html. Here is what I have.

$parse_string =~ s/[\xE2][\x80][\x9C]/&#34;/g;

This isn't working. I seek the wisdom of the Perl Monks on how to replace whole strings of unprintable garbage rendered as hex.

Thank You.

Replies are listed 'Best First'.
Re: Search and Replace Garbage
by kcott (Archbishop) on May 28, 2012 at 23:52 UTC

    "This isn't working." is a totally inadequate problem description.

    Both &#34; and, the less cryptic, &quot; are HTML character entity references for a double quote ("). Your regex works for me:

    $ perl -Mstrict -Mwarnings -E 'my $x = qq{X\xE2\x80\x9CX}; $x =~ s/[\xE2][\x80][\x9C]/&#34;/g; say $x'
    X&#34;X
    $ perl -Mstrict -Mwarnings -E 'my $x = qq{X\xE2\x80\x9CX}; $x =~ s/[\xE2][\x80][\x9C]/&quot;/g; say $x'
    X&quot;X

    You can also substitute a literal double quote:

    $ perl -Mstrict -Mwarnings -E 'my $x = qq{X\xE2\x80\x9CX}; $x =~ s/[\xE2][\x80][\x9C]/"/g; say $x'
    X"X

    Please read: How do I post a question effectively?

    -- Ken

Re: Search and Replace Garbage
by Anonymous Monk on May 28, 2012 at 20:37 UTC

    U+E280 is character 57984: Unicode = U+E280; Decimal = 57984; HTML = &#57984;.

    That you're seeing unprintable garbage rendered as hex is a feature of your browser: when you don't have a font that can display a given Unicode character, you see its hex value instead.

    Install some Asian Unicode fonts so you can see something other than hex.

      I think he means the character whose UTF-8 encoding is "E2 80 9C", U+201C LEFT DOUBLE QUOTATION MARK. However, it is still very unclear how these "garbage" characters are stored in his file -- a hex dump would be great.
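
      To make the encoding point concrete, here is a minimal sketch that hex-dumps the three bytes and decodes them. It assumes the file is UTF-8, which only a hex dump of the real file would confirm. The variable names are made up for the example. It also shows one possible reason a byte-level regex might fail: if the string has already been decoded into characters, the pattern has to match \x{201C} rather than the three bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three raw bytes, as they would look when read without any decoding.
my $bytes = "\xE2\x80\x9C";

# Hex dump of the raw bytes.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
print "bytes: $hex\n";

# Decode as UTF-8 (an assumption about the file's encoding).
my $chars = decode('UTF-8', $bytes);
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);

# The byte-level pattern matches only the raw bytes;
# the character-level pattern matches only the decoded string.
my $bytes_hit_raw     = $bytes =~ /\xE2\x80\x9C/ ? 1 : 0;
my $bytes_hit_decoded = $chars =~ /\xE2\x80\x9C/ ? 1 : 0;
my $char_hit_decoded  = $chars =~ /\x{201C}/     ? 1 : 0;
print "byte pattern on raw bytes:      $bytes_hit_raw\n";
print "byte pattern on decoded string: $bytes_hit_decoded\n";
print "char pattern on decoded string: $char_hit_decoded\n";
```

      If the script that produced $parse_string opened the file with a :encoding(UTF-8) layer, the character-level pattern would be the one to try.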

Re: Search and Replace Garbage
by GrammyPuter (Initiate) on May 28, 2012 at 21:13 UTC

    As I said, it's an EPub with garbage. It's not passing the validation. I need to replace the string with a quote, which is what it is supposed to be. The browser is displaying it correctly, but I want it to display a double-quote instead.

    Epubs have their own rules. I am working on an exploded Epub where the original file got corrupted, and I don't want to fix all the files in an editor - they are almost a gig.
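
    For fixing many files without opening each one in an editor, something along these lines might work. It is only a sketch under stated assumptions: the files are UTF-8, the "garbage" is the byte sequence E2 80 9C discussed above, and each individual file is small enough to read into memory. The helper name fix_left_quote is hypothetical, and the script overwrites files in place, so run it on a copy first.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the UTF-8 bytes for U+201C with &quot;.
# Other byte sequences would each need their own substitution.
sub fix_left_quote {
    my ($bytes) = @_;
    $bytes =~ s/\xE2\x80\x9C/&quot;/g;
    return $bytes;
}

# Process every file named on the command line, working on raw bytes.
for my $file (@ARGV) {
    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $content = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $fixed = fix_left_quote($content);
    next if $fixed eq $content;              # nothing to change

    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $fixed;
    close $out or die "Cannot close $file: $!";
}
```

    Saved under a name of your choosing, it would be run with the list of exploded EPUB content files as arguments. Perl's -p and -i switches can do the same job as a one-liner, with -i.bak keeping a backup of each file.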