tbusch has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, the following:
my $html = "\x{feff}<!DOCTYPE HTML PUBLIC> <div class=\"course_detail_box_content\"> </div>";
$html =~ s!.*?</div><div\s+class="course_detail_box_content">!!s;
yields the error
Malformed UTF-8 character (overflow at 0x3c8443c3, byte 0x54, after start byte 0xbf) in substitution (s///)
for Perl 5.8.6 on Mac OS X 10.4, Perl 5.8.8 on CentOS 5, and Perl 5.10.1 on CentOS 6, but it seems to work for 5.12.3 on Mac OS X 10.7. Can someone confirm that this is a known bug and that it has been fixed from 5.12 onwards?

Re: Substitution bug on Unicode strings with Byte Order Mark (BOM)
by moritz (Cardinal) on May 15, 2012 at 11:40 UTC

    I don't know if it's known, but I can confirm it's not present in perl 5.14.

    With perl 5.10.1 I get warnings:

    Malformed UTF-8 character (unexpected continuation byte 0xbb, with no preceding start byte) in substitution (s///) at foo.pl line 2.
    Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in substitution (s///) at foo.pl line 2.
    Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in substitution (s///) at foo.pl line 2.
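
The 0xbb and 0xbf bytes named in these warnings line up with the UTF-8 encoding of the BOM itself: U+FEFF encodes as the three bytes EF BB BF. A minimal sketch to check that, assuming the core Encode module is available (the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the BOM character (U+FEFF) as UTF-8 and dump its bytes in hex.
my $bytes = encode('UTF-8', "\x{feff}");
my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

print "$hex\n";    # ef bb bf
```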
Re: Substitution bug on Unicode strings with Byte Order Mark (BOM)
by ikegami (Patriarch) on May 17, 2012 at 19:49 UTC
    A workaround is to start the pattern with (?!(?!)\x{100}) (where 0x100 is an arbitrary non-byte character).
    my $html = "\x{feff}<!DOCTYPE HTML PUBLIC> <div class=\"course_detail_box_content\"> </div>";
    $html =~ s{(?!(?!)\x{100}).*?</div><div\s+class="course_detail_box_content">}{}s;

    (?!(?!)...) always matches exactly zero characters, and serves as a mechanism to include a non-byte character in the pattern.
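
The zero-width claim can be checked directly. The following is only a sketch: `$force_wide`, `@spans`, and `@warnings` are made-up names, and the warning counter is there so the script can be run on an affected perl to see whether the workaround silences the messages.

```perl
use strict;
use warnings;

# $force_wide is a hypothetical name for the prefix from the workaround.
my $force_wide = qr/(?!(?!)\x{100})/;

# (?!) never matches, so the inner "(?!)\x{100}" never matches either,
# which means the outer negative lookahead always succeeds with width 0.
my @spans;
for my $str ('', 'abc', "\x{feff}abc") {
    $str =~ $force_wide or die "prefix unexpectedly failed to match";
    push @spans, [ $-[0], $+[0] ];    # start and end offsets of the match
}

# Apply the workaround to the string from the question, collecting warnings.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $html   = "\x{feff}<!DOCTYPE HTML PUBLIC> <div class=\"course_detail_box_content\"> </div>";
my $before = $html;
$html =~ s{$force_wide.*?</div><div\s+class="course_detail_box_content">}{}s;

print "match spans: ", join(', ', map { "[$_->[0],$_->[1]]" } @spans), "\n";
print "warnings: ", scalar(@warnings), "\n";
print "unchanged: ", ($html eq $before ? "yes" : "no"), "\n";
```

Note that the sample string contains no `</div><div` sequence, so the substitution has nothing to remove and `$html` should come out unchanged; what differs between perl versions is only whether warnings are emitted along the way.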