Substitution bug on Unicode strings with Byte Order Mark (BOM)

tbusch has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, the following:

my $html = "\x{feff}<!DOCTYPE HTML PUBLIC> <div class=\"course_detail_box_content\"> </div>";
$html =~ s!.*?</div><div\s+class="course_detail_box_content">!!s;

yields the error

Malformed UTF-8 character (overflow at 0x3c8443c3, byte 0x54, after start byte 0xbf) in substitution (s///)

for Perl 5.8.6 on Mac OS X 10.4, Perl 5.8.8 on CentOS 5, and Perl 5.10.1 on CentOS 6, but it seems to work for 5.12.3 on Mac OS X 10.7. Can someone confirm that this is a known bug and that it has been fixed from 5.12 onwards?


Replies are listed 'Best First'.
Re: Substitution bug on Unicode strings with Byte Order Mark (BOM) by moritz (Cardinal) on May 15, 2012 at 11:40 UTC
I don't know if it's known, but I can confirm it's not present in perl 5.14. With perl 5.10.1 I get warnings:

`Malformed UTF-8 character (unexpected continuation byte 0xbb, with no preceding start byte) in substitution (s///) at foo.pl line 2.`
`Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in substitution (s///) at foo.pl line 2.`
`Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in substitution (s///) at foo.pl line 2.`

Perl 6 - the future is here, just unevenly distributed
Re: Substitution bug on Unicode strings with Byte Order Mark (BOM) by ikegami (Patriarch) on May 17, 2012 at 19:49 UTC
A workaround is to start the pattern with `(?!(?!)\x{100})` (where 0x100 is an arbitrary non-byte character, i.e. a code point above 0xFF).

`my $html = "\x{feff}<!DOCTYPE HTML PUBLIC> <div class=\"course_detail_box_content\"> </div>";`
`$html =~ s{(?!(?!)\x{100}).*?</div><div\s+class="course_detail_box_content">}{}s;`

`(?!)` can never match, so `(?!(?!)...)` always matches exactly zero characters, and serves as a mechanism to include a non-byte character in the pattern.
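
To make the "matches exactly zero characters" claim concrete, here is a minimal sketch that checks the prefix on its own and then uses it in the substitution. The helper names `prefix_is_zero_width` and `strip_to_content` and the sample input are made up for illustration; they are not from the original post. On a perl where the bug is already fixed, the prefix should not change the result, so this only shows that the workaround is harmless, not that it suppresses the warning.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the workaround prefix matches at offset 0
# and the match ends at offset 0, i.e. it consumed no characters.
sub prefix_is_zero_width {
    my ($text) = @_;
    return ( $text =~ /^(?!(?!)\x{100})/ && $+[0] == 0 ) ? 1 : 0;
}

# Hypothetical helper: remove everything up to and including the opening
# tag of the content div, with the workaround prefix in front.
sub strip_to_content {
    my ($html) = @_;
    $html =~ s{(?!(?!)\x{100}).*?</div><div\s+class="course_detail_box_content">}{}s;
    return $html;
}

# Made-up sample input: starts with a BOM and contains the marker sequence.
my $sample = "\x{feff}<div>nav</div><div class=\"course_detail_box_content\">body</div>";

print prefix_is_zero_width($sample), "\n";   # prints 1
print strip_to_content($sample), "\n";       # prints body</div>
```

Because the inner `(?!)` fails at every position, the `\x{100}` after it is never actually compared against the input; it is only there so that the pattern itself contains a character above 0xFF.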