PerlMonks  

Re: UTF-8 text files with Byte Order Mark

by ikegami (Patriarch)
on Feb 13, 2007 at 17:55 UTC ( [id://599731]=note )


in reply to UTF-8 text files with Byte Order Mark

so I kinda assume that Perl will handle with this kind of stuff for me.

Having Perl remove the BOM automatically would be bad. print while <$fh>; would no longer print out a file exactly, for example. It wouldn't be possible to print out a file exactly by other means either.

However, if file contains that BOM, my program does not understand the first line in the file

Patient: "Doctor, it hurts when I do this."
Doctor: "So don't do it!"

If your program doesn't accept BOMs, don't feed it any. BOMs are not required.

Alternatively, you could change your spec and your program to accept it.

while (<$fh>) { s/\x{FEFF}//g; ... }
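
For what it's worth, here is a minimal sketch of that second option, assuming the file is opened with a UTF-8 decoding layer so that the BOM arrives as the single character U+FEFF. It is a narrower variant that strips the character only at the very start of the first line. The helper name read_lines_without_bom and the temp-file demo are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of a UTF-8 file,
# dropping a leading BOM if one is present.
sub read_lines_without_bom {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Can't open $path: $!";
    my @lines;
    while (my $line = <$fh>) {
        # After decoding, the BOM is the single character U+FEFF.
        # Only the start of the first line is touched.
        $line =~ s/^\x{FEFF}// if $. == 1;
        push @lines, $line;
    }
    close($fh);
    return @lines;
}

# Demo: write a temp file that starts with the UTF-8 BOM bytes EF BB BF.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xEF\xBB\xBFfirst line\nsecond line\n";
close($tmp_fh);

my @lines = read_lines_without_bom($tmp_path);
print @lines;
```

Without the decoding layer the BOM shows up as the three bytes EF BB BF instead, and a pattern matching \x{FEFF} would not match them.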

Re^2: UTF-8 text files with Byte Order Mark
by muba (Priest) on Feb 13, 2007 at 20:05 UTC
    Patient: "Doctor, it hurts when I do this."
    Doctor: "So don't do it!"

    Easy to say, of course, but what if the program one of my users uses stores that BOM anyway? Besides, as pointed out, a BOM in a UTF-8 file *is* valid, so I feel I should support it. Look, if the user were toying around with malformed files I'd be more than happy to tell him to get that fixed :D but apparently he's doing what he rightly thinks is right.

      a BOM in a UTF-8 file *is* valid

      "!" in an ASCII file is also valid. But if you place a "!" at the start of your Perl program, it probably will not compile. It is a malformed file, not from a Unicode perspective, but from your parser's perspective.

      I provided two alternatives (removing the BOM and File::BOM) that will work with your broken tools (i.e. tools that add undesirable characters to the files you edit). I'd go with them since allowing the BOM is surely a good thing.

        Ouch. I'm afraid I used the wrong tone in my previous reply. You see, I am now removing that BOM myself (as you can read below). I never meant to attack or criticize you. In fact, I much appreciate your input!

Re^2: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Jul 24, 2019 at 20:56 UTC
    "If your program doesn't accept BOMs, don't feed it any. BOMs are not required. "

    This is a mindbogglingly stupid statement that ignores or even stands on its head the Robustness principle.
    Anyone who writes something so inane and so dangerous should be barred for life from software development.
