PerlMonks  


chomp() is multi-platform. It will delete <CR><NL> and <NL>, even on Windows. These line endings even if mixed will not matter.

Well, it may look so, but what really happens is different. See chomp:

This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module).

Note: not a single word about the CR or LF control characters, the CR-LF pair, or NL (newline).

The input record separator $/ is documented in perlvar. It defaults to an abstract "newline" character:

The input record separator, newline by default. This influences Perl's idea of what a "line" is. [...] See also Newlines in perlport.
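To make the quoted documentation concrete, here is a minimal sketch of chomp following $/ rather than any fixed set of control characters. The variable names and the "END" separator are made up for illustration.

```perl
use strict;
use warnings;

# With the default $/ ("\n"), chomp removes one trailing logical newline.
my $line = "record one\n";
chomp $line;                         # $line is now "record one"

# chomp removes whatever $/ currently holds, not "CR" or "LF" as such.
my $rec = "payloadEND";
my $removed;
{
    local $/ = "END";                # hypothetical record separator
    $removed = chomp $rec;           # chomp returns the number of characters removed
}

# A carriage return is not part of the default $/, so it survives.
my $crlf = "text\r\n";
chomp $crlf;                         # $crlf is now "text\r"
```

The last case is the one that matters for this thread: with the default $/, only the "\n" goes away.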

Now, "newlines". Perl has inherited them from C, which uses two modes for accessing files: text mode and binary mode. In text mode, the system's native line ending, whatever that may be, is translated from or to a logical newline, also known as "\n". In binary mode, file content is not modified during read or write. C has been defined in a way that the logical newline is identical with the native line ending on Unix, LF. So, there is no difference between text mode and binary mode on Unix.

Quoting Newlines in perlport:

In most operating systems, lines in files are terminated by newlines. Just what is used as a newline may vary from OS to OS. Unix traditionally uses \012, one type of DOSish I/O uses \015\012, Mac OS uses \015, and z/OS uses \025.

Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF.

What happens here is that Perl has reasonable defaults for text handling: it opens files (including STDIN, STDOUT, STDERR) in text mode by default, and $/ defaults to a single logical newline ("\n"). Native line endings are therefore translated to "\n" on input, and chomp then just removes that "\n", on any platform.
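The translation step can be sketched on any platform by reading the same bytes through in-memory filehandles, once with the :raw layer and once with the :crlf layer. This only imitates what a DOSish perl does by default in text mode. The sample data and variable names are assumptions for the demonstration.

```perl
use strict;
use warnings;

# The same CRLF-terminated bytes, read through two different layer stacks.
my $bytes = "first\015\012second\015\012";

# No translation: the lines arrive as "first\r\n" and "second\r\n".
open my $raw, '<:raw', \$bytes or die "open: $!";
my @raw_lines = <$raw>;
close $raw;
chomp @raw_lines;                    # removes "\n" only, "\r" stays

# With :crlf, "\015\012" is translated to "\n" before chomp ever sees it.
open my $txt, '<:crlf', \$bytes or die "open: $!";
my @text_lines = <$txt>;
close $txt;
chomp @text_lines;                   # nothing but "\n" left to remove
```

In both cases chomp does exactly the same thing. The difference comes entirely from the I/O layer.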

When reading text files using a non-native line ending, things will usually go wrong:

/tmp/demo>file *.txt
linux-file.txt:   ASCII text
mac-file.txt:     ASCII text, with CR line terminators
windows-file.txt: ASCII text, with CRLF line terminators
/tmp/demo>perl -MData::Dumper -E '$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,"<",$fn or die; @lines=<$f>; chomp @lines; say "$fn:"; say Dumper(\@lines); }' *.txt
linux-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Linux with Unix",
          "line endings."
        ];

mac-file.txt:
$VAR1 = [
          "A simple file generated\ron Windows with Old Mac\rline endings.\r"
        ];

windows-file.txt:
$VAR1 = [
          "A simple file generated\r",
          "on Windows with Windows\r",
          "line endings.\r"
        ];

/tmp/demo>

Of course, it depends on the system you are using:

H:\tmp\demo>perl -MWin32::autoglob -MData::Dumper -E "$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,'<',$fn or die; @lines=<$f>; chomp @lines; say qq<$fn:>; say Dumper(\@lines); }" *.txt
linux-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Linux with Unix",
          "line endings."
        ];

mac-file.txt:
$VAR1 = [
          "A simple file generated\ron Windows with Old Mac\rline endings.\r"
        ];

windows-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Windows with Windows",
          "line endings."
        ];

H:\tmp\demo>

So, chomp is NOT cross-platform. It can handle input from native text files on all platforms out of the box. But if you have to work with ASCII files with mixed line endings (CR, LF, CR-LF, LF-CR), chomp can't work reliably. This is neither chomp's fault nor perl's.
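If you do have to cope with such files, one possible workaround (my sketch here, not something chomp or $/ gives you) is to read the data unmodified and handle the four endings yourself. The helper names strip_eol and split_lines are hypothetical. Note that a sequence like LF CR LF is ambiguous, and this sketch simply takes the first alternative that matches.

```perl
use strict;
use warnings;

# Hypothetical helper: remove ONE trailing line ending of any of the
# four kinds mentioned above (CR-LF, LF-CR, CR, LF).
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\015\012|\012\015|\015|\012)\z//;
    return $line;
}

# Hypothetical helper: split a whole buffer into lines, whatever mix of
# endings it uses. The two-character endings are tried first so that
# CR-LF is not counted as two line breaks.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\012\015|\015|\012/, $text;
}

# Usage: slurp with the :raw layer so no platform translation interferes.
# my $text  = do { open my $fh, '<:raw', $fn or die "$fn: $!"; local $/; <$fh> };
# my @lines = split_lines($text);
```

Explicit octal escapes are used instead of "\r" and "\n" because, as the perlport quote above says, the meaning of "\n" depends on the platform.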

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^3: Regular expressions across multiple lines by afoken
in thread Regular expressions across multiple lines by abcd
