I want to share something that can cause some hair loss if you are new to coding Perl on Windows. It is pretty simple stuff, but it can really piss you off during a late-night coding session at 3:00 am when you are not thinking too straight.

Let's say you are dealing with data that you receive from a socket. You decide to write it to disk, either for another program to process or just to view the data in a hex editor.

As an example we will download the PerlMonks home page.

    #!/usr/bin/perl -w
    use strict;
    use Socket;

    # Resolve the host and build the address for port 80
    my $proto = getprotobyname('tcp');
    my $host  = inet_aton('www.perlmonks.com') || die "no host: $!\n";
    my $paddr = sockaddr_in(80, $host);

    socket(SOCK, AF_INET, SOCK_STREAM, $proto) || die "Unable to create socket: $!\n";
    connect(SOCK, $paddr) || die "Unable to connect socket: $!\n";

    # Send a minimal HTTP/1.0 request and slurp the whole response
    my $get = "GET / HTTP/1.0\x0d\x0aUser-Agent: IE_SUCKS\x0d\x0a\x0d\x0a";
    syswrite(SOCK, $get, length($get));
    my $data = join '', (<SOCK>);

    # Write the response to disk
    open(OUTT, ">test1.dat") or die "Couldn't open file: $!\n";
    syswrite(OUTT, $data, length($data));
    close(OUTT);
    close(SOCK);

Now, being smart, you have read the perlipc documentation, which specifies that the internet line terminator is \015\012. So you expect each line of the data to be separated by these line terminators.
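As a small aside, the terminator does not have to be typed out by hand. The sketch below assumes the `:crlf` export tag of the core Socket module, which provides `$CR`, `$LF` and `$CRLF`; the request text and the User-Agent value are only illustrative.

```perl
use strict;
use warnings;
use Socket qw(:crlf);    # exports $CR, $LF and $CRLF

# Build an HTTP/1.0 request whose lines end in the internet
# line terminator; the two trailing empty strings produce the
# blank line that ends the header.
my $request = join $CRLF, 'GET / HTTP/1.0', 'User-Agent: example', '', '';

print "CRLF bytes: ", unpack('H*', $CRLF), "\n";    # 0d0a
print "request length: ", length($request), "\n";
```

Using `$CRLF` rather than "\r\n" avoids relying on what "\n" happens to mean on the local platform.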

Now let's open test1.dat in UltraEdit32. A window pops up with the HTML of the page. Switch to HexEdit mode and look at the data, and you will notice that each line is terminated with \015\012. Everything looks good. So the following code should print each line of the header with single quotes encasing the text, followed by the first five lines of the body.

    open(IN, "test1.dat") or die "Couldn't open file: $!\n";
    my $data = join '', (<IN>);
    close(IN);

    # Split the response at the blank line between header and body
    my ($header, $body) = split /\015\012\015\012/, $data;

    # Print every header line, then the first five body lines, in quotes
    map { print "'$_'\n"; } split /\015\012/, $header;
    print "\n\n";
    map { print "'$_'\n"; } (split /\015\012/, $body)[0..4];

When run, everything displays correctly, but 7 lines of the body appear to be displayed. Looking closer, you realize that the whole head is surrounded by one set of single quotes. Looking back at the hex data in UltraEdit32, you notice that the hex bytes 0D 0A separate the elements 1. <head> 2. <TITLE> ..... </TITLE> 3. </head>, so these should each be treated as separate lines. They are printed on separate lines but were passed to the map function as one line... Obviously something is getting messed up in the treatment of \r\n.

    'HTTP/1.1 200 OK'
    'Date: Tue, 12 Dec 2000 05:42:11 GMT'
    'Server: Apache/1.3.9 (Unix) mod_perl/1.21'
    'Connection: close'
    'Content-Type: text/html'


    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"'
    ' "http://www.w3.org/TR/REC-html40/loose.dtd"> '
    '<HTML>'
    '<HEAD>
    <TITLE>The Monastery Gates</TITLE>
    </HEAD>'
    '<BODY text="#000000" bgcolor="#FFFFFF" link="#000066" vlink="#333399">'

Now, as an experiment, let's see what happens if test1.dat is transferred to a Linux box and the same code is run there. The output is the whole file being treated as the header.. WHAT??

Something's going on. So I don't trust UltraEdit32's hex edit feature.. I download Hex Workshop and load up the file.. Now it appears that each line is separated by \015\015\012. Huh?? Where did the extra \015 come from??

Well... The file was opened for writing as a text file. On Windows, a text-mode write leaves CR (\015) alone and writes LF (\012) as \015\012. So the internet line terminator \015\012 is written as \015\015\012. The code seemed to work correctly on Windows because a text-mode read converts \015\012 back to \012, so the \015\015\012 on disk becomes \015\012 again. On Linux there is no such conversion, so the data never contains \015\012\015\012 and the split finds no boundary between header and body.

From the data viewed within UltraEdit32, it is obvious that it opens files in text mode even when viewing as hex, so what you are seeing is not necessarily what is on disk... What is interesting is how the single \012 between the elements in <HEAD>\012<TITLE>...</TITLE>\012</HEAD> showed up as 0D 0A in UltraEdit, while Perl's read still saw it as a single line feed.
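The effect can be reproduced on any platform with a rough sketch like the one below. It assumes that explicitly requesting PerlIO's `:crlf` layer is a fair stand-in for a Windows text-mode filehandle, and it uses `print` on a temporary file with a made-up three-line response rather than the original script's `syswrite` to test1.dat.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A tiny stand-in for the HTTP response: header, blank line, body.
my $response = "HTTP/1.1 200 OK\015\012\015\012<HTML>\015\012";

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write through :crlf, imitating a Windows text-mode handle.
open my $out, '>:crlf', $path or die "Couldn't open $path: $!\n";
print {$out} $response;
close $out or die "Couldn't close $path: $!\n";

# Read the bytes exactly as stored (what Linux, or binmode, sees).
open my $raw, '<:raw', $path or die "Couldn't open $path: $!\n";
my $on_disk = do { local $/; <$raw> };
close $raw;

# Read through :crlf again (what the Windows text-mode read saw).
open my $txt, '<:crlf', $path or die "Couldn't open $path: $!\n";
my $as_text = do { local $/; <$txt> };
close $txt;

my @raw_parts  = split /\015\012\015\012/, $on_disk;
my @text_parts = split /\015\012\015\012/, $as_text;

print "last bytes on disk:   ", unpack('H*', substr($on_disk, -3)), "\n";  # 0d0d0a
print "last bytes text-mode: ", unpack('H*', substr($as_text, -2)), "\n";  # 0d0a
print "parts from raw read:  ", scalar(@raw_parts), "\n";                  # 1
print "parts from text read: ", scalar(@text_parts), "\n";                 # 2
```

The raw read yields a single part because the doubled CR breaks up the \015\012\015\012 sequence, which matches the Linux result described above.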

So what do you do? If you use binmode on the filehandle before reading and writing, the data will be treated as binary and no conversions will be done on the CRs and LFs. Data will be written as you expect and will be treated the same when opened again on a non-Windows machine.
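A minimal sketch of that fix follows, assuming a temporary file and a made-up two-line payload in place of test1.dat and the real socket data. With binmode on both handles, the bytes read back should match the bytes written on Windows and elsewhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample data using the internet line terminator.
my $data = "Line one\015\012Line two\015\012";

# Write with binmode so no CR/LF translation takes place.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $data;
close $out or die "Couldn't close $path: $!\n";

# Read back with binmode as well.
open my $in, '<', $path or die "Couldn't open $path: $!\n";
binmode $in;
my $read_back = do { local $/; <$in> };
close $in;

print $read_back eq $data ? "round trip intact\n" : "bytes changed\n";
print "size on disk: ", -s $path, " bytes\n";
```

In the original script the same idea would mean calling binmode on OUTT right after the open, and on IN in the reader.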

MORAL OF THE STORY

1. UltraEdit's Hex mode sucks.... Use Hex Workshop.
2. If you are on a Windows machine, use binmode on your filehandles.
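If no trustworthy hex editor is at hand, Perl itself can show what is on disk. This is only a sketch: `hex_dump` and `slurp_raw` are hypothetical helper names, and the 16-bytes-per-row layout is an arbitrary choice.

```perl
use strict;
use warnings;

# Return one line of text per 16 bytes: offset, then the bytes in hex.
sub hex_dump {
    my ($bytes) = @_;
    my @lines;
    for (my $off = 0; $off < length $bytes; $off += 16) {
        my $chunk = substr $bytes, $off, 16;
        my $hex   = join ' ', map { sprintf '%02X', ord } split //, $chunk;
        push @lines, sprintf '%08X  %s', $off, $hex;
    }
    return @lines;
}

# Read a whole file without any CR/LF translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<', $path or die "Couldn't open $path: $!\n";
    binmode $fh;
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return $bytes;
}

# Dump the file named on the command line, or a small sample string.
my $bytes = @ARGV ? slurp_raw($ARGV[0]) : "OK\015\015\012";
print "$_\n" for hex_dump($bytes);
```

Run without arguments, it dumps the sample string, which ends in the same 0D 0D 0A sequence found in test1.dat.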


In reply to Windows CRLF confusion.. by zzspectrez
