line endings in remote documents

Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Hey.

I've been playing with an update feature on one of my scripts. It fetches a raw text file from a website, which is a newline-delimited list of other resources my program can use, and I want to store that list locally. Thing is, my program will already know about some of the URLs in the list. It builds a hash at startup called %resources, with the keys being the locations it knows about and the values being 1, so that I can just look a URL up to see whether I already know about it. So what I want to do is get the file, parse it, filter out the stuff I know about, and then append the new stuff to the end of the local file. Easy enough, thought I. I started off with this:

my $repository = $switches{u} =~ /^http:\/\// ? $switches{u} : 'http://www.myserver.com/mysite/update.txt';
my $raw = get_object($repository, 'pronbot_update_tgps');
my @lines = split /\n/, $raw;
my @new;
!$resources{$_} && push(@new, $_) foreach (@lines);

Then, I thought, @new would contain all the unknown URLs, and it would then be trivial to join them with newline and write them to the file. I've stumbled across a problem, though. When I've made @lines, all its elements seem to have the string '\cM' at the end of them. I'm running my code on Windows.

I think this must be an OS line-ending problem. I thought that "\n" adapted to that, though, and that split would remove these when I split on that pattern. Apparently not, and I haven't got a clue how to do it apart from looping through and removing the literal pattern, and I figure there must be a better way than this. What if someone who uses it wants to change the repository via -u to a server which uses different line endings? That may well mess up the code if I just remove the pattern.

Anyone able to enlighten me about this baffling problem?

--
my one true love


Replies are listed 'Best First'.
Re: line endings in remote documents by Juerd (Abbot) on Dec 20, 2001 at 20:51 UTC
To substitute Macintosh (\cM), DOS (\cM\cJ) and Unix (\cJ) line endings, use `s/\r\n?|\n/(something)/g`. If you just want to remove them, `tr/\cM\cJ//d` is a lot faster. `2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$`
(tye)Re2: line endings in remote documents by tye (Sage) on Dec 20, 2001 at 23:55 UTC
The conversion, under Windows, from "\r\n" to "\n" happens (only!) when reading from a file that isn't in binmode. It sounds like you are reading these file contents over a socket, and sockets are always in binmode (since you don't want to assume that the system on the other end of that socket is using the same line endings as you). So if the remote system writes "\r\n" into the socket, then Perl will read "\r\n" from that socket. If it weren't for MacOS's bad design decisions, then this would be fairly easy to deal with. If you don't mind blank lines being ignored, then you can split on `/[\r\n]+/` to work around it. If you don't ever intend to run your code on a Mac, then you can split on `/\r*\n|\r/` to handle a wide variety of cases (unfortunately, finding a line ending of "\r\r\n" isn't that hard to do). If you don't even intend to run your code on a non-ASCII system, then you can split on `/\cM*\cJ|\cM/` and be happy even if your code is run on a Mac. - tye (but my friends call me "Tye")
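A minimal sketch of the whole filtering step, assuming the list has already been fetched into a string. The sample URLs and the contents of %resources are made up for illustration, and the octal escapes are used so the pattern means the same bytes on any platform:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the downloaded list; it mixes
# DOS (CR LF), Unix (LF) and old Mac (CR) line endings on purpose.
my $raw = "http://a.example/\015\012http://b.example/\012"
        . "http://c.example/\015http://d.example/";

# Hypothetical stand-in for the %resources hash built at startup.
my %resources = ('http://b.example/' => 1);

# CR LF is tried first so that a DOS line ending counts as one separator.
my @lines = split /\015\012|\015|\012/, $raw;

# Keep only non-empty entries that are not already known.
my @new = grep { length($_) && !$resources{$_} } @lines;

print "$_\n" for @new;
```

Here @new ends up holding the a, c and d URLs, whichever line ending followed them in the input.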
Re: line endings in remote documents by ehdonhon (Curate) on Dec 20, 2001 at 20:49 UTC
Yeah, MS platforms add a carriage return and a newline to the end of every line. You could try chomp()'ing all of the contents of @new. I'm not sure if chomp will get rid of the carriage returns or not. On a design note, it appears that your algorithm assumes that once a resource has been provided, it never goes away. By that, I mean that since you are always appending, you can't remove a resource by removing it from your website. That may not be an issue for you, but I thought I'd point it out.
Re: Re: line endings in remote documents by belg4mit (Prior) on Dec 20, 2001 at 20:54 UTC
chomp will remove $/. Which will be set to \n\r by default on windoze. For cross-platformness try something like: `split /(?:\r\n|\n|\r)/` (with \r\n listed first, so a Microsoft line ending is not split twice). That should cover all that I know of (Unix \n, Microsoft \r\n and Macintosh \r). UPDATE: Fixed MS line endings sigh Je suis tired pantalons `-- perl -pe "s/\b;([st])/'\1/mg"`
Re: Re: Re: line endings in remote documents by Juerd (Abbot) on Dec 20, 2001 at 21:17 UTC
The other way around :)

Platform  Name  Octal     Hex       Escape
Mac       CR    \015      \x0D      \r
DOS       CRLF  \015\012  \x0D\x0A  \r\n
*Nix      LF    \012      \x0A      \n

(Assuming \r is chr(13) and \n is chr(10), which isn't always true.) The regex to substitute them all would be `s/\cM\cJ|\cM|\cJ/$foo/g`, which can be simplified to `s/\cM\cJ?|\cJ/$foo/g`. But if you don't need to substitute, removing can be done a lot faster by just using `tr/\cM\cJ//d` (the /d makes tr/// delete matched characters that have no counterpart in the replacement list, and the replacement list is empty in this example).
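To make the difference between the two concrete, here is a small sketch run on a made-up string that mixes all three line endings. The variable names are only for illustration:

```perl
use strict;
use warnings;

# Hypothetical input mixing DOS, Unix and old Mac line endings.
my $text = "one\cM\cJtwo\cJthree\cMfour";

# Substitute: turn every line ending into a single LF.
(my $normalised = $text) =~ s/\cM\cJ?|\cJ/\cJ/g;

# Remove: delete every CR and LF outright. tr/// returns the number
# of characters it matched, which here is the number it deleted.
my $stripped = $text;
my $removed  = ($stripped =~ tr/\cM\cJ//d);

print "$normalised\n";
print "$stripped ($removed characters removed)\n";
```

Note that the tr/// form glues the lines together, so it only suits data that has already been split into single lines.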
Re: Re: Re: Re: line endings in remote documents by belg4mit (Prior) on Dec 20, 2001 at 21:19 UTC
Re: Re: Re: Re: Re: line endings in remote documents by Juerd (Abbot) on Dec 20, 2001 at 21:50 UTC
(tye)Re: line endings in remote documents by tye (Sage) on Dec 20, 2001 at 23:36 UTC
No, no, no!! $/ will be "\n" by default on Windows, just like it is (nearly?) everywhere else! - tye (but my friends call me "Tye")
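A quick way to see what that means for chomp, sketched with a made-up line as it might arrive over a socket (the URL and variable names are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical line as received from a server that sends CR LF.
my $line = "http://a.example/\r\n";

# With $/ left at its default of "\n", chomp removes only the LF.
chomp(my $chomped = $line);

# Stripping an optional CR together with the LF handles both cases.
(my $cleaned = $line) =~ s/\r?\n\z//;

print "chomp left a trailing CR\n" if $chomped =~ /\r\z/;
print "cleaned: $cleaned\n";
```

So chomp alone is not enough for data that was written with DOS line endings and read without translation.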
Re: (tye)Re: line endings in remote documents by Juerd (Abbot) on Dec 20, 2001 at 23:45 UTC
(tye)Re3: line endings in remote documents by tye (Sage) on Dec 21, 2001 at 00:03 UTC
Re: (tye)Re: line endings in remote documents by belg4mit (Prior) on Dec 21, 2001 at 03:21 UTC
Re: Re: Re: line endings in remote documents by premchai21 (Curate) on Dec 20, 2001 at 21:17 UTC
Or, if you're sure no \ns or \rs will appear in the middle of lines, you can do it even more simply: `split /[\n\r]+/, ...`
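One thing to keep in mind with the character-class version is that it also swallows blank lines. A small sketch on made-up input, comparing it with an alternation that treats each line ending separately:

```perl
use strict;
use warnings;

# Hypothetical input with an empty line between two entries.
my $raw = "first\r\n\r\nthird\r\n";

# A run of CRs and LFs counts as one separator, so the blank line vanishes.
my @collapsed = split /[\n\r]+/, $raw;

# Each line ending is its own separator, so the blank line survives
# as an empty string (split still drops trailing empty fields).
my @preserved = split /\r\n|\r|\n/, $raw;

printf "%d fields collapsed, %d fields preserved\n",
    scalar(@collapsed), scalar(@preserved);
```

For a list of URLs the collapsing behaviour is probably what you want anyway, since empty entries are of no use.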