Remove new line characters

simatics has asked for the wisdom of the Perl Monks concerning the following question:

Re: Remove new line characters by graff (Chancellor) on Apr 17, 2007 at 04:31 UTC
Newline characters would not appear as "boxes" in any display tool. The boxes represent some other character (probably outside the ASCII range) for which the display tool's current font has no glyph. So the question is: what are these extra characters in the HTML data, which are neither newlines nor displayable characters? Here's a way to find out:

`$line = $array_value;  # but where does $array_value come from?`
`$line =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;`
`print $line;`

Assuming that your $array_value has not been flagged as containing utf8 character data, the substitution above will replace all "invisible" byte values (including those between 128 and 255) with their hexadecimal numerics (e.g. linefeed will show up as "\x0a", carriage-return as "\x0d", "delete" as "\x7f", non-breaking space as "\xa0" and so on). If the string does contain utf8 characters (and perl has flagged it as such), it should still work, but some of the hexadecimal values may have 3 or more digits.

Once you know what sorts of characters you're dealing with, you'll have a better idea of how to handle them.
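For anyone who wants to try the substitution in isolation, here is a minimal sketch. The helper name `show_invisible` and the sample string are made up for illustration; only the regex comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable ASCII
# (0x20-0x7e) with a visible \xNN escape, using the substitution above.
sub show_invisible {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf("\\x%02x", ord($1))/eg;
    return $text;
}

# Made-up sample: text with a CRLF in the middle and a trailing
# non-breaking space (0xa0).
my $sample = "cell one\r\ncell two\xa0";
print show_invisible($sample), "\n";
# prints: cell one\x0d\x0acell two\xa0
```

Running this on a suspect string should make it obvious whether the "boxes" are CR, LF, or something else entirely.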
Re^2: Remove new line characters by cdarke (Prior) on Apr 17, 2007 at 07:18 UTC
notepad.exe displays new-line characters as 'boxes' if not prefixed by '\r'. Although it could be argued that it hardly qualifies as a 'display tool'. wordpad.exe can handle this format. I would have thought that the simplest (though not necessarily most efficient) way of getting rid of new-lines at the end of text was: `() while chomp $line;`
Re^3: Remove new line characters by graff (Chancellor) on Apr 17, 2007 at 14:51 UTC
`() while chomp $line;`

That would only work if $/ happens to match whatever is at the end of the line. If it really is just a matter of one or more "\n" with no preceding "\r", then you would do:

`s/(?<!\r)\n+//g;  # should probably leave "\r\n" as-is`

which would work no matter what $/ happens to be.

As for notepad.exe displaying a box to represent "\n" when it is not preceded by "\r", that's an interesting point that I was not aware of. But I would still expect that it also uses the box to represent other code points for which the current font does not have a defined glyph -- in other words, the box is ambiguous: there might be a variety of different byte values (code points) that would cause it to appear.

And yes, notepad.exe is a display tool, in the sense that it makes data visible to the user. But since it's also a text editor, it makes certain assumptions about what the user is interested in seeing. (And the same goes for wordpad.exe)
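A small sketch of the difference between the two approaches, under the assumption that $/ has been set to something other than "\n" (here "\r\n", chosen only for illustration). The helper name `strip_bare_lf` and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: delete runs of LF that are not preceded by CR,
# using the lookbehind regex from the post above.
sub strip_bare_lf {
    my ($text) = @_;
    $text =~ s/(?<!\r)\n+//g;
    return $text;
}

my $line = "some text\n\n";    # made-up sample ending in bare LFs

{
    local $/ = "\r\n";         # assumption: the record separator is CRLF
    my $copy = $line;
    1 while chomp $copy;       # removes nothing: the string does not end in "\r\n"
    print length($copy) == length($line)
        ? "chomp left the string alone\n"
        : "chomp changed the string\n";
}

print strip_bare_lf($line) eq "some text"
    ? "regex removed the bare LFs\n"
    : "regex missed them\n";
```

With the default $/ of "\n" the chomp loop would strip both trailing newlines, so the two approaches only diverge when $/ has been changed.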
Re^2: Remove new line characters by Anonymous Monk on Apr 17, 2007 at 06:20 UTC
If you really don't care what the extra characters are, you could use: `tr/\n\t -~//cd;` I like graff's solution better though.
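To make the effect of that `tr` concrete, here is a minimal sketch; the helper name `keep_printable` and the sample string are invented for illustration. Note that newline and tab are on the "keep" list, so this form removes CR and high bytes but leaves "\n" in place.

```perl
use strict;
use warnings;

# Hypothetical helper: with /c (complement) and /d (delete), tr removes
# every character that is NOT newline, tab, or printable ASCII (space to ~).
sub keep_printable {
    my ($text) = @_;
    $text =~ tr/\n\t -~//cd;
    return $text;
}

# Made-up sample containing CR, LF, a non-breaking space and a tab.
my $sample  = "a\r\nb\xa0\tc";
my $cleaned = keep_printable($sample);
print $cleaned eq "a\nb\tc" ? "CR and 0xa0 removed, LF and tab kept\n" : "unexpected result\n";
```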
Re^3: Remove new line characters by simatics (Initiate) on Apr 18, 2007 at 05:27 UTC
Hi, I tried this, but to no avail. Thanks for the suggestion though!
Re^2: Remove new line characters by simatics (Initiate) on Apr 18, 2007 at 05:34 UTC
This did the trick! I used your code to figure out what the 'square' character was: it was represented by \x0d and \x0a. Then I deleted them using: `$line =~ tr/\x0d//d; $line =~ tr/\x0a//d;` Thanks for all the help everyone!
Re^3: Remove new line characters by graff (Chancellor) on Apr 18, 2007 at 05:37 UTC
You can simplify that: `$line =~ tr/\x0a\x0d//d;`
Re: Remove new line characters by swampyankee (Parson) on Apr 17, 2007 at 04:05 UTC
Well, question one would be "How are you reading the file?" Please include the code which shows how you're reading the file.

emc

Insisting on perfect safety is for people who don't have the balls to live in the real world. -- Mary Shafer, NASA Dryden Flight Research Center
Re^2: Remove new line characters by simatics (Initiate) on Apr 18, 2007 at 05:23 UTC
This is the code I use. The HTML file has the data in three tables, so I simply push everything between TD tags into the array.

`$td_switch=0; open(DAT, $data_file) || die("Could not open file!"); while (<DAT>){ if ($td_switch){ if (m/.*?<\/td>.*$/i){ ($extract_rest, $_) = /(.*?)<\/td>(.*)$/i; $td_switch = 0; $extract .= "\n" . $extract_rest; push (@parsed_data, $extract); #print "$_\n\n"; }else{ ($extract_rest) = /(.*?)$/i; $extract .= "\n" . $extract_rest; } } while (m/<td.*?>.*?<\/td>/i){ ($extract, $_) = /.*?<td.*?>(.*?)<\/td>(.*)$/i; push (@parsed_data, $extract); #print "$_\n\n"; } if (m/<td.*?>/i){ ($extract) = /.*<td.*?>(.*)$/i; $_ = ""; $td_switch = 1; #print "$_\n\n"; } } close(DAT);`
Re^3: Remove new line characters by graff (Chancellor) on Apr 18, 2007 at 06:11 UTC
This explains a few things that were not clear in your original question. You never mentioned what OS this is running on, nor what sort of tool you were using when you saw "boxes".

As to the first point, I would expect you were using Unix or Linux; the data, having come from the web, presumably has "CRLF" ("\r\n", aka "\x0d\x0a") line termination. But in the script you posted, the "CR" character does not get removed on input (this would only happen if the perl script were running on a Windows machine). Then, your use of "." (period) in the various regexes causes the CR to be included in the various strings that are captured and assigned to variables (period matches everything except "LF" = "\n" = "\x0a", so it matches CR).

It was actually those residual CR characters that were showing up as boxes in your display. Some Unix tools for viewing text data will do this, because if CR is rendered "literally", the resulting display can be misleading -- esp. if there are additional characters "on the same line" following the CR (i.e. between the CR and the next LF).

Try running this one-liner in a normal terminal window, and see what the output looks like. Then run it again and redirect the output to a file, and view that file using whatever tool was displaying boxes in your other data. That should help you understand.

`perl -e 'print " passed the test\r failed \n"'`
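The point about "." capturing the CR can be checked with a short sketch. The helper name `first_cell`, the simplified regex, and the sample line are assumptions made for illustration, not the original poster's code; stripping the line ending with `s/\r?\n\z//` before matching is one possible remedy, not the only one.

```perl
use strict;
use warnings;

# Hypothetical helper: capture everything after the first <td ...> tag
# up to the end of the line, roughly like the regexes in the script above.
sub first_cell {
    my ($line) = @_;
    my ($cell) = $line =~ /<td.*?>(.*)$/i;
    return $cell;
}

my $raw = "<td>value\r\n";    # made-up CRLF-terminated input line

# "." matches CR but not LF, so the capture keeps the trailing "\r".
my $dirty = first_cell($raw);

# Removing an optional CR plus the LF first avoids the stray character.
(my $stripped = $raw) =~ s/\r?\n\z//;
my $clean = first_cell($stripped);

printf "without stripping: %d chars, with stripping: %d chars\n",
    length($dirty), length($clean);
# prints: without stripping: 6 chars, with stripping: 5 chars
```

Doing the cleanup once, right after reading each line, means none of the later captures need a separate `tr` pass.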