in reply to Remove new line characters

Newline characters would not appear as "boxes" in any display tool. The boxes represent some other character (probably outside the ASCII range) for which the display tool's current font has no glyph.

So the question is, what are these extra characters in the html data, which are not newlines and are not displayable characters? Here's a way to find out:

$line = $array_value;   # but where does $array_value come from?
$line =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;
print $line;
Assuming that your $array_value has not been flagged as containing utf8 character data, the substitution above will replace every byte outside the printable ASCII range (including those between 128 and 255) with its hexadecimal escape (e.g. linefeed will show up as "\x0a", carriage-return as "\x0d", "delete" as "\x7f", non-breaking space as "\xa0" and so on).
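
To see what that output looks like, here is a minimal sketch of the same substitution wrapped in a sub. The name show_invisibles and the sample string are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string in which every
# character outside printable ASCII (0x20-0x7e) is shown as \xNN.
sub show_invisibles {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;
    return $text;
}

# Made-up sample: a CRLF pair, a non-breaking space and a "delete" byte.
my $sample = "one\r\ntwo\xa0three\x7f";
print show_invisibles($sample), "\n";   # prints: one\x0d\x0atwo\xa0three\x7f
```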

If the string does contain utf8 characters (and perl has flagged it as such), it should still work, but some of the hexadecimal values may be 3- or 4-digit numbers.
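
As a rough illustration of the wide-character case, assuming a string that perl holds as characters rather than bytes (the sub name and the sample text are invented for this sketch):

```perl
use strict;
use warnings;

# Hypothetical helper: same substitution as above, applied to a copy.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;
    return $text;
}

# Made-up sample: e-acute (U+00E9) and a smiley (U+263A).
my $wide = "caf\x{e9} \x{263a}";
print escape_non_ascii($wide), "\n";   # prints: caf\xe9 \x263a
```

The "%02x" format only sets a minimum width, so code points above 255 simply print with as many hex digits as they need.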

Once you know what sorts of characters you're dealing with, you'll have a better idea of how to handle them.

Re^2: Remove new line characters
by cdarke (Prior) on Apr 17, 2007 at 07:18 UTC
    notepad.exe displays new-line characters as 'boxes' if not prefixed by '\r'. Although it could be argued that it hardly qualifies as a 'display tool'. wordpad.exe can handle this format.
    I would have thought that the simplest (though not necessarily most efficient) way of getting rid of new-lines at the end of text was:
    () while chomp $line;

      That would only work if $/ happens to match whatever is at the end of the line. If it really is just a matter of one or more "\n" with no preceding "\r", then you would do:

      s/(?<!\r)\n+//g; # should probably leave "\r\n" as-is
      which would work no matter what $/ happens to be.
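
A small sketch of the difference, assuming for the sake of the example that $/ has been set to "\r\n" while the data ends in bare linefeeds. The sub name strip_bare_newlines and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution shown above.
sub strip_bare_newlines {
    my ($text) = @_;
    $text =~ s/(?<!\r)\n+//g;   # leaves "\r\n" pairs alone
    return $text;
}

my $line = "data\n\n";          # made-up sample ending in two bare linefeeds
{
    local $/ = "\r\n";          # pretend the record separator is CRLF
    () while chomp $line;       # removes nothing: $line does not end in "\r\n"
}
print length($line), "\n";                          # prints: 6
print length( strip_bare_newlines($line) ), "\n";   # prints: 4
```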

      As for notepad.exe displaying a box to represent "\n" when it is not preceded by "\r", that's an interesting point that I was not aware of. But I would still expect that it also uses the box to represent other code points for which the current font does not have a defined glyph -- in other words, the box is ambiguous: there might be a variety of different byte values (code points) that would cause it to appear.

      And yes, notepad.exe is a display tool, in the sense that it makes data visible to the user. But since it's also a text editor, it makes certain assumptions about what the user is interested in seeing. (And the same goes for wordpad.exe)

Re^2: Remove new line characters
by Anonymous Monk on Apr 17, 2007 at 06:20 UTC
    If you really don't care what the extra characters are, you could use: tr/\n\t -~//cd; I like graff's solution better, though.
      Hi, I tried this, but to no avail. Thanks for the suggestion though!
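
For reference, a minimal sketch of what that tr does, using a made-up sample string and a hypothetical sub name. With /c and /d it deletes everything that is not a linefeed, a tab or printable ASCII, so linefeeds themselves are kept. That is possibly why it did not help here, though that is only a guess.

```perl
use strict;
use warnings;

# Hypothetical helper: delete everything except "\n", "\t" and
# the printable ASCII range from space to tilde.
sub keep_printable {
    my ($text) = @_;
    $text =~ tr/\n\t -~//cd;
    return $text;
}

my $cleaned = keep_printable("a\r\nb\xa0\tc");
print $cleaned eq "a\nb\tc" ? "CR and NBSP removed, LF and tab kept\n"
                            : "unexpected result\n";
```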
Re^2: Remove new line characters
by simatics (Initiate) on Apr 18, 2007 at 05:34 UTC
    This did the trick!

    I used your code to figure out what the 'square' character was - it was represented by \x0d and \x0a.

    Then I deleted them using:
    $line =~ tr/\x0d//d; $line =~ tr/\x0a//d;
    Thanks for all the help everyone!
      You can simplify that:
      $line =~ tr/\x0a\x0d//d;
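
A short sketch of the combined form on made-up data. tr also returns the number of characters it matched, which can be handy as a sanity check.

```perl
use strict;
use warnings;

my $line = "first\x0d\x0asecond\x0a";          # made-up sample
my $removed = ( $line =~ tr/\x0a\x0d//d );     # delete every LF and CR in one pass
print "$removed removed: $line\n";             # prints: 3 removed: firstsecond
```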