simatics has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am reading data from an HTML file, and putting this data into an array.

One of the array values has three 'new line chars' that look like a 'square' in the HTML file itself.

It's an address field, so the value looks like this in the file:

Address Line 1<square>
Address Line 2<square>
Address Line 3

When printed in the perl program, the array field looks like this:

Address Line 1
Address Line 2
Address Line 3

I would like to extract the data, but it is proving difficult.
Here is the code I am using:

$line = $array_value;
$_ = $line;
/^(.*?)\n/;
print $1.$2.$3."\n";

When I run the above script, I get "Address Line 1" printed. When I replace the regex with /^(.*?)\n(.*?)\n/ in an attempt to grab the second line, NOTHING is printed!


I also tried using the following code to remove the new line characters, to no avail:


$line =~ tr/\015//;
$line =~ tr/\n//;
$line =~ tr/\r//;

Any help would be greatly appreciated.


Thanks!

Re: Remove new line characters
by graff (Chancellor) on Apr 17, 2007 at 04:31 UTC
    Newline characters would not appear as "boxes" in any display tool. The boxes represent some other character (probably outside the ASCII range) for which the display tool's current font has no glyph.

    So the question is, what are these extra characters in the html data, which are not newlines and are not displayable characters? Here's a way to find out:

    $line = $array_value;  # but where does $array_value come from?
    $line =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;
    print $line;
    Assuming that your $array_value has not been flagged as containing utf8 character data, the substitution above will replace all "invisible" byte values (including those between 128 and 255) with their hexadecimal numerics (e.g. linefeed will show up as "\x0a", carriage-return as "\x0d", "delete" as "\x7f", non-breaking space as "\xa0" and so on).

    If the string does contain utf8 characters (and perl has flagged it as such), it should still work, but some of the hexadecimal values may be 3- or 4-digit numbers.

    Once you know what sorts of characters you're dealing with, you'll have a better idea of how to handle them.
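
For anyone who wants to try the substitution above in isolation, here is a minimal sketch. The sub name `show_invisible` and the sample string are made up for illustration; the sample just assumes CRLF line endings between the address lines.

```perl
use strict;
use warnings;

# Replace every character outside printable ASCII (0x20-0x7e)
# with a visible \xNN escape, as in the substitution above.
sub show_invisible {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf( "\\x%02x", ord( $1 ))/eg;
    return $text;
}

# Hypothetical sample value, assumed to contain CRLF line endings.
my $sample = "Address Line 1\r\nAddress Line 2\r\nAddress Line 3";
print show_invisible($sample), "\n";
# prints: Address Line 1\x0d\x0aAddress Line 2\x0d\x0aAddress Line 3
```

Working on a copy inside the sub leaves the original string untouched, so the same value can still be cleaned up afterwards.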

      notepad.exe displays new-line characters as 'boxes' if they are not prefixed by '\r', although it could be argued that it hardly qualifies as a 'display tool'. wordpad.exe can handle this format.
      I would have thought that the simplest (though not necessarily most efficient) way of getting rid of new-lines at the end of text was:
      () while chomp $line;

        That would only work if $/ happens to match whatever is at the end of the line. If it really is just a matter of one or more "\n" with no preceding "\r", then you would do:

        s/(?<!\r)\n+//g; # should probably leave "\r\n" as-is
        which would work no matter what $/ happens to be.
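
A rough sketch of the difference between the two approaches, in case it helps. The sub names `chomp_all` and `strip_bare_newlines` are invented here, and `chomp_all` takes the separator as a parameter only so the dependence on $/ is easy to see.

```perl
use strict;
use warnings;

# Repeated chomp: only strips trailing copies of whatever $/ currently is.
sub chomp_all {
    my ($text, $separator) = @_;
    local $/ = $separator;      # hypothetical parameter standing in for the global $/
    1 while chomp $text;
    return $text;
}

# The substitution: removes every "\n" not preceded by "\r", whatever $/ is.
sub strip_bare_newlines {
    my ($text) = @_;
    $text =~ s/(?<!\r)\n+//g;
    return $text;
}

print chomp_all("abc\n\n", "\n")   eq "abc"     ? "ok\n" : "not ok\n";  # $/ matches the ending
print chomp_all("abc\n\n", "\r\n") eq "abc\n\n" ? "ok\n" : "not ok\n";  # $/ does not match
print strip_bare_newlines("a\r\nb\n") eq "a\r\nb" ? "ok\n" : "not ok\n"; # "\r\n" left as-is
```

Note that the substitution also removes newlines in the middle of the string, not just at the end, which the chomp loop never does.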

        As for notepad.exe displaying a box to represent "\n" when it is not preceded by "\r", that's an interesting point that I was not aware of. But I would still expect that it also uses the box to represent other code points for which the current font does not have a defined glyph -- in other words, the box is ambiguous: there might be a variety of different byte values (code points) that would cause it to appear.

        And yes, notepad.exe is a display tool, in the sense that it makes data visible to the user. But since it's also a text editor, it makes certain assumptions about what the user is interested in seeing. (And the same goes for wordpad.exe)

      If you really don't care what the extra characters are, you could use: tr/\n\t -~//cd; (this deletes everything except newline, tab and printable ASCII). I like graff's solution better, though.
        Hi, I tried this, but to no avail. Thanks for the suggestion though!
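
For reference, a small sketch of what that tr/// actually keeps and removes. The sub name `keep_printable` and the sample string are only for illustration. Since "\n" is in the list of characters to keep, this cannot remove line feeds, which may be why it appeared not to help here.

```perl
use strict;
use warnings;

# /c complements the listed set (newline, tab, space through tilde),
# /d deletes everything in that complement.
sub keep_printable {
    my ($text) = @_;
    $text =~ tr/\n\t -~//cd;
    return $text;
}

# Hypothetical sample: CR and non-breaking space are dropped, LF and tab stay.
my $cleaned = keep_printable("Line 1\r\nLine 2\xa0\t!");
print $cleaned eq "Line 1\nLine 2\t!" ? "ok\n" : "not ok\n";
```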
      This did the trick!

      I used your code to figure out what the 'square' character was - it was represented by \x0d and \x0a.

      Then I deleted them using:
      $line =~ tr/\x0d//d;
      $line =~ tr/\x0a//d;
      Thanks for all the help everyone!
        You can simplify that:
        $line =~ tr/\x0a\x0d//d;
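
A short sketch of why the /d flag matters, assuming a sample string with CRLF between the address lines (the sub names are made up). Without /d and with an empty replacement list, tr/// only counts the matching characters and leaves the string alone, which is what happened with the tr lines in the original question.

```perl
use strict;
use warnings;

# No /d, empty replacement list: returns the count, string is not modified.
sub count_line_endings {
    my ($text) = @_;
    return ($text =~ tr/\x0a\x0d//);
}

# With /d: every CR and LF is deleted.
sub delete_line_endings {
    my ($text) = @_;
    $text =~ tr/\x0a\x0d//d;
    return $text;
}

my $line = "Address Line 1\r\nAddress Line 2\r\nAddress Line 3";
print count_line_endings($line), "\n";    # prints 4
print delete_line_endings($line), "\n";   # prints Address Line 1Address Line 2Address Line 3
```

The lines end up glued together with nothing between them, so if the three address lines are needed separately it may be easier to split on /\r?\n/ instead of deleting.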
Re: Remove new line characters
by swampyankee (Parson) on Apr 17, 2007 at 04:05 UTC

    Well, question one would be "How are you reading the file?" Please include the code which shows how you're reading the file.

    emc

    Insisting on perfect safety is for people who don't have the balls to live in the real world.

    —Mary Shafer, NASA Dryden Flight Research Center
      This is the code I use. The HTML file has the data in three tables, so I simply push everything between TD tags into the array.
      $td_switch=0;
      open(DAT, $data_file)|| die("Could not open file!");
      while (<DAT>){
          if ($td_switch){
              if (m/.*?<\/td>.*$/i){
                  ($extract_rest, $_) = /(.*?)<\/td>(.*)$/i;
                  $td_switch = 0;
                  $extract .= "\n" . $extract_rest;
                  push (@parsed_data, $extract);
                  #print "$_\n\n";
              }else{
                  ($extract_rest) = /(.*?)$/i;
                  $extract .= "\n" . $extract_rest;
              }
          }
          while (m/<td.*?>.*?<\/td>/i){
              ($extract, $_) = /.*?<td.*?>(.*?)<\/td>(.*)$/i;
              push (@parsed_data, $extract);
              #print "$_\n\n";
          }
          if (m/<td.*?>/i){
              ($extract) = /.*<td.*?>(.*)$/i;
              $_ = "";
              $td_switch = 1;
              #print "$_\n\n";
          }
      }
      close(DAT);
        This explains a few things that were not clear in your original question.

        You never mentioned what OS this is running on, nor what sort of tool you were using when you saw "boxes". As to the first point, I would expect you were using unix or linux; the data, having come from the web, presumably has "CRLF" ("\r\n", aka "\x0d\x0a") line termination. But in this script you posted, the "CR" character does not get removed on input (this would only happen if the perl script were running on a windows machine). Then, your use of "." (period) in the various regexes causes the CR to be included in the various strings that are captured and assigned to variables (period matches everything except "LF" = "\n" = "\x0a", so it matches CR).

        It was actually those residual CR characters that were showing up as boxes in your display. Some unix tools for viewing text data will do this, because if CR is rendered "literally", the resulting display can be misleading -- esp. if there are additional characters "on the same line" following the CR (i.e. between the CR and the next LF).

        Try running this one-liner in a normal terminal window, and see what the output looks like. Then run it again and redirect the output to a file, and view that file using whatever tool was displaying boxes in your other data. That should help you understand.

        perl -e 'print " passed the test\r failed \n"'
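
The point about "." capturing the CR can also be sketched directly. The strings below are stand-ins for what <DAT> would return on unix for a CRLF file, and both sub names are invented for this example.

```perl
use strict;
use warnings;

# "." matches "\r" but not "\n", so the CR ends up inside the capture.
sub capture_line {
    my ($row) = @_;
    my ($captured) = $row =~ /^(.*)$/;
    return $captured;
}

# Strip an optional CR plus the LF before matching, so neither is captured.
sub capture_line_clean {
    my ($row) = @_;
    $row =~ s/\r?\n\z//;
    my ($captured) = $row =~ /^(.*)$/;
    return $captured;
}

my $row = "Address Line 1\r\n";   # hypothetical line as read from a CRLF file
print length( capture_line($row) ), "\n";        # prints 15: the text plus the CR
print length( capture_line_clean($row) ), "\n";  # prints 14: the text only
```

Removing the line ending with s/\r?\n\z// right after reading each line should work the same way whether the file has LF or CRLF endings.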