markguy has asked for the wisdom of the Perl Monks concerning the following question:

s/\s+$// has failed me for the first time and I'll be damned if I can see why.

I'm slurping up a remote HTML file and parsing it. I've got lines that are basically being jammed into a db field, but they're carrying trailing spaces, which is bad. I'm doing all the usual tricks (chomp, strip trailing spaces, assume Windows format), but I can't get rid of them.

The only clue I have is when I view hidden characters in an editor, there are spaces at the end of the line... not something that's recognized by the editor as a space, mind you, just spaces with no character representation. Not a paragraph end, not a space character... just a blank space.

Has anyone run into this and/or have something else for me to try? List of what I'm running the lines through below for reference:

UPDATE: It was character 160 (0xA0, the non-breaking space, which is outside ASCII) that was causing the problem. I'm going to follow the suggestion to just strip out all control characters, I think. Appreciate the speedy suggestions and help.

Re: When is a trailing space not a trailing space?
by japhy (Canon) on Oct 17, 2005 at 14:56 UTC
    Run your HTML file through od or some other tool that displays the numerical value of each character. My guess (possibly wrong) is that it's the non-breaking space character, which is not matched by \s (and shouldn't be, in my opinion).

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

      It absolutely shouldn't match if it's not a space character... I've just never had so much trouble *seeing* what the character is and getting rid of it.

      It makes me wonder what else has snuck into tables I've been working with, all because the invisible character wasn't at the end where I regularly look to clean up.

      In any case, I'll give od a shot... thanks!

        It makes me wonder what else has snuck into tables I've been working with, all because the invisible character wasn't at the end where I regularly look to clean up.

        Rather than cutting out unwanted characters, you may want to only match what you know to be ok for the entire line. For example, if you want only printable non-whitespace characters, then match only that (see "perldoc perlre" for more details). Also, you may want to warn/die if you find characters in a line that you don't think should be there.
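
A minimal sketch of that "match only what you know is OK" idea, assuming printable ASCII (0x20 through 0x7E) is the allowed set. The helper name unexpected_chars and the sample lines are made up for illustration; a real allowed set would depend on the data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the code points in $line that fall outside
# printable ASCII (0x20 through 0x7E). An empty list means the line is clean.
sub unexpected_chars {
    my ($line) = @_;
    return map { ord($_) } $line =~ /([^\x20-\x7E])/g;
}

# Made-up sample lines: one clean, one ending in character 160 (0xA0).
for my $line ("plain text", "padded\xA0") {
    my @bad = unexpected_chars($line);
    if (@bad) {
        # Warn rather than silently stripping, so surprises get noticed.
        warn sprintf("unexpected character(s) in line: %s\n",
            join(', ', map { sprintf '0x%02X', $_ } @bad));
    }
}
```

Swapping warn for die would make a bad line stop the import instead of just being reported.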

Re: When is a trailing space not a trailing space?
by ikegami (Patriarch) on Oct 17, 2005 at 15:15 UTC

    To find out which character it is (assuming it's the last character in the line), do
    printf("%X\n", ord(substr($line, -1)));
    Use that number instead of ### to remove that character along with whitespace:
    $line =~ s/[\s\x{###}]+$//;
    Of course, you could remove all trailing control characters and whitespace:
    $line =~ s/[\s\x00-\x1F\x7F]+$//;
    There's also a POSIX "[:print:]" character class that could help you, I believe.

    By the way, \r is included in \s:

    $line = "\r\n";
    print(length($line), "\n");   # 2
    $line =~ s/\s+$//;            # \s+ removes both \r and \n
    print(length($line), "\n");   # 0
Re: When is a trailing space not a trailing space?
by Roy Johnson (Monsignor) on Oct 17, 2005 at 14:57 UTC
    Are you doing s/\s+$//mg? You say you're slurping the file and parsing it, which suggests that you're handling multiple lines at once, so you need to do multi-line matching.
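
A small sketch of the difference, using a made-up two-line string in place of the slurped file. It uses [^\S\n] ("whitespace other than newline") rather than plain \s, which is a variation on the substitution above chosen so that the line breaks themselves survive.

```perl
use strict;
use warnings;

# Made-up sample: two lines in one slurped string, each with trailing spaces.
my $slurped = "first  \nsecond \n";

# Without /m, $ anchors only at the end of the string (or just before a
# final newline), so only the whitespace after "second" is removed.
(my $end_only = $slurped) =~ s/\s+$//;

# With /mg, $ also matches before every newline, so each line is trimmed.
(my $per_line = $slurped) =~ s/[^\S\n]+$//mg;

print length($end_only), "\n";   # 14: "first  \nsecond"
print length($per_line), "\n";   # 13: "first\nsecond\n"
```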

    Alternatively, is it possible that your output routine is padding things?


    Caution: Contents may have been coded under pressure.

      It's a line-by-line parse. Each line is a field basically, although of course there are fun-filled exceptions to this "rule". For good measure, I was stripping each potential field just before displaying it. And the only output routine I have going at the moment is print, so I'm going to have to trust it's something in the file.

      Appreciate the help in any case!

Re: When is a trailing space not a trailing space?
by pajout (Curate) on Oct 17, 2005 at 14:56 UTC
    Could you open the HTML file in a hex editor, to see what is really at the ends of the lines?
      The editor I'm using (HTML-Kit) claims it will show any character that's there. Is there a particular editor that would... I don't know... *actually* show the character? :)
        The idea is not to show the graphic representation of the weird characters, but their codes (numbers, byte representations): for instance 13, 10 for a Windows end of line, 160 for a non-breaking space, etc. If you are on Linux, use 'od'.
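
Where od isn't handy, roughly the same view can be had from Perl itself. This is only a sketch; char_codes is a hypothetical helper name and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: renders every character of $line as a hex code,
# separated by spaces, so invisible characters have nowhere to hide.
sub char_codes {
    my ($line) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $line;
}

# Made-up sample: "ab", a non-breaking space, then a Windows line ending.
print char_codes("ab\xA0\r\n"), "\n";   # prints "61 62 A0 0D 0A"
```
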
Re: When is a trailing space not a trailing space?
by Your Mother (Archbishop) on Oct 17, 2005 at 19:58 UTC

    bluto sort of suggested this already. For human readable files (not binary/data/etc) this should be a good fix and can be used in other places too.

    s/[^[:print:]]+\z//;
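
One caveat worth hedging: an ordinary space counts as printable, so the substitution above stops at the first trailing space, and depending on whether Perl applies Unicode rules to the string, character 160 may itself count as printable. A sketch that sidesteps both by naming whitespace and \xA0 explicitly alongside the non-printable class (rtrim_all is a hypothetical helper name):

```perl
use strict;
use warnings;

# Hypothetical helper: strips any trailing run of whitespace, non-breaking
# spaces (0xA0), and non-printable characters from the end of the string.
sub rtrim_all {
    my ($line) = @_;
    $line =~ s/[\s\xA0[:^print:]]+\z//;
    return $line;
}

# Made-up sample: a value followed by space, NBSP, tab, and CR LF.
my $clean = rtrim_all("value \xA0\t\r\n");
print length($clean), "\n";   # 5
```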