When is a trailing space not a trailing space?

markguy has asked for the wisdom of the Perl Monks concerning the following question:

s/\s+$// has failed me for the first time and I'll be damned if I can see why.

I'm slurping up a remote HTML file and parsing it. I've got lines that are basically being jammed into a db field, but they're carrying trailing spaces, which is bad. I'm doing all the usual tricks (chomp, strip trailing spaces, assume Windows format), but I can't get rid of them.

The only clue I have is when I view hidden characters in an editor, there are spaces at the end of the line... not something that's recognized by the editor as a space, mind you, just spaces with no character representation. Not a paragraph end, not a space character... just a blank space.

Has anyone run into this and/or have something else for me to try? List of what I'm running the lines through below for reference:

chomp line
s/\s+$//
s/\r//
abbreviated form of the demoronizer, to catch goofy Windows symbols

UPDATE: It was character 160 (the non-breaking space, strictly outside the ASCII range) that was causing the problem. I'm going to follow the suggestion to just strip out all control characters, I think. Appreciate the speedy suggestions and help.


Replies are listed 'Best First'.
Re: When is a trailing space not a trailing space? by japhy (Canon) on Oct 17, 2005 at 14:56 UTC
Run your HTML file through od or some other tool that displays the numerical value of each character. My guess (possibly wrong) is that it's the non-breaking space character, which is not matched by `\s` (and shouldn't be, in my opinion).

Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
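If od is not at hand, the same check can be done from Perl itself. This is only a minimal sketch: `char_codes` is a hypothetical helper name, and the sample line is made up to end in character 160 like the one in this thread.

```perl
use strict;
use warnings;

# Return the numeric value of every character in a string,
# formatted as uppercase hex (hypothetical helper, not from the thread).
sub char_codes {
    my ($text) = @_;
    return map { sprintf('%02X', ord($_)) } split //, $text;
}

# Made-up sample: the word "price" followed by a non-breaking space (160 = A0).
my $sample = "price\xA0";
print join(' ', char_codes($sample)), "\n";   # prints "70 72 69 63 65 A0"
```

An ordinary space would show up as 20, so an A0 at the end of the dump points to the non-breaking space.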
Re^2: When is a trailing space not a trailing space? by markguy (Scribe) on Oct 17, 2005 at 15:03 UTC
It absolutely shouldn't match if it's not a space character... I've just never had so much trouble seeing what the character is and getting rid of it. It makes me wonder what else has snuck into tables I've been working with, all because the invisible character wasn't at the end where I regularly look to clean up. In any case, I'll give od a shot... thanks!
Re^3: When is a trailing space not a trailing space? by bluto (Curate) on Oct 17, 2005 at 16:40 UTC
"It makes me wonder what else has snuck into tables I've been working with, all because the invisible character wasn't at the end where I regularly look to clean up."

Rather than cutting out unwanted characters, you may want to match only what you know to be OK for the entire line. For example, if you want only printable non-whitespace characters, then match only that (see "perldoc perlre" for more details). Also, you may want to warn/die if you find characters in a line that you don't think should be there.
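A rough sketch of that allow-list idea follows. The accepted range (printable ASCII, space through tilde) and the helper name `unexpected_chars` are assumptions made for illustration; widen or narrow the range to suit the data.

```perl
use strict;
use warnings;

# Return the hex codes of all characters outside printable ASCII
# (\x20 through \x7E). The range is an assumption for this sketch.
sub unexpected_chars {
    my ($line) = @_;
    return map { sprintf('%02X', ord($_)) } $line =~ /([^\x20-\x7E])/g;
}

# Made-up sample lines; the second hides a non-breaking space mid-line.
for my $line ("plain text", "sneaky\xA0space") {
    my @bad = unexpected_chars($line);
    warn "unexpected characters (@bad) in line\n" if @bad;
}
```

Because the check covers the whole line, it also catches stray characters in the middle of a field, not just at the end.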
Re: When is a trailing space not a trailing space? by ikegami (Patriarch) on Oct 17, 2005 at 15:15 UTC
To find out which character it is (assuming it's the last character in the line), do `printf("%X\n", ord(substr($line, -1)));` Use that number instead of ### to remove that character along with whitespace: `$line =~ s/[\s\x{###}]+$//;` Of course, you could remove all trailing control characters and whitespace: `$line =~ s/[\s\x00-\x1F\x7F]+$//;` There's also a "print" character class that could help you, I believe. By the way, \r is included in \s: `$line = "\r\n"; print(length($line), "\n"); $line =~ s/\s+$//; print(length($line), "\n");`
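For the character reported in this thread (160, which is A0 in hex), the substitution might look like the sketch below. It assumes the line holds the non-breaking space as a single character \xA0 (Latin-1 data or already-decoded text); if the file is UTF-8 read as raw bytes, the character arrives as two bytes and this would not be enough. `strip_trailing` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Remove trailing whitespace together with trailing non-breaking spaces.
# \xA0 is listed explicitly so the result does not depend on whether \s
# happens to match it under the matching rules in effect.
sub strip_trailing {
    my ($line) = @_;
    $line =~ s/[\s\xA0]+\z//;
    return $line;
}

# Made-up sample: a field followed by space, NBSP, and a Windows line ending.
my $raw   = "some field \xA0\r\n";
my $clean = strip_trailing($raw);
printf "%d -> %d characters\n", length($raw), length($clean);   # prints "14 -> 10 characters"
```

Only the end of the line is touched, so a non-breaking space in the middle of a field is left alone.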
Re: When is a trailing space not a trailing space? by Roy Johnson (Monsignor) on Oct 17, 2005 at 14:57 UTC
Are you doing `s/\s+$//mg`? You say you're slurping the file and parsing it, which suggests that you're handling multiple lines at once, so you need to do multi-line matching. Alternatively, is it possible that your output routine is padding things?

Caution: Contents may have been coded under pressure.
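To illustrate the multi-line case, here is a small sketch that cleans every line of a slurped string in one pass. It deliberately uses `[ \t\r]` instead of `\s`, which is a choice made for this example so that newlines and blank lines survive; `strip_each_line` is a hypothetical name.

```perl
use strict;
use warnings;

# Strip trailing spaces, tabs, and carriage returns from every line.
# With /m, $ matches before each newline as well as at the end of the
# string; /g repeats the substitution for every line.
sub strip_each_line {
    my ($text) = @_;
    $text =~ s/[ \t\r]+$//mg;
    return $text;
}

# Made-up slurped content with mixed trailing junk and one blank line.
my $slurped = "one  \r\ntwo\t\n\nthree ";
print strip_each_line($slurped), "\n";
```

Without the `/m` and `/g` modifiers, only the end of the whole string would be cleaned, and the trailing spaces on the earlier lines would remain.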
Re^2: When is a trailing space not a trailing space? by markguy (Scribe) on Oct 17, 2005 at 15:07 UTC
It's a line-by-line parse. Each line is a field basically, although of course there are fun-filled exceptions to this "rule". For good measure, I was stripping each potential field just before displaying it. And the only output routine I have going at the moment is `print`, so I'm going to have to trust it's something in the file. Appreciate the help in any case!
Re: When is a trailing space not a trailing space? by pajout (Curate) on Oct 17, 2005 at 14:56 UTC
Could you open the HTML file in a hex editor to see what is really at the ends of the lines?
Re^2: When is a trailing space not a trailing space? by markguy (Scribe) on Oct 17, 2005 at 14:59 UTC
The editor I'm using (HTML-Kit) claims it will show any character that's there. Is there a particular editor that would... I don't know... actually show the character? :)
Re^3: When is a trailing space not a trailing space? by pajout (Curate) on Oct 17, 2005 at 15:05 UTC
The idea is not to show the graphic representation of weird characters, but their codes (numbers, byte representations), for instance 13, 10 for a Windows end of line, 160 for a non-breaking space, etc. If you are on Linux, use 'od'.
Re^3: When is a trailing space not a trailing space? by gri6507 (Deacon) on Oct 17, 2005 at 15:04 UTC
HexEdit is always a good choice for Macs. For Linux, KHexEdit is good. For Windows, HexEdit is alright.
Re: When is a trailing space not a trailing space? by Your Mother (Archbishop) on Oct 17, 2005 at 19:58 UTC
bluto sort of suggested this already. For human-readable files (not binary/data/etc.) this should be a good fix and can be used in other places too. `s/[^[:print:]]+\z//;`