Invisible characters

lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello High Perl Monks!
I have attempted to split a file based on a comma.
This should work, but I think that there are some return
characters that I am not accounting for in my script.

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
# open a filehandle
# file being read from
open(my $in, "/Users/myDir/Desktop/hassan2.txt");
open(my $out, ">/Users/myDir/Desktop/hassan_out.txt"); 

while(<$in>){
    my @fields = split /,/;
    my $field1 = $fields[0];
    my $field2  = $fields[1];
    print  $out "$field1\t$field2\n";
}
close($in);
close($out);

The file that I am processing looks like this:

"-0.500, 4.502e-6"
"-0.499, 4.474e-6"
"-0.498, 4.458e-6"
"-0.497, 4.445e-6"
"-0.496, 4.433e-6"
"-0.495, 4.421e-6"

And my output looks like this:

"-0.500     4.502e-6"
"-0.499

When it should look like this:

-0.499 4.474e-6
-0.498 4.458e-6
-0.497 4.445e-6
-0.496 4.433e-6
-0.495 4.421e-6

Any ideas?


Re: Invisible characters by toolic (Bishop) on Feb 23, 2010 at 21:53 UTC
You could try Tip #3 from the Basic debugging checklist: check for unprintable characters by converting them into their ASCII hex codes using ord:

my $copy = $field1;
$copy =~ s/([^\x20-\x7E])/sprintf '\x{%02x}', ord $1/eg;
print ":$copy:\n";
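
As a minimal sketch of that tip applied to a whole line instead of a single field: `show_invisible` is a hypothetical helper name, and the sample string with an embedded carriage return is an assumption made for illustration, not data taken from the poster's file.

```perl
use strict;
use warnings;

# Hypothetical helper: make every character outside printable ASCII visible
# by replacing it with its hex code, as in the tip above.
sub show_invisible {
    my ($text) = @_;
    (my $copy = $text) =~ s/([^\x20-\x7E])/sprintf '\x{%02x}', ord $1/eg;
    return $copy;
}

# Sample line with an embedded carriage return (assumed for illustration).
my $line = qq{"-0.500, 4.502e-6"\r"-0.499};
print ':', show_invisible($line), ":\n";
# prints :"-0.500, 4.502e-6"\x{0d}"-0.499:
```

Running each line read from the input file through such a helper before splitting it would show whether a `\x{0d}` (carriage return) or some other control character is hiding in the data.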
Re: Invisible characters by ramlight (Friar) on Feb 23, 2010 at 22:11 UTC
I suspect that toolic has the right idea. Your basic code is fine; I've tried it on my system and it works for me:

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
# open a filehandle
# file being read from
while(<DATA>){
    my @fields = split /,/;
    my $field1 = $fields[0];
    my $field2 = $fields[1];
    print "$field1\t$field2\n";
}
__DATA__
"-0.500, 4.502e-6"
"-0.499, 4.474e-6"
"-0.498, 4.458e-6"
"-0.497, 4.445e-6"
"-0.496, 4.433e-6"
"-0.495, 4.421e-6"

gives for output

"-0.500 4.502e-6"
"-0.499 4.474e-6"
"-0.498 4.458e-6"
"-0.497 4.445e-6"
"-0.496 4.433e-6"
"-0.495 4.421e-6"

So the best bet is that you do, as your title says, have some invisible characters hiding in there. So follow toolic's advice and you'll smoke them out.
Re: Invisible characters by Marshall (Canon) on Feb 24, 2010 at 00:24 UTC
This is sort of tricky because it just looks like it is working... The first problem is that the \n is being kept; that's why ramlight's printout has an extra blank line in it. One way to remove the \n is with chomp(). But there is another problem: because you only split on the "," character, the space before the 2nd token is being kept with it. I modified your print statement to show this:

print "\"$field1\"\t\"$field2\"\n";

Output (I took the "" out of the data input because it made the printout too confusing):

"-0.500" " 4.502e-6"
"-0.499" " 4.474e-6"
"-0.498" " 4.458e-6"
"-0.497" " 4.445e-6"
"-0.496" " 4.433e-6"
"-0.495" " 4.421e-6"

One thing that could be done is to split on any sequence of whitespace chars or the ",", like this:

my @fields = split /[\s,]+/;

That would handle extra "blank type" chars like \t. It also appears that you have "'s in the input that you don't want. One way to get rid of them would be `s/"//g;`, which just deletes them all. I suspect that you will find some unprintable character, as toolic suggests. Of course, one way to deal with this is to modify the regex that gets rid of the " char so that it deals with any characters that we don't want to see: `s/[^\w\.\,\-]//g;` gets rid of anything not contained within this set, which would include the " and \n chars. It is also possible to use "tr" for the same purpose. tr is a simple-minded thing and as such it is faster than the s/// type operation. But fixing the input file is better, as this "weirdness" can fester and propagate. Finally, you can use an array slice and get rid of the @fields intermediate variable. So here is some code with various possibilities... Hope some combination of ideas works for you.

while(<DATA>){
    chomp();  #optional
    #s/[^\w\.\,\-]//g;
    tr/0-9.e ,-//dc;  #another possibility ('-' goes last so that .-e is not read as a range)
    my ($field1, $field2) = (split /[\s,]+/)[0,1];
    print "\"$field1\"\t\"$field2\"\n";
}
__DATA__
"-0.500, 4.502e-6"
"-0.499, 4.474e-6"
"-0.498, 4.458e-6"
"-0.497, 4.445e-6"
"-0.496, 4.433e-6"
"-0.495, 4.421e-6"
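
The ideas in this reply (drop the quotes, drop the line terminator, split on whitespace or commas) could be combined roughly as follows to produce the tab-separated, unquoted output the question asks for. This is only a sketch: `clean_fields` is a hypothetical helper name, and the two sample lines, one of them ending in `\r\n`, are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first two fields of one input line.
sub clean_fields {
    my ($line) = @_;
    $line =~ s/"//g;           # drop the double quotes
    $line =~ s/^\s+|\s+$//g;   # trim whitespace at both ends, including \r and \n
    return (split /[\s,]+/, $line)[0, 1];
}

# Sample lines (assumed); the second one ends in \r\n.
for my $line (qq{"-0.500, 4.502e-6"\n}, qq{"-0.499, 4.474e-6"\r\n}) {
    my ($field1, $field2) = clean_fields($line);
    print "$field1\t$field2\n";
}
```

Trimming with `\s+` rather than relying on chomp() alone means a stray `\r` at the end of a line is removed as well, since chomp() only strips whatever `$/` is set to.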
Re^2: Invisible characters by lomSpace (Scribe) on Feb 24, 2010 at 19:26 UTC
Hi Marshall,
The input file is an Excel file saved as a text document. I am using a Mac to run the code. Any suggestions concerning the input file formatting?
Lom Space
Re^3: Invisible characters by Marshall (Canon) on Feb 25, 2010 at 15:15 UTC
Hi Lom Space! I take it that Excel is running on your Mac and that the Perl code is also running on the Mac? There can be problems with transferring text files between Mac, Unix and Windows because they use different "line termination" sequences. Classic Mac OS uses \r, Unix (including Mac OS X) uses \n, and Windows uses \r\n, so there can be some "weirdness". Put the chomp(); in the code. I've found Perl to be pretty smart about dealing with line termination issues. I haven't seen Excel export a "weirdo character" when doing a text export. If you can "cat" the file, then Perl can read it. Put the chomp() in and then just print without any splits or whatever. That should work.

Now you may be exporting this spreadsheet as a .CSV file, which means "Comma Separated Value". Parsing this type of format is one of those things that appears easy, but is not so easy. There are a number of Perl modules that deal with CSV, but that doesn't appear to be your main problem. A CSV file is a text file. Put the chomp() in and then just print the data lines and see if that works.
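
One possibility consistent with the output shown in the question is that the exported file uses bare \r line terminators, so that `<$in>` returns the whole file as a single "line"; the thread does not confirm this. Under that assumption, a sketch of splitting text into lines regardless of the terminator might look like this (`split_lines` is a hypothetical helper name, and the sample data is made up to mimic the input):

```perl
use strict;
use warnings;

# Hypothetical helper: split a block of text into lines, accepting
# \r\n (Windows), bare \r (classic Mac OS) or \n (Unix) as the terminator.
sub split_lines {
    my ($text) = @_;
    return split /\r\n|\r|\n/, $text;
}

# Sample data with bare \r terminators (an assumption about the export).
my $text = qq{"-0.500, 4.502e-6"\r"-0.499, 4.474e-6"\r};

for my $line (split_lines($text)) {
    print "[$line]\n";
}
```

To apply this to a real file, the contents could be slurped first with `my $text = do { local $/; <$in> };` and then passed to the helper. If the terminator turns out to be \n after all, the helper still behaves the same way, so it is safe to try either way.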

