PerlMonks  

Invisible characters

by lomSpace (Scribe)
on Feb 23, 2010 at 21:43 UTC

lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello High Perl Monks!
I have attempted to split a file based on a comma.
This should work, but I think that there are some return
characters that I am not accounting for in my script.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

# open a filehandle
# file being read from
open(my $in, "/Users/myDir/Desktop/hassan2.txt");
open(my $out, ">/Users/myDir/Desktop/hassan_out.txt");

while(<$in>){
    my @fields = split /,/;
    my $field1 = $fields[0];
    my $field2 = $fields[1];
    print $out "$field1\t$field2\n";
}
close($in);
close($out);
The file that I am processing looks like this:
"-0.500, 4.502e-6"
"-0.499, 4.474e-6"
"-0.498, 4.458e-6"
"-0.497, 4.445e-6"
"-0.496, 4.433e-6"
"-0.495, 4.421e-6"
And my output looks like this:
"-0.500 4.502e-6" "-0.499

When it should look like this:
-0.499 4.474e-6
-0.498 4.458e-6
-0.497 4.445e-6
-0.496 4.433e-6
-0.495 4.421e-6

Any ideas?

Replies are listed 'Best First'.
Re: Invisible characters
by toolic (Bishop) on Feb 23, 2010 at 21:53 UTC
    You could try Tip #3 from the Basic debugging checklist: Check for unprintable characters by converting them into their ASCII hex codes using ord
    my $copy = $field1;
    $copy =~ s/([^\x20-\x7E])/sprintf '\x{%02x}', ord $1/eg;
    print ":$copy:\n";
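The same substitution can be applied to a whole input line rather than a single field, which makes stray line terminators visible too. This is only a minimal sketch: `show_invisible` is a hypothetical helper name, and the sample string with an embedded carriage return is an assumption about what the file might contain, not a confirmed diagnosis.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text in which every character
# outside printable ASCII (0x20-0x7E) is replaced by a visible \x{..} escape.
sub show_invisible {
    my ($text) = @_;
    (my $copy = $text) =~ s/([^\x20-\x7E])/sprintf '\x{%02x}', ord $1/eg;
    return $copy;
}

# Assumed sample: two records joined by a bare carriage return.
my $line = qq{"-0.500, 4.502e-6"\r"-0.499, 4.474e-6"\n};
print show_invisible($line), "\n";
# prints: "-0.500, 4.502e-6"\x{0d}"-0.499, 4.474e-6"\x{0a}
```

A `\x{0d}` in the middle of what should be separate lines would suggest the file uses `\r` as its line terminator.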
Re: Invisible characters
by ramlight (Friar) on Feb 23, 2010 at 22:11 UTC
    I suspect that toolic has the right idea. Your basic code is fine; I've tried it on my system and it works for me:
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    # open a filehandle
    # file being read from

    while(<DATA>){
        my @fields = split /,/;
        my $field1 = $fields[0];
        my $field2 = $fields[1];
        print "$field1\t$field2\n";
    }
    __DATA__
    "-0.500, 4.502e-6"
    "-0.499, 4.474e-6"
    "-0.498, 4.458e-6"
    "-0.497, 4.445e-6"
    "-0.496, 4.433e-6"
    "-0.495, 4.421e-6"
    gives for output
    "-0.500 4.502e-6"

    "-0.499 4.474e-6"

    "-0.498 4.458e-6"

    "-0.497 4.445e-6"

    "-0.496 4.433e-6"

    "-0.495 4.421e-6"
    So the best bet is that you do, as your title says, have some invisible characters hiding in there. So follow toolic's advice and you'll smoke them out.
Re: Invisible characters
by Marshall (Canon) on Feb 24, 2010 at 00:24 UTC
    This is sort of tricky because it just looks like it is working... The first problem is that the \n is being kept; that's why ramlight's printout has an extra blank line in it. One way to remove the \n is with chomp(). But there is another problem: because you only split on the "," character, the space before the 2nd token is being kept with it. I modified your print statement to show this:
    print "\"$field1\"\t\"$field2\"\n";

    output... I took out "" in data input because it was too confusing a printout.

    "-0.500" " 4.502e-6"
    "-0.499" " 4.474e-6"
    "-0.498" " 4.458e-6"
    "-0.497" " 4.445e-6"
    "-0.496" " 4.433e-6"
    "-0.495" " 4.421e-6"
    One thing that could be done is split on any sequence of whitespace chars or the ",", like this:
    my @fields = split /[\s,]+/;
    That would handle extra "blank type" chars like \t. It also appears that you have "'s in input that you don't want. One way to get rid of them would be s/"//g; which just deletes them all.
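Putting those two suggestions together might look like the sketch below. `parse_pair` is a hypothetical helper name, and the trim step is an addition not mentioned above: without it, a line that begins with whitespace would yield an empty first field from split.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the double quotes, trim the ends, then split
# on any run of whitespace and/or commas.
sub parse_pair {
    my ($line) = @_;
    $line =~ s/"//g;            # delete every double quote
    $line =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace, incl. \r and \n
    return split /[\s,]+/, $line;
}

my ($x, $y) = parse_pair(qq{"-0.500, 4.502e-6"\n});
print "$x\t$y\n";   # the two numbers, tab-separated, with no quotes or stray space
```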

    I suspect that you will find some unprintable character, as toolic suggests. Of course, one way to deal with this is to modify the regex that gets rid of the " char so that it removes any characters that we don't want to see: s/[^\w\.\,\-]//g; gets rid of anything not contained within this set, which would include the " and \n chars. It is also possible to use "tr" for the same purpose. tr is a simple-minded thing and as such it is faster than the s/// type operation. But fixing the input file is better, as this "weirdness" can fester and propagate.

    Finally, you can use a list slice and get rid of the @fields intermediate variable. So here is some code with various possibilities... Hope some combination of ideas works for you.

    while(<DATA>){
        chomp();  #optional
        #s/[^\w\.\,\-]//g;
        tr/0-9.e ,-//dc; #another possibility.. hyphen goes last so it is a literal, not a range
        my ($field1, $field2) = (split /[\s,]+/)[0,1];
        print "\"$field1\"\t\"$field2\"\n";
    }
    __DATA__
    "-0.500, 4.502e-6"
    "-0.499, 4.474e-6"
    "-0.498, 4.458e-6"
    "-0.497, 4.445e-6"
    "-0.496, 4.433e-6"
    "-0.495, 4.421e-6"
      Hi Marshall,
      The input file is an Excel file saved as a text document. I am using a Mac to run the code.
      Any suggestions concerning the input file formatting?
      Lom Space
        Hi Lom Space!
        I take it that Excel is running on your Mac and that the Perl code is also running on the Mac?

        There can be problems with transferring text files between Mac, Unix and Windows because there are different "line termination" sequences: classic Mac OS uses \r, Unix \n, Windows \r\n, so there can be some "weirdness".
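If the terminators turn out to be the issue, one way to sidestep it is to split the text on any of the three sequences instead of relying on the default line-by-line read. The sketch below assumes the export uses bare \r terminators; that is a guess, not something confirmed in this thread. `split_lines` is a hypothetical helper name, and the sample data is held in a string rather than read from the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: break text into lines whether it uses
# \r\n (Windows), bare \r (classic Mac), or \n (Unix) terminators.
sub split_lines {
    my ($text) = @_;
    return split /\r\n|\r|\n/, $text;
}

# Simulated file contents with bare \r terminators (an assumption).
my $content = qq{"-0.500, 4.502e-6"\r"-0.499, 4.474e-6"\r};

for my $line (split_lines($content)) {
    $line =~ s/"//g;                                  # drop the quotes
    my ($field1, $field2) = split /\s*,\s*/, $line;   # comma plus any padding
    print "$field1\t$field2\n";
}
```

With a real \r-terminated file, setting `local $/ = "\r";` before the read loop should have a similar effect, since `<$in>` and chomp() both follow the input record separator.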

        Put the chomp(); in the code. I've found Perl to be pretty smart about dealing with line termination issues. I haven't seen Excel export a "weirdo character" when doing a text export. If you can "cat" the file, then Perl can read it. Put the chomp() in and then just print without any splits or whatever. That should work.

        Now you may be exporting this spreadsheet as a .CSV file, which means "Comma Separated Values". Parsing this type of format is one of those things that appears easy, but is not so easy. There are a number of Perl modules that deal with CSV, but that doesn't appear to be your main problem. A CSV file is a text file.

        Put the chomp() in and then just print the data lines and see if that works.
