Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to pull an index from a flat file which appears to contain a strange character. This character does not appear in Notepad, and chomp is not stripping it. Is anyone familiar with this error? This is the character:
0001
There is no carriage return in the file. What are the proper steps for stripping characters like this? Thank you.

Replies are listed 'Best First'.
Re: Strange character beginning text files
by beable (Friar) on Jul 20, 2004 at 02:49 UTC
    You should try using the ord($ch) function to find out what the character is. It looks like it might be ASCII character number 1, but it's hard to tell on a webpage. If it is an unprintable ASCII character, you could use the tr operator to delete it:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "\001hello\002world\003lookit\004these\005weird\006characters";
    my $string2 = $string;

    # replace ASCII chars from 0 to 8 with spaces
    $string =~ tr[\000-\010][ ];

    # or delete weird chars:
    $string2 =~ tr[\000-\010][]d;

    print "string is $string\n";
    print "string2 is $string2\n";
    __END__
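A rough sketch of the ord suggestion: dump the ordinal of each character at the start of a line so the mystery character can be identified. The sub name describe_chars and the sample string are made up for illustration; in practice you would pass in the first line read from the flat file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return one "decimal octal hex" description per
# character of the given string, so unprintable characters become visible.
sub describe_chars {
    my ($text) = @_;
    return map { sprintf "%3d \\%03o 0x%02x", ord($_), ord($_), ord($_) }
           split //, $text;
}

# Sample data standing in for a line from the flat file (an assumption).
my $line = "\001ab";
print "$_\n" for describe_chars($line);
```

A leading value of 1 would confirm the guess of ASCII character number 1; a 0 would point to a NUL instead.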

      To limit data to the printable ASCII set (plus tab, LF and CR) you can just do tr/\011\012\015\040-\176//cd. You may want to lose or change one of the CR LF chars as well.

      Oops, forgot tab, thanks beable
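A minimal sketch of that tr in use, wrapped in a sub so the original string is left untouched. The name keep_printable and the sample data are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only tab, LF, CR and the printable ASCII
# range \040-\176. The /c flag complements the list and /d deletes
# everything that falls in the complement.
sub keep_printable {
    my ($text) = @_;
    (my $clean = $text) =~ tr/\011\012\015\040-\176//cd;
    return $clean;
}

my $dirty = "\001index\000 42\tok\n";   # made-up sample data
print keep_printable($dirty);
```

Note that this also drops every character above \176, so it is only appropriate when the data is expected to be plain ASCII.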

      cheers

      tachyon

        Tut tut, sirrah. \011 is printable. From "man ascii":
        Oct   Dec   Hex   Char
        011   9     09    HT '\t'
        </nitpick>
Re: Strange character beginning text files
by crabbdean (Pilgrim) on Jul 20, 2004 at 04:37 UTC
    Well, if chomp is not stripping it then it's not a return character. You could try chop, if it's always at the end of the line.

    Alternatively you could convert it to bits like this:
        map { print unpack "B*", chr } qw\0001\
    Or to just characters like this:
        map { print chr } qw/0001/
    ... and then check it against an ASCII table on the net and see what it converts to.

    Interestingly, I ran this on the character and got a NULL string, that is, 00000000. Hence why chomp mightn't be picking it up. What you are seeing could be how your NULL appears in your flat file, which would also explain why it's probably not showing in Notepad. Additionally, when I attempt to convert it to a character using "chr" I get nothing appearing on my console, which is also consistent with it being a NULL character.

    Furthermore, in C and similar languages strings in memory are terminated with a NULL character, which the program uses to signify the end of the string. If your flat file was written by another program, the characters could very well be NULLs at the end of each string.

    Again, all this is hypothesis. How to remove them depends on where they appear in the flat file. If you are the creator of the flat file, try amending the program that writes it to chop the last character from each string/line before writing it to the flat file. Or convert the whole file to bits, delete all nulls, and then convert back to characters (probably not required unless you're really desperate).

    As a side note, the 1 on the end of this 0001 suggests to me it could also be the 00000001 character, which is the "Start of heading" (SOH) character, unless your question is simply relating to the box () which for me comes out as 00000000.
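To make the two candidates concrete, here is a small sketch that prints the bit pattern of NUL and of SOH, then strips either one from the start of a line. The sub name bits_of and the sample line are invented for illustration, and it assumes the stray character sits at the beginning of the line.

```perl
use strict;
use warnings;

# Hypothetical helper: render one character as its 8-bit pattern.
sub bits_of {
    my ($char) = @_;
    return unpack "B8", $char;
}

print "NUL: ", bits_of("\000"), "\n";   # chr(0)
print "SOH: ", bits_of("\001"), "\n";   # chr(1), "Start of heading"

# Strip either character from the start of a line (made-up sample data).
my $line = "\001first field";
$line =~ s/\A[\000\001]+//;
print "$line\n";
```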


    Dean
    The Funkster of Mirth
    Programming these days takes more than a lone avenger with a compiler. - sam
    RFC1149: A Standard for the Transmission of IP Datagrams on Avian Carriers

      Actually, chomp typically eats \n only, which is the line feed char LF, not the carriage return char CR....

      printf "CR \\r \\%03o 0x%02x\n", ord("\r"), ord("\r");
      printf "LF \\n \\%03o 0x%02x\n\n", ord("\n"), ord("\n");
      my $str = "str\015\012";
      for ( 1 .. 2 ) {
          print "string '$str'\n";
          print "length ", length $str, "\n";
          chomp $str;
          print "string '$str'\n";
          print "length ", length $str, "\n\n";
      }

      Technically chomp removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module).

      cheers

      tachyon

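        Well, to be exact, chomp removes whatever string happens to match the current value of "$/" (input record separator), which defaults to "\015\012" for Windows text-mode, "\n" for Unix. (update: the Windows part is wrong; $/ defaults to "\n" on every platform, and on Windows the text-mode I/O layer converts "\015\012" to "\n" on input. See replies below for correct info)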

        And it only does this when the string matching $/ happens to occur at the end of the scalar value being chomped.

        perl -e '$/ = "\n"; $_ = "str\015\012"; chomp; s/(\s)/sprintf("%o",ord($1))/eg; print $_,$/'
        # prints "str15"
        perl -e '$/ = "\r\n"; $_ = "str\015\012"; chomp; s/(\s)/sprintf("%o",ord($1))/eg; print $_,$/'
        # prints "str"
        perl -e '$/="\r\n"; $_ = "foo\015\012str\015\012"; chomp; s/(\s)/sprintf("%o",ord($1))/eg; print $_,$/'
        # prints "foo1512str"
        Update: Honest, I really did (start to) post this before tachyon made it redundant. And I confess I was not speaking from personal experience (lucky me) about the default value of $/ on ms-win -- thanks to tachyon for the correction.