Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am having an issue parsing a pipe-delimited JBoss log file. After reading it, Perl's output has what appear to be spaces between each character. It probably has something to do with character encoding, but file -i reports it as ASCII text, with very long lines. I also tried stripping out null bytes with tr. I've reproduced the issue both in Cygwin on my 64-bit Windows machine and on a 64-bit CentOS 7 host. The log comes from a very stripped-down version of Red Hat 7 running in a Docker container on OpenShift. I'm kind of at a loss here, and it's really driving me nuts. Please let me know if you have any ideas. Thanks much.

Replies are listed 'Best First'.
Re: Parsing issue (null bytes?)
by davido (Cardinal) on Sep 08, 2017 at 14:49 UTC

    Look at a hex dump of the character values and see what is not being accounted for in your code.

    On Linux: xxd filename.whatever

    At least with a look at the hex dump you should be able to determine if there are characters encoded above 0x7F. Without sample input and code it's kind of hard to guess what's happening.


    Dave
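
Where xxd is not available, the same check can be sketched in Perl itself. This is a minimal sketch, not a definitive tool: the sub names scan_bytes and scan_file are hypothetical, and the file is read in raw mode so that no decoding layer hides anything. It counts NUL bytes and bytes above 0x7F and records where the first of each occurs.

```perl
use strict;
use warnings;

# Hypothetical helper: count NUL bytes and bytes above 0x7F in a raw byte
# string, and record the offset of the first occurrence of each.
sub scan_bytes {
    my ($bytes) = @_;
    my %report = (nul => 0, high => 0, first_nul => undef, first_high => undef);
    while ($bytes =~ /([\x00\x80-\xFF])/g) {
        my $offset = pos($bytes) - 1;    # pos() points just past the match
        if ($1 eq "\x00") {
            $report{nul}++;
            $report{first_nul} //= $offset;
        }
        else {
            $report{high}++;
            $report{first_high} //= $offset;
        }
    }
    return \%report;
}

# Hypothetical helper: slurp a file as raw bytes and scan it.
sub scan_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Couldn't read '$path': $!";
    local $/;                            # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return scan_bytes($bytes // '');
}

# Only report when run as a script with a filename argument.
if (!caller && @ARGV) {
    my $r = scan_file($ARGV[0]);
    printf "NUL bytes:  %d (first at offset %s)\n", $r->{nul},  $r->{first_nul}  // 'n/a';
    printf "High bytes: %d (first at offset %s)\n", $r->{high}, $r->{first_high} // 'n/a';
}
```

The reported offsets can then be looked up in the hex dump to see exactly which bytes are involved.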

Re: Parsing issue (null bytes?)
by Corion (Patriarch) on Sep 08, 2017 at 14:48 UTC

    Most likely, your file is encoded as UTF-16.

    Try Encode::decode, or opening your file as UTF-16:

    open my $log, '<:encoding(UTF-16)', $logfilename or die "Couldn't read '$logfilename': $!";
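
A minimal sketch of the Encode::decode route, assuming the data really is UTF-16 and that its byte order is little-endian (both are assumptions made purely for illustration; the sub name decode_utf16le is hypothetical). Naming the byte order explicitly matters because the plain UTF-16 encoding expects a byte order mark at the start of the data. The sketch also shows why ASCII text stored as UTF-16 looks as if it had a gap after every character: each character is followed by a zero byte.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode raw bytes as UTF-16LE.
# The little-endian byte order is an assumption, not something the thread confirms.
sub decode_utf16le {
    my ($bytes) = @_;
    return decode('UTF-16LE', $bytes);
}

# ASCII text encoded as UTF-16LE carries a zero byte after every character.
my $raw  = encode('UTF-16LE', 'INFO');
my $text = decode_utf16le($raw);

printf "raw bytes: %s\n", join ' ', map { sprintf '%02x', ord } split //, $raw;
print  "decoded:   $text\n";
```

If the raw bytes of the log do not show this alternating pattern, UTF-16 is probably not the explanation.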
Re: Parsing issue (null bytes?)
by Anonymous Monk on Sep 08, 2017 at 16:03 UTC

    Thanks guys. A hex dump revealed that there are definitely chars encoded above 0x7f, and reading the file after opening it as UTF-16 returned "UTF-16:Unrecognised BOM 5b64 at ./test line 11." UTF-32 fails as well. I will check out Encode::decode, but if you have any suggestions on how to get around it, please let me know. Thanks for your help. Mark

      UTF-16:Unrecognised BOM 5b64 at ./test line 11. UTF-32 fails as well.

      To figure out what encoding a file uses, see my comments in Re: Converting UTF8 to ANSI and its replies (except for the part about File::BOM, which doesn't seem to apply here).
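
One quick check that follows from the error message is to look at the first bytes of the file directly. The "5b64" in "Unrecognised BOM 5b64" is simply the first two bytes of the data, which are the ASCII characters "[d", so the file starts with ordinary text rather than a byte order mark. The sketch below is a hedged illustration only: the sub names sniff_bom and sniff_file are hypothetical, and a missing BOM does not by itself prove which encoding is in use.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from a leading byte order mark.
# Returns undef when no BOM is present, which is normal for ASCII files
# and common for UTF-8 files.
sub sniff_bom {
    my ($head) = @_;                     # first few raw bytes of the file
    # Test the 4-byte UTF-32 marks first: the UTF-32LE mark begins with
    # the same two bytes as the UTF-16LE mark.
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;
}

# Hypothetical helper: read the first four raw bytes of a file and sniff them.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Couldn't read '$path': $!";
    read($fh, my $head, 4);
    close $fh;
    return sniff_bom($head // '');
}

# Only report when run as a script with a filename argument.
if (!caller && @ARGV) {
    my $guess = sniff_file($ARGV[0]);
    print defined $guess ? "BOM suggests $guess\n" : "No BOM found\n";
}
```

When no BOM is found, a byte order has to be named explicitly (for example UTF-16LE) before a UTF-16 decode can even be attempted.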

      Update:

      A hex dump revealed that there are definitely chars encoded above 0x7f

      Can you show us?

        Thanks, haukex. It's guessing ASCII. I can't really show much of the log without changing the data, but here is a bit:

        0000000: 5b64 6566 6175 6c74 2074 6173 6b2d 3335  [default task-35
        0000010: 5d20 2049 4e46 4f20 7c20 3230 3137 2d30  ]  INFO | 2017-0
        0000020: 392d 3035 2031 313a 3233 3a33 352c 3931  9-05 11:23:35,91

        Hopefully I'm understanding it correctly.
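
As a cross-check on reading the dump, the hex columns can be turned back into bytes and tested. This is a small sketch under stated assumptions: the sub name bytes_from_hex is hypothetical, and only the 48 bytes posted above are examined, so it says nothing about the rest of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the hex columns of an xxd-style dump into raw bytes.
sub bytes_from_hex {
    my ($hex) = @_;
    $hex =~ s/\s+//g;                    # drop the spaces between groups
    return pack 'H*', $hex;
}

# Hex columns of the three 16-byte rows posted above
# (offsets and the text column removed).
my $bytes = bytes_from_hex(join ' ',
    '5b64 6566 6175 6c74 2074 6173 6b2d 3335',
    '5d20 2049 4e46 4f20 7c20 3230 3137 2d30',
    '392d 3035 2031 313a 3233 3a33 352c 3931',
);

# Count NUL bytes and bytes above 0x7F in the excerpt.
my $nul  = () = $bytes =~ /\x00/g;
my $high = () = $bytes =~ /[\x80-\xFF]/g;

print "text: $bytes\n";
print "NUL bytes: $nul, bytes above 0x7F: $high\n";
```

Every byte in this excerpt is plain ASCII with no zero bytes in between, so whatever causes the odd output is not visible in these first 48 bytes and would have to sit further into the file.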

Re: Parsing issue (null bytes?)
by Anonymous Monk on Sep 10, 2017 at 13:46 UTC
    Please post relevant snippet(s) of your data ... sanitized as necessary ... and point out to us the offsets which appear to contain the anomalies of which you speak.