agaved has asked for the wisdom of the Perl Monks concerning the following question:

I have more than 5000 text files, generated on Windows from PDF files, that I need to process on a Mac OS X machine. I ran dos2unix on all of them to fix the newlines and to convert the encoding from UTF-16LE to UTF-8.

In 4949 cases everything goes fine, but for 320 files dos2unix skips the conversion, saying they are binary files.

This is consistent with the output of file -c, which gives data for the 320 skipped files and text for the others. However, they are text on visual inspection ...

How can I repair the 320 files? At first I suspected it was the presence of the BOM, but the BOM also appears in the files that convert without problems.

Furthermore, both the data and the text files start with:
0000000 ff fe 3d 00 20 00 70 00 61 00 67 00 65 00 20 00
0000010 31 00 20 00 3d 00 0a 00 0d 00 0d 00 0a 00
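(As an aside, the dump above can be decoded directly to confirm it is ordinary UTF-16LE text with a BOM; this is a sketch, not part of the original post:)

```python
# Sketch: decode the posted hex dump to show it is valid UTF-16LE text
# (BOM ff fe, then "= page 1 =" and a LF/CR/CR/LF run), not binary data.
hex_dump = (
    "ff fe 3d 00 20 00 70 00 61 00 67 00 65 00 20 00"
    " 31 00 20 00 3d 00 0a 00 0d 00 0d 00 0a 00"
)
data = bytes.fromhex(hex_dump)
print(repr(data.decode("utf-16")))  # '= page 1 =\n\r\r\n'
```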

I know this has nothing strictly to do with Perl, but I am really lost here ...

Any hint? Thanks in advance.

Replies are listed 'Best First'.
Re: Weird file type problems transferring from Windows to Mac OS
by choroba (Cardinal) on May 28, 2013 at 14:07 UTC
    You did not give enough information. Running your data through file:
    echo '0000000 ff fe 3d 00 20 00 70 00 61 00 67 00 65 00 20 00 0000010 31 00 20 00 3d 00 0a 00 0d 00 0d 00 0a 00' | xxd -r | file -

    Output:

    Little-endian UTF-16 Unicode text, with CR, LF line terminators

    Can you try removing characters from the end of a misbehaving file until it becomes "text", then post the offending character?
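(The manual truncate-and-retest loop suggested here can also be automated; a minimal sketch, assuming the files are meant to be UTF-16LE, is to decode strictly and report the byte offset where decoding first fails:)

```python
# Sketch: report the byte offset of the first unit a strict UTF-16LE
# decode rejects (e.g. a lone surrogate), or None if the data is clean.
def first_bad_offset(data: bytes):
    try:
        data.decode("utf-16-le")
        return None
    except UnicodeDecodeError as e:
        return e.start

# Example: valid text, then the same text with a lone high surrogate
# (bytes 00 d8 = U+D800) spliced in at byte offset 22.
good = "= page 1 =\n".encode("utf-16-le")
bad = good + b"\x00\xd8" + "more text".encode("utf-16-le")
print(first_bad_offset(good))  # None
print(first_bad_offset(bad))   # 22
```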

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

I would assume that the file starts with a UTF-16 BOM:

      ff fe .. ..

Maybe just removing the first two bytes "fixes" things, or alternatively handling the BOM via Encode ...

        Thanks, but I already checked that. There are files with BOM that are processed with no problem ...
Choroba, thanks for the debugging tip.

Running hexdump -n NNN BAD.txt | xxd -r | file -, with NNN up to 1506 the result is text; at 1508 it gives data.

      Running hexdump -n 1520 BAD.txt I get:
0000500 55 00 c7 00 c3 00 4f 00 20 00 2e 00 2e 00 2e 00
0000510 2e 00 2e 00 2e 00 2e 00 2e 00 2e 00 2e 00 2e 00
*
00005e0 2e 00 20 00 34 00 20 00 20 00 0d 00 0a 00 32 00
00005f0

Strange that the weird characters appear well before offset 1508, though ...

Try using iconv instead of dos2unix to convert the Unicode character encoding of the files. The gremlins in the text might be something like unpaired UTF-16 surrogate characters. I would expect iconv to handle these anomalies better than dos2unix; by default it should at least warn you about them.

        Just a thought…

        UPDATE: Another good tool for diagnosing peculiar and elusive character encoding problems is BabelPad.
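(For readers without iconv handy, a rough Python analogue of the iconv -f UTF-16LE -t UTF-8 conversion suggested above, with strict error reporting, might look like this; the function name is illustrative:)

```python
# Sketch: convert UTF-16 data (with BOM) to UTF-8, raising an error on
# invalid sequences such as lone surrogates -- roughly what
# `iconv -f UTF-16 -t UTF-8` would refuse to convert silently.
def convert_to_utf8(data: bytes) -> bytes:
    text = data.decode("utf-16")   # strict decode: raises UnicodeDecodeError
    return text.encode("utf-8")

sample = b"\xff\xfe" + "= page 1 =\r\n".encode("utf-16-le")
print(convert_to_utf8(sample))  # b'= page 1 =\r\n'

# A lone high surrogate (00 d8 after the BOM) makes the conversion fail
# instead of producing a silently corrupted file.
try:
    convert_to_utf8(b"\xff\xfe\x00\xd8")
except UnicodeDecodeError:
    print("invalid input rejected")
```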

Re: Weird file type problems transferring from Windows to Mac OS
by hdb (Monsignor) on May 28, 2013 at 14:06 UTC