http://qs1969.pair.com?node_id=845118

mmittiga17 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I hope someone can point me in the right direction or suggest a solution. I have a file that UltraEdit shows as U-DOS, and under TextWrangler it shows as UTF-16 EL. If I view the file on Linux using vi, each line ends in ^M. I am trying to convert it to DOS format using Perl. I am running ActiveState Perl v5.10.0 on a Windows box. Nothing seems to work for making this file ASCII DOS. I have tried s/\n/\r\n/; which does not work. Any suggestions or help would be greatly appreciated.

Re: U-DOS to DOS file conversion
by ikegami (Patriarch) on Jun 16, 2010 at 23:11 UTC

    under TextWrangler it shows as UTF-16 EL

    Then it's probably UCS-2le or UTF-16le. (The latter is a superset of the former.)

    I am trying to convert it to DOS format using Perl.

    Most people would consider "DOS format" to mean encoded using their machine's "ANSI" encoding and using CRLF for line endings.

    In the Western world, the "ANSI" encoding is usually Windows-1252 aka cp1252.

    perl -pe"BEGIN { binmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(cp1252)'; }" < file.wide > file.ansi

    Update:

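    With the one-liner above, the :crlf layer ends up in the wrong position: on Windows it sits below the :encoding layer, so it operates on the encoded bytes rather than on characters. That's not a problem with ASCII-derived encodings, but it is with UTF-16le. You actually need to use a workaround like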

    # wide_to_ansi.pl file.wide file.ansi
    @ARGV == 2
        or die("Incorrect usage\n");
    open(my $fh_in, '<:raw:perlio:encoding(UTF-16le):crlf', $ARGV[0])
        or die("Cannot open input file \"$ARGV[0]\": $!\n");
    open(my $fh_out, '>:raw:perlio:encoding(cp1252):crlf', $ARGV[1])
        or die("Cannot create output file \"$ARGV[1]\": $!\n");
    print($fh_out $_) while <$fh_in>;
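
    To see why the layer order matters, here is a minimal sketch that only simulates a byte-level CRLF translation with a regex; it does not use the real :crlf layer, and the hex_dump helper is a hypothetical name introduced for illustration. Inserting a single 0D byte into UTF-16le data misaligns every 16-bit unit that follows.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a one-character line as UTF-16le: 'a' is 61 00, "\n" is 0A 00.
my $bytes = encode('UTF-16le', "a\n");

# Simulation (assumption): a CRLF translation that runs on the encoded
# bytes sees the 0A byte and inserts a single 0D byte in front of it.
(my $wrong = $bytes) =~ s/\x0A/\x0D\x0A/g;

# What is wanted instead: translate on characters first, then encode.
my $right = encode('UTF-16le', "a\r\n");

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

print "byte-level crlf: ", hex_dump($wrong), "\n";   # 61 00 0D 0A 00
print "char-level crlf: ", hex_dump($right), "\n";   # 61 00 0D 00 0A 00
```

    The byte-level result has an odd number of bytes, so it is no longer valid UTF-16le; the character-level result is what the workaround's layer order aims to produce.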

      Thank you so much, ikegami. That solved my problem, and now I know.

      How can I detect whether a file is in U-DOS or DOS format using Perl code?

        If you mean UTF-16le by "U-DOS", you can check for the presence of 00 bytes. Each ASCII character will be encoded in the form xx 00.
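
        The 00-byte check could be sketched roughly as follows. This is only a heuristic under stated assumptions: the function names, the 4096-byte sample size, and the "at least half" threshold are all hypothetical choices, not something from this thread, and the BOM test (FF FE) would also match a UTF-32le file.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether raw bytes look like UTF-16le text by
# checking for a BOM, or for many 00 bytes at odd offsets (the high byte
# of each ASCII character in UTF-16le, i.e. the "xx 00" pattern).
sub looks_like_utf16le {
    my ($bytes) = @_;
    return 1 if substr($bytes, 0, 2) eq "\xFF\xFE";   # UTF-16le byte order mark

    my $pairs = int(length($bytes) / 2);
    return 0 if $pairs == 0;

    my $zero_high = 0;
    for my $i (0 .. $pairs - 1) {
        $zero_high++ if substr($bytes, 2 * $i + 1, 1) eq "\x00";
    }

    # Assumed threshold: at least half of the 16-bit units have a 00 high byte.
    return $zero_high * 2 >= $pairs ? 1 : 0;
}

# Read the start of a file in raw mode (no layers) and apply the heuristic.
sub file_looks_like_utf16le {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die("Cannot open \"$path\": $!\n");
    read($fh, my $buf, 4096);
    close($fh);
    return looks_like_utf16le(defined $buf ? $buf : '');
}
```

        It might be called as print(file_looks_like_utf16le($ARGV[0]) ? "UTF-16le\n" : "not UTF-16le\n"); a file of mostly non-ASCII text would need a different test.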