I hope I can get some advice on solving this problem. I have a file from a Windows machine. It is encoded as UTF-16LE with a BOM of <FFFE> followed by text data. All data is in the format <4200>, and the ends of lines are <0D00> <0A00>. Each line is CSV. I need to read each line of the file, check some specific fields, and write some of the data, with modifications, to a new file. My problem is that I cannot parse on the CR/LF. Below is a test script I have written (I am not an experienced Perl programmer) which shows the different approaches I have tried. All of them can read the file and all print the @array just fine, but none of them recognize the end of file. I have a small test file, but I am not sure how to post it.

#!/usr/local/bin/perl
#
#
use strict;
use warnings;
use charnames qw( :full );

my @segment_array;

#use File::BOM(); #this tells script to use the Byte Order Mark in reading the files, but it is not on the system I am using

my $file_segment_name = "TestFile1.svd";
# examining the file in hex, it is utf8 encoded, with a Byte order Marker set at FFFE

#read the files
# open (FH_SEGMENT_FILE, "< $file_segment_name") || ERROR('open', 'segment file');
# open (FH_SEGMENT_FILE, '<:encoding(UTF16-LE)', $file_segment_name) || ERROR('open', 'segment file');
# open (FH_SEGMENT_FILE, '<:raw:perlio:encoding(UTF16-LE):crlf', $file_segment_name) || ERROR('open', 'segment file');
open (FH_SEGMENT_FILE, '< $file_segment_name' )|| ERROR('open', 'segment file');
# binmode (FH_SEGMENT_FILE, '<:crlf: encoding(UTF16-LE) ' );
# open (FH_SEGMENT_FILE, '<:raw:crlf: encoding(UTF16-LE) ', $file_segment_name );
# open (FH_SEGMENT_FILE, '< :crlf :encoding(UTF16)', $file_segment_name);

@segment_array=<FH_SEGMENT_FILE>;
close(FH_SEGMENT_FILE);

#print the file - it prints correctly
print "@segment_array";
print "\n\n"; #put some spaces in

for (my $i = 1; $i <=20 ; $i++){
    my $segment_array= shift(@segment_array);;
    print "$segment_array[$i]";
}
exit;

#subs below this point
#************************
#-------------------------
sub ERROR () {
    print "Sever can't $_[0] the $_[1] \n";
}
#----------------------------

I don't know how to post the file and keep the encoding, so below is part of the file displayed using vi in hex mode.

0000000: fffe 4000 4100 6900 7200 4d00 6100 6700  ..@.A.i.r.M.a.g.
0000010: 6e00 6500 7400 2000 5300 7500 7200 7600  n.e.t. .S.u.r.v.
0000020: 6500 7900 2000 4400 6100 7400 6100 0d00  e.y. .D.a.t.a...
0000030: 0a00 2300 5400 7900 7000 6500 3a00 2000  ..#.T.y.p.e.:. .
0000040: 7000 6100 7300 7300 6900 7600 6500 0d00  p.a.s.s.i.v.e...
0000050: 0a00 2300 4100 7000 7000 2000 5600 6500  ..#.A.p.p. .V.e.
0000060: 7200 7300 6900 6f00 6e00 3a00 2000 3800  r.s.i.o.n.:. .8.
0000070: 2e00 3200 2000 0900 2000 4200 7500 6900  ..2. ... .B.u.i.
0000080: 6c00 6400 3a00 2000 3200 3500 3400 3600  l.d.:. .2.5.4.6.
0000090: 3000 0d00 0a00 2300 4300 7200 6500 6100  0.....#.C.r.e.a.
00000a0: 7400 6500 6400 2000 6f00 6e00 3a00 2000  t.e.d. .o.n.:. .
00000b0: 3000 3900 3a00 3100 3300 3a00 3400 3700  0.9.:.1.3.:.4.7.
00000c0: 2000 3000 3400 2f00 3100 3000 2f00 3200   .0.4./.1.0./.2.
00000d0: 3000 3100 3200 0d00 0a00 2300 4300 6100  0.1.2.....#.C.a.
00000e0: 7200 6400 2000 4e00 6100 6d00 6500 2a00  r.d. .N.a.m.e.*.
00000f0: 3a00 2000 5500 6200 6900 7100 7500 6900  :. .U.b.i.q.u.i.
0000100: 7400 6900 2000 4e00 6500 7400 7700 6f00  t.i. .N.e.t.w.o.
0000110: 7200 6b00 7300 2000 5300 5200 2d00 3700  r.k.s. .S.R.-.7.
0000120: 3100 2d00 5500 5300 4200 2000 5700 6900  1.-.U.S.B. .W.i.
0000130: 7200 6500 6c00 6500 7300 7300 2000 4100  r.e.l.e.s.s. .A.
0000140: 6400 6100 7000 7400 6500 7200 2000 3000  d.a.p.t.e.r. .0.
0000150: 3000 3a00 3100 3500 3a00 3600 4400 3a00  0.:.1.5.:.6.D.:.
0000160: 3800 3400 3a00 4500 3100 3a00 4600 4100  8.4.:.E.1.:.F.A.
0000170: 0900 2000 4f00 5300 5600 6500 7200 7300  .. .O.S.V.e.r.s.
0000180: 6900 6f00 6e00 3a00 2000 3600 2e00 3100  i.o.n.:. .6...1.
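
For reference, here is a minimal sketch of one common way to read a file with this byte layout (BOM ff fe, two bytes per character, 0d00 0a00 line ends) in Perl. It assumes the PerlIO layer :encoding(UTF-16) with no LE/BE suffix, which takes the byte order from the BOM, and it removes the CR/LF with a regex instead of relying on a :crlf layer. The sub name read_utf16_lines and the sample data are made up for illustration; this is not the test script above.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: read a BOM-prefixed UTF-16 file and return its
# lines as decoded strings without the trailing CR/LF.
sub read_utf16_lines {
    my ($path) = @_;

    # ':encoding(UTF-16)' (no LE/BE suffix) picks the byte order from the
    # BOM and does not pass the BOM through as text.
    open(my $fh, '<:raw:encoding(UTF-16)', $path)
        or die "Can't open $path: $!";

    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;    # drop CR/LF (or a bare LF) at end of line
        push @lines, $line;
    }
    close($fh);
    return @lines;
}

# Build a small sample file (made-up content): BOM FF FE, then
# UTF-16LE text with CR/LF line endings.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xFF\xFE", encode('UTF-16LE', "#Type: passive\r\n1,2,'x'\r\n");
close($tmp_fh);

my @lines  = read_utf16_lines($tmp_path);
my @fields = split /,/, $lines[1];
print scalar(@lines), " lines; first field of row 2 is $fields[0]\n";
```

Stripping the line ending by hand keeps the sketch independent of how the :crlf layer interacts with :encoding on a given Perl version. Real CSV with quoted commas would need a proper CSV parser rather than split.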

When I run the program, the output of print @array looks like this:

@AirMagnet Survey Data
#Type: passive
#App Version: 8.2 Build: 25460
#Created on: 09:13:47 04/10/2012
#Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA OSVersion: 6.100002 1
#Antenna Angle: 0.000000, Antenna Type:
#dim_X, dim_Y, GPS Map
&,6351.008789,3142.447021, 1
#Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
1334063627,4144.148438,1767.801514,11,'xfinitywifi','C4:0A:CB:68:B9:81',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000

But the second section ALWAYS looks like this. Alternate lines are missing:

#App Version: 8.2 Build: 25460
#Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA OSVersion: 6.100002 1
#dim_X, dim_Y, GPS Map
#Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
1334063627,4144.148438,1767.801514,11,'optimumwifi','C4:0A:CB:68:B9:80',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,6,'Smithtown','0C:D5:02:68:50:3F',-87,12,-94,1,'802.11g','0C:D5:02:68:50:3F','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,1,0,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,11,'Unknown','98:FC:11:90:FA:D0',-89,9,-94,1,'802.11gn','98:FC:11:90:FA:D0','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
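
One thing that may be worth checking separately from the encoding: the for loop in the test script calls shift(@segment_array) on every pass and then indexes the shortened array with $i. The sketch below uses made-up data (line0 .. line9) to show that this combination lands on every second element, which would be consistent with the second listing starting at the third line of the file. It is an illustration under that assumption, not a confirmed diagnosis.

```perl
use strict;
use warnings;

# Made-up stand-in for the lines read from the file.
my @lines = map { "line$_" } 0 .. 9;

# Pattern from the test script: shift removes the first element on every
# pass while $i also advances, so $copy[$i] moves two positions per pass
# relative to the original array.
my @copy = @lines;
my @seen_with_shift;
for (my $i = 1; $i <= 4; $i++) {
    shift(@copy);
    push @seen_with_shift, $copy[$i];
}

# Plain iteration visits each element exactly once, in order.
my @seen_plain;
for my $line (@lines) {
    push @seen_plain, $line;
}

print "with shift: @seen_with_shift\n";
print "plain:      @seen_plain\n";
```

With shift plus an index, the loop collects line2, line4, line6 and line8; iterating over the array directly, or using shift alone and printing its return value, would visit every line.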

In reply to Problems parsing UTF16 file by stu23
