
I hope I can get some advice on solving this problem. I have a file from a Windows machine. It is encoded as UTF-16LE with a BOM of <FFFE>, followed by text data. ALL data is in the format <4200> (two bytes per character), and the ends of lines are <0d00> <0A00>. Each line is CSV. I need to read each line of the file, check some specific fields, and write some of the data, with modifications, to a new file.

My problem is that I cannot parse on the CR/LF. Below is a test script I have written (I am not an experienced Perl programmer) which shows the different approaches I have tried. All of them can read the file and all print the @array just fine, but none of them recognizes the end of line. I have a small test file, but I am not sure how to post it.

#!/usr/local/bin/perl
#
#
use strict;
use warnings;
use charnames qw( :full );

my @segment_array;

#use File::BOM();   # this tells the script to use the Byte Order Mark when reading the files, but it is not on the system I am using

my $file_segment_name = "TestFile1.svd";

# examining the file in hex, it is UTF-16LE encoded, with a Byte Order Mark of FFFE
# read the file

#       open (FH_SEGMENT_FILE, "< $file_segment_name") || ERROR('open', 'segment file');
#       open (FH_SEGMENT_FILE, '<:encoding(UTF16-LE)', $file_segment_name) || ERROR('open', 'segment file');
#       open (FH_SEGMENT_FILE, '<:raw:perlio:encoding(UTF16-LE):crlf', $file_segment_name) || ERROR('open', 'segment file');
        # double quotes, so that $file_segment_name is interpolated
        open (FH_SEGMENT_FILE, "<  $file_segment_name" ) || ERROR('open', 'segment file');
#       binmode (FH_SEGMENT_FILE, '<:crlf: encoding(UTF16-LE) ' );
#       open (FH_SEGMENT_FILE, '<:raw:crlf: encoding(UTF16-LE) ', $file_segment_name );
#       open (FH_SEGMENT_FILE, '< :crlf :encoding(UTF16)', $file_segment_name);
        @segment_array=<FH_SEGMENT_FILE>;
        close(FH_SEGMENT_FILE);

# print the file - it prints correctly
print "@segment_array";

print "\n\n";  # put some blank lines in

# second section: this is the loop whose output skips alternate lines
for (my $i = 1; $i <=20 ; $i++){

        my $segment_array= shift(@segment_array);
        print "$segment_array[$i]";
        }

exit;

#subs below this point
#************************
#-------------------------
# no empty prototype here: the sub is called with two arguments
sub ERROR {
        print "Server can't $_[0] the $_[1] \n";
}

#----------------------------
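
For comparison, here is a minimal sketch of a layer order that is often suggested for files like this one. It is close to the third commented-out open in the script above, but uses a lexical filehandle and the BOM-aware UTF-16 encoding name. The idea is that :raw starts from plain bytes, :encoding(UTF-16) turns the byte pairs into characters (using the BOM to pick the byte order), and :crlf comes last so that CR/LF translation works on characters rather than on bytes. The temp file and its two sample lines are made up for illustration; they only stand in for TestFile1.svd.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small stand-in file (hypothetical data): BOM FF FE, then UTF-16LE
# text with CR/LF line endings, like the hex dump in the post.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xFF\xFE",
    encode('UTF-16LE', "\@AirMagnet Survey Data\r\n#Type: passive\r\n");
close $tmp_fh or die "close failed: $!";

# :raw            start from plain bytes
# :encoding(...)  decode UTF-16; the BOM selects the byte order
# :crlf           translate CR/LF to "\n" after decoding
open my $fh, '<:raw:encoding(UTF-16):crlf', $tmp_name
    or die "Can't open $tmp_name: $!";
my @lines = <$fh>;
close $fh;

chomp @lines;    # each element is now one line without its line ending
print scalar(@lines), " lines read\n";
print "[$_]\n" for @lines;
```

The brackets in the last print make any stray carriage return or leftover BOM character visible, which is a quick way to check whether the layers did what was intended on a given system.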

I don't know how to post the file and keep its encoding, so below is the beginning of the file as displayed by vi in hex mode.


0000000: fffe 4000 4100 6900 7200 4d00 6100 6700  ..@.A.i.r.M.a.g.
0000010: 6e00 6500 7400 2000 5300 7500 7200 7600  n.e.t. .S.u.r.v.
0000020: 6500 7900 2000 4400 6100 7400 6100 0d00  e.y. .D.a.t.a...
0000030: 0a00 2300 5400 7900 7000 6500 3a00 2000  ..#.T.y.p.e.:. .
0000040: 7000 6100 7300 7300 6900 7600 6500 0d00  p.a.s.s.i.v.e...
0000050: 0a00 2300 4100 7000 7000 2000 5600 6500  ..#.A.p.p. .V.e.
0000060: 7200 7300 6900 6f00 6e00 3a00 2000 3800  r.s.i.o.n.:. .8.
0000070: 2e00 3200 2000 0900 2000 4200 7500 6900  ..2. ... .B.u.i.
0000080: 6c00 6400 3a00 2000 3200 3500 3400 3600  l.d.:. .2.5.4.6.
0000090: 3000 0d00 0a00 2300 4300 7200 6500 6100  0.....#.C.r.e.a.
00000a0: 7400 6500 6400 2000 6f00 6e00 3a00 2000  t.e.d. .o.n.:. .
00000b0: 3000 3900 3a00 3100 3300 3a00 3400 3700  0.9.:.1.3.:.4.7.
00000c0: 2000 3000 3400 2f00 3100 3000 2f00 3200   .0.4./.1.0./.2.
00000d0: 3000 3100 3200 0d00 0a00 2300 4300 6100  0.1.2.....#.C.a.
00000e0: 7200 6400 2000 4e00 6100 6d00 6500 2a00  r.d. .N.a.m.e.*.
00000f0: 3a00 2000 5500 6200 6900 7100 7500 6900  :. .U.b.i.q.u.i.
0000100: 7400 6900 2000 4e00 6500 7400 7700 6f00  t.i. .N.e.t.w.o.
0000110: 7200 6b00 7300 2000 5300 5200 2d00 3700  r.k.s. .S.R.-.7.
0000120: 3100 2d00 5500 5300 4200 2000 5700 6900  1.-.U.S.B. .W.i.
0000130: 7200 6500 6c00 6500 7300 7300 2000 4100  r.e.l.e.s.s. .A.
0000140: 6400 6100 7000 7400 6500 7200 2000 3000  d.a.p.t.e.r. .0.
0000150: 3000 3a00 3100 3500 3a00 3600 4400 3a00  0.:.1.5.:.6.D.:.
0000160: 3800 3400 3a00 4500 3100 3a00 4600 4100  8.4.:.E.1.:.F.A.
0000170: 0900 2000 4f00 5300 5600 6500 7200 7300  .. .O.S.V.e.r.s.
0000180: 6900 6f00 6e00 3a00 2000 3600 2e00 3100  i.o.n.:. .6...1.

When I run the program, the output of print "@segment_array" looks like this:


@AirMagnet Survey Data
 #Type: passive
 #App Version: 8.2      Build: 25460
 #Created on: 09:13:47 04/10/2012
 #Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA     OSVersion: 6.100002 1
 #Antenna Angle: 0.000000, Antenna Type: 
 #dim_X, dim_Y, GPS Map
 &,6351.008789,3142.447021, 1
 #Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
 1334063627,4144.148438,1767.801514,11,'xfinitywifi','C4:0A:CB:68:B9:81',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000

But the second section (the output of the for loop) ALWAYS looks like this. Alternate lines are missing:

#App Version: 8.2      Build: 25460
#Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA     OSVersion: 6.100002 1
#dim_X, dim_Y, GPS Map
#Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
1334063627,4144.148438,1767.801514,11,'optimumwifi','C4:0A:CB:68:B9:80',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,6,'Smithtown','0C:D5:02:68:50:3F',-87,12,-94,1,'802.11g','0C:D5:02:68:50:3F','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,1,0,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,11,'Unknown','98:FC:11:90:FA:D0',-89,9,-94,1,'802.11gn','98:FC:11:90:FA:D0','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
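
One thing worth checking in the loop from the script: it calls shift (which removes the first element and renumbers the rest) and then prints $segment_array[$i] with an increasing $i. Relative to the original array, the printed position therefore moves forward two places per pass, which would match the output above starting at "#App Version" and skipping every other line. Below is a minimal sketch with made-up data that reproduces this pattern and contrasts it with a plain foreach. It illustrates that one possible cause only and says nothing about the encoding layers.

```perl
use strict;
use warnings;

# Made-up stand-ins for the lines of the file: line0 .. line9.
my @lines = map "line$_", 0 .. 9;

# Pattern from the post: shift AND an increasing index on the same array.
my @copy = @lines;
my @seen_shift_index;
for (my $i = 1; $i <= 3; $i++) {
    shift @copy;                         # drops the current first element
    push @seen_shift_index, $copy[$i];   # the index advances too, so a line is skipped
}

# Plain iteration: every line is visited once and nothing is removed.
my @seen_foreach;
for my $line (@lines) {
    push @seen_foreach, $line;
}

print "shift+index: @seen_shift_index\n";   # prints: shift+index: line2 line4 line6
print "foreach:     @seen_foreach\n";
```

Using either shift on its own (while the array is non-empty) or an index on its own would visit every line; combining the two is what produces the gaps in this sketch.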

In reply to Problems parsing UTF16 file by stu23
