Extract data from non-standard .xml file

BillKSmith has asked for the wisdom of the Perl Monks concerning the following question:

Several years ago, I wrote a program under Windows XP that fetches scores for the FreeCell game and computes additional statistics. This program does not work on my 64-bit Windows 7 system because Microsoft has moved these scores from the registry to a file.

With a little help from Google, I have found that this data is stored in an XML format and appended to a large (approx 72k) block of binary data. Characters are encoded as 2-byte Unicode (ASCII code in the first byte, null in the second). The sample below includes one of the required parameters: <GamesPlayed>451</GamesPlayed>.

I am able to read the file as a binary file, find and extract the data with a regular expression, and concatenate the digits to form the numeric result.
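For readers following along, the raw-byte approach described above might look roughly like the sketch below. It is not the original program: the helper names `wide` and `raw_value` are made up for illustration, and the sample buffer is synthetic stand-in data rather than a real FreeCellSettings.xml.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Synthetic stand-in for the real file (an assumption for illustration):
# some binary junk followed by XML text stored as UTF-16LE.
my $xml   = "<Root><Stats><GamesPlayed>451</GamesPlayed></Stats></Root>";
my $bytes = "\x89PNG\x00\x01\x02" . encode('UTF-16LE', $xml);

# Hypothetical helper: turn an ASCII string into a regex fragment that
# expects a null byte after every character.
sub wide {
    my ($ascii) = @_;
    return join '', map { quotemeta($_) . '\x00' } split //, $ascii;
}

# Hypothetical helper: find <Tag>digits</Tag> in the raw bytes and
# return the digits with the interleaved nulls removed.
sub raw_value {
    my ($buf, $tag) = @_;
    my $open  = wide("<$tag>");
    my $close = wide("</$tag>");
    my ($digits) = $buf =~ /${open}((?:[0-9]\x00)+)${close}/ or return;
    $digits =~ tr/\0//d;    # concatenate the digits
    return $digits;
}

my $games_played = raw_value($bytes, 'GamesPlayed');
print "GamesPlayed = $games_played\n";
```

This works for a known tag, but every new field needs its own pattern, which is the inflexibility the question is about.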

I would like a cleaner, more flexible approach. Note: I only need 'read' access. There is no requirement to change the scores.

Several XML modules return a hash of the xml data. This would be perfect but I have not found one that can skip over the irrelevant binary data and handle the backward character encoding. I do not know if the block of binary data has a fixed length. For now, I must assume that it does not.

Here is a hex dump of the start of the xml (made with xxd command of vim):

"C:\Users\Bill\AppData\Local\Microsoft Games\FreeCell\FreeCellSettings.xml"

00119c0: b324 2d18 bb60 6f83 47f8 8b61 648b 70e1  .$-..`o.G..ad.p.
00119d0: fdf3 f7ff 070e 990a 8199 f3a1 b500 0000  ................
00119e0: 0049 454e 44ae 4260 8216 0600 00ff fe3c  .IEND.B`.......<
00119f0: 0052 006f 006f 0074 003e 000a 0020 0020  .R.o.o.t.>... . 
0011a00: 0020 0020 003c 0053 0074 0061 0074 0073  . . .<.S.t.a.t.s
0011a10: 003e 000a 0020 0020 0020 0020 0020 0020  .>... . . . . . 
0011a20: 0020 0020 003c 0056 0065 0072 0073 0069  . . .<.V.e.r.s.i
0011a30: 006f 006e 003e 0030 003c 002f 0056 0065  .o.n.>.0.<./.V.e
0011a40: 0072 0073 0069 006f 006e 003e 000a 0020  .r.s.i.o.n.>... 
0011a50: 0020 0020 0020 0020 0020 0020 0020 003c  . . . . . . . .<
0011a60: 0047 0061 006d 0065 0073 0050 006c 0061  .G.a.m.e.s.P.l.a
0011a70: 0079 0065 0064 003e 0034 0035 0031 003c  .y.e.d.>.4.5.1.<
0011a80: 002f 0047 0061 006d 0065 0073 0050 006c  ./.G.a.m.e.s.P.l
0011a90: 0061 0079 0065 0064 003e 000a 0020 0020  .a.y.e.d.>... .

Bill


Re: Extract data from non-standard .xml file by choroba (Cardinal) on Jan 20, 2015 at 13:32 UTC
Can't you just extract the part between `<.R.o.o.t.>` and `<./.R.o.o.t.>`, convert it from UTF-16 to UTF-8, and use XML::LibXML, XML::Twig, or whatever convenient module you love to process XML with? لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
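A minimal sketch of that suggestion, assuming the whole file fits in memory and that the root element is literally named `Root` as in the hex dump. The sub name `extract_root_xml` and the sample bytes are invented for illustration; only the core Encode module is used, and the parser call is left as a comment because it needs a non-core module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: cut out the bytes from <Root> through </Root>
# (both stored as UTF-16LE) and decode them into a Perl string.
sub extract_root_xml {
    my ($bytes) = @_;
    my $open  = encode('UTF-16LE', '<Root>');
    my $close = encode('UTF-16LE', '</Root>');

    my $start = index($bytes, $open);
    return if $start < 0;
    my $end = index($bytes, $close, $start);
    return if $end < 0;

    my $raw = substr($bytes, $start, $end + length($close) - $start);
    return decode('UTF-16LE', $raw);
}

# Synthetic stand-in for the real file: binary junk, BOM, then the XML.
my $file = "\x89PNG binary junk\x00\xFF" . "\xFF\xFE"
         . encode('UTF-16LE',
             "<Root>\n  <Stats>\n    <GamesPlayed>451</GamesPlayed>\n  </Stats>\n</Root>\n");

my $xml = extract_root_xml($file);
print $xml, "\n";

# The decoded string can then be handed to an XML module, for example:
# my $doc = XML::LibXML->load_xml(string => $xml);
```

Because the extraction starts at `<Root>` itself, the BOM in front of it is skipped and the binary block never reaches the decoder.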
Re^2: Extract data from non-standard .xml file by BillKSmith (Monsignor) on Jan 21, 2015 at 20:28 UTC
Thanks for the suggestions. My first attempt at the extraction worked perfectly, but my encoding conversion did not. I decided to try to do the conversion before the extraction. I opened the file with a mode of "<:encoding(UCS2)", positioned the file a few characters before the start of the XML, and slurped the rest of the file into a string. As long as the file is positioned at an even-numbered offset, all the XML is read correctly except for the first "<" (which is not preceded by a NULL byte). Now the extraction is simple and I can prepend the missing "<". The resulting string can be parsed with any of several modules. I would prefer to do the extraction first. Can you tell me how to do the conversion?

Bill
Re^3: Extract data from non-standard .xml file by Anonymous Monk on Jan 21, 2015 at 21:28 UTC
Try `":raw:encoding(UTF-16LE)"` or `":raw:perlio:encoding(UTF-16LE)"`, because

$ perl -MEncode -MData::Dump -
print join q/ /, split/(....)/, q{00}.unpack q{H*}, encode( q{UTF-16LE}, qq{\x{FEFF}<Root>} );
__END__
00ff fe3c 0052 006f 006f 0074 003e 00

`ff fe` is the BOM for UTF-16LE; it's the character right before "<".
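Here is a rough sketch of how that layer could be used. It assumes the XML is introduced by the byte sequence `ff fe 3c 00` (BOM plus "<") and runs to the end of the file; neither is confirmed beyond the one hex dump in this thread. The sub names and the temporary sample file are made up so the example runs without a real FreeCellSettings.xml.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small stand-in file: 23 bytes of junk, a BOM, then UTF-16LE XML.
# The odd junk length puts the XML at an odd offset, as in the hex dump.
my $sample_xml = "<Root>\n  <Stats>\n    <GamesPlayed>451</GamesPlayed>\n  </Stats>\n</Root>\n";
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\x89PNG\r\n\x1a\n" . ("\x00\x01\x02" x 5)
           . "\xFF\xFE" . encode('UTF-16LE', $sample_xml);
close $tmp or die "close: $!";

# Hypothetical helper: byte offset of the first character after the BOM.
sub find_xml_offset {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $pos = index($bytes, "\xFF\xFE<\x00");
    die "no UTF-16LE BOM followed by '<' found\n" if $pos < 0;
    return $pos + 2;    # skip the 2-byte BOM
}

# Hypothetical helper: reopen with the decoding layer, seek to the XML,
# and let the layer convert everything from there on.
sub read_xml_text {
    my ($file) = @_;
    my $offset = find_xml_offset($file);
    open my $fh, '<:raw:encoding(UTF-16LE)', $file or die "open $file: $!";
    seek $fh, $offset, 0 or die "seek: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}

print read_xml_text($path);
```

Seeking to the byte just after the BOM keeps the 2-byte units aligned with the text, so the first "<" is decoded like every other character. The leading `:raw` matters on Windows, where the default text-mode layer would otherwise alter bytes before they are decoded.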
Re^4: Extract data from non-standard .xml file (utf bom) by Anonymous Monk on Jan 21, 2015 at 21:36 UTC
Re^4: Extract data from non-standard .xml file by BillKSmith (Monsignor) on Jan 25, 2015 at 02:39 UTC
Re: Extract data from non-standard .xml file by Mr. Muskrat (Canon) on Jan 29, 2015 at 19:05 UTC
According to this post on w7forums.com, the file has an embedded PNG image, followed by what may or may not be a checksum, before the Unicode XML content begins. HTH!
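If the binary block really is a PNG, its end can be located without guessing at lengths, because every PNG stream ends with the same 8 bytes: the chunk type `IEND` followed by its CRC `ae 42 60 82`, both visible in the hex dump. The sketch below also assumes, based only on that one dump, that the next 4 bytes (`16 06 00 00`) are a little-endian length for the BOM-prefixed XML that follows. That reading is a guess, and the sub name and sample data are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Chunk type "IEND" plus its CRC: the last 8 bytes of every PNG stream.
my $PNG_TRAILER = "IEND\xAE\x42\x60\x82";

# Hypothetical helper. Assumed layout after the PNG (a guess from one dump):
# 4-byte little-endian length, then that many bytes of BOM + UTF-16LE XML.
sub xml_after_png {
    my ($bytes) = @_;
    my $pos = index($bytes, $PNG_TRAILER);
    return if $pos < 0;

    my $after = $pos + length $PNG_TRAILER;
    return if length($bytes) < $after + 4;

    my $len = unpack 'V', substr($bytes, $after, 4);
    my $raw = substr($bytes, $after + 4, $len);
    return if length($raw) != $len;     # file shorter than the length claims

    $raw =~ s/\A\xFF\xFE//;             # drop the BOM if present
    return decode('UTF-16LE', $raw);
}

# Synthetic stand-in for the real file.
my $xml  = "<Root><Stats><GamesPlayed>451</GamesPlayed></Stats></Root>";
my $body = "\xFF\xFE" . encode('UTF-16LE', $xml);
my $file = "\x89PNG\r\n\x1a\n...pixels..." . $PNG_TRAILER
         . pack('V', length $body) . $body . "trailing bytes";

print xml_after_png($file), "\n";
```

Using the stored length rather than reading to the end of the file means any bytes after the XML are ignored instead of being fed to the decoder.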
Re^2: Extract data from non-standard .xml file by BillKSmith (Monsignor) on Jan 29, 2015 at 22:53 UTC
Thanks for the link; it may be useful in the future. So far, I only have to read the data, not edit it. I am now able to extract the XML with a regex, decode it into ASCII with the Encode module (using 'UTF16LE'), and parse the ASCII string with the module XML::Bare. I will now change my documentation to refer to the "block of binary data" as the "PNG image".

Bill
Re: Extract data from non-standard .xml file by Anonymous Monk on Jan 30, 2015 at 13:31 UTC
I am basing the following guess on the only FreeCellSettings.xml file I have to work with, so I may be way off, but it looks to me like the offset to the XML data is stored at byte 8 of the file, and the XML data begins with a 32-bit value representing its length. Try this; it works on my file:

#!/usr/bin/env perl
use warnings;
use strict;
use open qw/:std :utf8/;

open my $fh, '<:raw', 'FreeCellSettings.xml' or die $!;

# Guess: byte 8 holds a little-endian 32-bit offset to the XML section.
seek $fh, 8, 0 or die;
read($fh, my $off, 4) == 4 or die;
$off = unpack 'V', $off;
printf STDERR "Offset: 0x%X\n", $off;

# Guess: the section starts with a little-endian 32-bit length.
seek $fh, $off, 0 or die;
read($fh, my $len, 4) == 4 or die;
$len = unpack 'V', $len;
printf STDERR "Length: 0x%X\n", $len;

# Read exactly that many bytes and warn if the file is shorter.
my $act_len = read($fh, my $data, $len);
printf STDERR "Actually read 0x%X bytes\n", $act_len;
warn "WARNING: Differing lengths, expected $len, actual $act_len"
    unless $act_len==$len;
close $fh;

# Decode the UTF-16LE bytes; FB_CROAK dies on malformed input.
use Encode qw/decode FB_CROAK/;
my $xml = decode("UTF-16LE", $data, FB_CROAK);
print $xml;