BillKSmith has asked for the wisdom of the Perl Monks concerning the following question:

Several years ago, I wrote a program under Windows XP that fetches scores for the FreeCell game and computes additional statistics. This program does not work on my 64-bit Windows 7 system because Microsoft has moved these scores from the registry to a file.

With a little help from Google, I have found that this data is stored in an XML format and appended to a large (approx. 72k) block of binary data. Characters are encoded as 2-byte Unicode (ASCII code in the first byte, null in the second). The sample below includes one of the required parameters: <GamesPlayed>451</GamesPlayed>.

I am able to read the file as a binary file, find and extract the data with a regular expression, and concatenate the digits to form the numeric result.

I would like a cleaner, more flexible approach. Note: I only need 'read' access. There is no requirement to change the scores.

Several XML modules return a hash of the xml data. This would be perfect but I have not found one that can skip over the irrelevant binary data and handle the backward character encoding. I do not know if the block of binary data has a fixed length. For now, I must assume that it does not.

Here is a hex dump of the start of the xml (made with xxd command of vim):

"C:\Users\Bill\AppData\Local\Microsoft Games\FreeCell\FreeCellSettings.xml"

00119c0: b324 2d18 bb60 6f83 47f8 8b61 648b 70e1  .$-..`o.G..ad.p.
00119d0: fdf3 f7ff 070e 990a 8199 f3a1 b500 0000  ................
00119e0: 0049 454e 44ae 4260 8216 0600 00ff fe3c  .IEND.B`.......<
00119f0: 0052 006f 006f 0074 003e 000a 0020 0020  .R.o.o.t.>... .
0011a00: 0020 0020 003c 0053 0074 0061 0074 0073  . . .<.S.t.a.t.s
0011a10: 003e 000a 0020 0020 0020 0020 0020 0020  .>... . . . . .
0011a20: 0020 0020 003c 0056 0065 0072 0073 0069  . . .<.V.e.r.s.i
0011a30: 006f 006e 003e 0030 003c 002f 0056 0065  .o.n.>.0.<./.V.e
0011a40: 0072 0073 0069 006f 006e 003e 000a 0020  .r.s.i.o.n.>...
0011a50: 0020 0020 0020 0020 0020 0020 0020 003c  . . . . . . . .<
0011a60: 0047 0061 006d 0065 0073 0050 006c 0061  .G.a.m.e.s.P.l.a
0011a70: 0079 0065 0064 003e 0034 0035 0031 003c  .y.e.d.>.4.5.1.<
0011a80: 002f 0047 0061 006d 0065 0073 0050 006c  ./.G.a.m.e.s.P.l
0011a90: 0061 0079 0065 0064 003e 000a 0020 0020  .a.y.e.d.>... .
Bill

Replies are listed 'Best First'.
Re: Extract data from non-standard .xml file
by choroba (Cardinal) on Jan 20, 2015 at 13:32 UTC
    Can't you just extract the part between <.R.o.o.t.> and <./.R.o.o.t.>, convert it from UTF-16 to UTF-8, and process it with XML::LibXML, XML::Twig, or whichever XML module you prefer?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks for the suggestions. My first attempt at the extraction worked perfectly, but my encoding conversion did not. I decided to try to do the conversion before the extraction. I opened the file with a mode of "<:encoding(UCS2)", positioned the file a few characters before the start of the XML, and slurped the rest of the file into a string. As long as the file is positioned to an even-numbered position, all the XML is read correctly except for the first "<" (which is not preceded by a NULL byte). Now the extraction is simple and I can prepend the missing "<". The resulting string can be parsed with any of several modules. I would prefer to do the extraction first. Can you tell me how to do the conversion?
      Bill
        Try ":raw:encoding(UTF-16LE)" or ":raw:perlio:encoding(UTF-16LE)" because

        $ perl -MEncode -MData::Dump -
        print join q/ /, split/(....)/, q{00}.unpack q{H*}, encode( q{UTF-16LE}, qq{\x{FEFF}<Root>} );
        __END__
        00ff fe3c 0052 006f 006f 0074 003e 00

        fffe is the BOM for UTF-16LE; it's the character right before "<".
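
To illustrate the BOM point above, here is a minimal sketch using only the core Encode module. The sample bytes are built in memory (an assumption for illustration, mirroring the `ff fe 3c 00 ...` sequence in the hex dump) rather than read from the real file. It shows that the explicit `UTF-16LE` decoder keeps the BOM as an ordinary U+FEFF character, while the BOM-aware `UTF-16` decoder uses it to pick the byte order and then drops it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: BOM (ff fe) followed by "<Root>" in UTF-16LE,
# constructed here instead of being read from FreeCellSettings.xml.
my $bytes = encode('UTF-16LE', "\x{FEFF}<Root>");

# Explicit endianness: the BOM survives as the character U+FEFF.
my $le = decode('UTF-16LE', $bytes);

# BOM-aware decoder: the BOM selects the byte order and is consumed.
my $auto = decode('UTF-16', $bytes);

# With the explicit-endian decoder, a leading U+FEFF can be stripped by hand.
(my $clean = $le) =~ s/\A\x{FEFF}//;

printf "UTF-16LE: %d chars, first is U+%04X\n", length($le),   ord($le);
printf "UTF-16:   %d chars, first is U+%04X\n", length($auto), ord($auto);
```

Either route should yield the same text; which one fits depends on whether the extracted bytes still include the `ff fe` prefix.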
Re: Extract data from non-standard .xml file
by Mr. Muskrat (Canon) on Jan 29, 2015 at 19:05 UTC

    According to this post on w7forums.com, it has an embedded PNG image followed by what may or may not be a checksum before the unicode XML content begins. HTH!

      Thanks for the link, it may be useful in the future. So far, I only have to read the data, not edit it. I am now able to extract the XML with a REGEX, decode it into ASCII with the Encode module (using 'UTF16LE'), and parse the ASCII string with the module XML::Bare.

      I will now change my documentation to refer to the "block of binary data" as the "PNG image".

      Bill
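
The extract, decode, parse sequence described in the reply above can be sketched roughly as follows. This is not Bill's actual code: the binary blob is synthetic, and the final step uses a deliberately naive regex over leaf elements as a stand-in for a real parser such as XML::Bare, so that the example needs only core modules.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Synthetic stand-in for the real file: a few binary bytes, a BOM,
# then UTF-16LE XML shaped like the hex dump in the question.
my $xml_text = "<Root>\n  <Stats>\n    <Version>0</Version>\n"
             . "    <GamesPlayed>451</GamesPlayed>\n  </Stats>\n</Root>\n";
my $blob = "\x00\x01\x02IEND\xAE\x42\x60\x82"
         . "\xFF\xFE" . encode('UTF-16LE', $xml_text);

# Byte sequences for the root tags as they appear in UTF-16LE.
my $open  = encode('UTF-16LE', '<Root>');
my $close = encode('UTF-16LE', '</Root>');

# Step 1: extract the raw bytes from <Root> through </Root>.
my ($raw) = $blob =~ /(\Q$open\E.*?\Q$close\E)/s
    or die "No <Root> element found\n";

# Step 2: decode the extracted bytes into a Perl character string.
my $xml = decode('UTF-16LE', $raw, FB_CROAK);

# Step 3 (naive placeholder for a real XML parser): collect leaf
# elements of the form <Name>value</Name> into a hash.
my %stats = $xml =~ m{<(\w+)>([^<]*)</\1>}g;

print "GamesPlayed: $stats{GamesPlayed}\n";
```

Because the match starts at `<Root>` rather than at the BOM, the decoded string has no leading U+FEFF to strip. In real use, `$xml` would be handed to the XML module of choice instead of the regex in step 3.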
Re: Extract data from non-standard .xml file
by Anonymous Monk on Jan 30, 2015 at 13:31 UTC

    I am basing the following guess on the only FreeCellSettings.xml file I have to work with, so I may be way off, but it looks to me like the offset to the XML data is stored at byte 8 of the file, and then the XML data begins with a 32-bit value representing its length. Try this, it works on my file:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use open qw/:std :utf8/;
    use Encode qw/decode FB_CROAK/;

    open my $fh, '<:raw', 'FreeCellSettings.xml' or die $!;

    # Guess: bytes 8..11 hold the offset of the XML section (32-bit little-endian).
    seek $fh, 8, 0 or die;
    read($fh, my $off, 4) == 4 or die;
    $off = unpack 'V', $off;
    printf STDERR "Offset: 0x%X\n", $off;

    # Guess: the XML section starts with its own 32-bit little-endian length.
    seek $fh, $off, 0 or die;
    read($fh, my $len, 4) == 4 or die;
    $len = unpack 'V', $len;
    printf STDERR "Length: 0x%X\n", $len;

    # Read the XML bytes and check that the full length was available.
    my $act_len = read($fh, my $data, $len);
    printf STDERR "Actually read 0x%X bytes\n", $act_len;
    warn "WARNING: Differing lengths, expected $len, actual $act_len"
        unless $act_len==$len;
    close $fh;

    # Decode the UTF-16LE bytes; croak on malformed input.
    my $xml = decode("UTF-16LE", $data, FB_CROAK);
    print $xml;