Re: Extract data from non-standard .xml file

Re^2: Extract data from non-standard .xml file by BillKSmith (Monsignor) on Jan 21, 2015 at 20:28 UTC
Thanks for the suggestions. My first attempt at the extraction worked perfectly, but my encoding conversion did not, so I decided to try doing the conversion before the extraction. I opened the file with a mode of "<:encoding(UCS2)", positioned the file a few characters before the start of the XML, and slurped the rest of the file into a string. As long as the file is positioned at an even-numbered offset, all the XML is read correctly except for the first "<" (which is not preceded by a NULL byte). Now the extraction is simple, and I can prepend the missing "<". The resulting string can be parsed with any of several modules. I would prefer to do the extraction first. Can you tell me how to do the conversion? Bill
Re^3: Extract data from non-standard .xml file by Anonymous Monk on Jan 21, 2015 at 21:28 UTC
Try `":raw:encoding(UTF-16LE)"` or `":raw:perlio:encoding(UTF-16LE)"`, because of what this shows: `$ perl -MEncode -MData::Dump - print join q/ /, split/(....)/, q{00}.unpack q{H*}, encode( q{UTF-16LE}, qq{\x{FEFF}<Root>} ); __END__ 00ff fe3c 0052 006f 006f 0074 003e 00` fffe is the BOM for UTF-16LE; it's the character right before "<".
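A minimal sketch of that suggestion, assuming the XML is UTF-16LE and starts with a BOM at a known even byte offset. The temporary file, the 4-byte binary header, and the `<Root>` content are made up for illustration; they are not from the original file.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Build a stand-in file: 4 bytes of binary junk, then BOM + XML in UTF-16LE.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\x01\x02\x03\x04",
    encode('UTF-16LE', "\x{FEFF}<Root>text</Root>");
close $tmp_fh or die "close: $!";

# Open with the suggested layers, so bytes are decoded as UTF-16LE on read.
open my $fh, '<:raw:encoding(UTF-16LE)', $path or die "open: $!";

# Skip the binary header (assumed offset; it must be even).
seek $fh, 4, SEEK_SET or die "seek: $!";

# Slurp the rest as characters and drop the leading BOM character.
my $xml = do { local $/; <$fh> };
close $fh;
$xml =~ s/^\x{FEFF}//;

print "$xml\n";
```

Because the layer is little-endian, the "<" that follows the BOM is decoded correctly rather than lost. The weakness is the hard-coded offset: it has to be updated whenever the binary header changes size.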
Re^4: Extract data from non-standard .xml file (utf bom) by Anonymous Monk on Jan 21, 2015 at 21:36 UTC
`$ perl -MEncode -MData::Dump - $f{$_}=encode($_,qq{\x{feff}}) for grep/^ut/i, Encode->encodings(q{:all}); dd\%f; __END__ { "UTF-16" => "\xFE\xFF\xFE\xFF", "UTF-16BE" => "\xFE\xFF", "UTF-16LE" => "\xFF\xFE", "UTF-32" => "\0\0\xFE\xFF\0\0\xFE\xFF", "UTF-32BE" => "\0\0\xFE\xFF", "UTF-32LE" => "\xFF\xFE\0\0", "UTF-7" => "+/v8-", "utf-8-strict" => "\xEF\xBB\xBF", "utf8" => "\xEF\xBB\xBF", }`
Re^4: Extract data from non-standard .xml file by BillKSmith (Monsignor) on Jan 25, 2015 at 02:39 UTC
Your suggestion worked perfectly for a few days. Then the block of binary data grew, so the BOM character was no longer in the same place, and when I seek to the old location the read fails. Sure, I can fix the seek position and it will work for a while again. I still think that I need Choroba's original approach: read the file in binary, extract the XML with a regular expression, convert the XML string to ASCII, and then parse. I am not able to do the conversion step. Bill
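One way the extract-then-convert approach might look, as a sketch under stated assumptions: the XML is stored as UTF-16LE, and the root element name is known. The helper `extract_utf16le_xml`, the root name `Root`, and the sample blob are hypothetical. The idea is to encode the start and end tags the same way the file stores them, match on raw bytes, and only then call `Encode::decode`, so no seek offset is needed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: find a UTF-16LE XML document inside a binary blob
# and return it as a decoded Perl character string (undef if not found).
sub extract_utf16le_xml {
    my ($bytes, $root) = @_;

    # The tags as they appear on disk: each ASCII character followed by \0.
    my $open  = encode('UTF-16LE', "<$root");
    my $close = encode('UTF-16LE', "</$root>");

    # Match on raw bytes; \Q...\E keeps the tag bytes literal.
    return unless $bytes =~ /(\Q$open\E.*?\Q$close\E)/s;
    my $xml_bytes = $1;

    # The conversion step: UTF-16LE bytes to Perl characters.
    return decode('UTF-16LE', $xml_bytes);
}

# Made-up sample data: binary junk, BOM, XML, more binary junk.
my $blob = "\x01\x02\xFF\x00"
    . encode('UTF-16LE', "\x{FEFF}<Root><a>1</a></Root>")
    . "\x00\x07";

my $xml = extract_utf16le_xml($blob, 'Root');
print "$xml\n";    # prints <Root><a>1</a></Root>
```

The real file would be read with `binmode` (or a `<:raw` open mode) and slurped into `$blob`. If a parser needs bytes rather than characters, `encode('UTF-8', $xml)` should produce them. This sketch skips any `<?xml ...?>` declaration that precedes the root element.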