in reply to Extract data from non-standard .xml file

Can't you just extract the parts between <.R.o.o.t.> and <./.R.o.o.t.>, convert from UTF-16 to UTF-8 and use XML::LibXML or XML::Twig or whatever module you prefer for processing XML?
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: Extract data from non-standard .xml file
by BillKSmith (Monsignor) on Jan 21, 2015 at 20:28 UTC
    Thanks for the suggestions. My first attempt at the extraction worked perfectly, but my encoding conversion did not. I decided to try to do the conversion before the extraction. I opened the file with a mode of "<:encoding(UCS2)", positioned the file a few characters before the start of the XML, and slurped the rest of the file into a string. As long as the file is positioned at an even-numbered offset, all the XML is read correctly except for the first "<" (which is not preceded by a NULL byte). Now the extraction is simple and I can prepend the missing "<". The resulting string can be parsed with any of several modules. I would still prefer to do the extraction first. Can you tell me how to do the conversion?
    Bill
      Try  ":raw:encoding(UTF-16LE)" or  ":raw:perlio:encoding(UTF-16LE)" because
      $ perl -MEncode -MData::Dump -
      print join q/ /, split/(....)/, q{00}.unpack q{H*}, encode( q{UTF-16LE}, qq{\x{FEFF}<Root>} );
      __END__
      00ff fe3c 0052 006f 006f 0074 003e 00
      fffe is the BOM for UTF-16LE; it's the character right before "<".
        $ perl -MEncode -MData::Dump -
        $f{$_}=encode($_,qq{\x{feff}}) for grep/^ut/i, Encode->encodings(q{:all});
        dd\%f;
        __END__
        {
          "UTF-16"       => "\xFE\xFF\xFE\xFF",
          "UTF-16BE"     => "\xFE\xFF",
          "UTF-16LE"     => "\xFF\xFE",
          "UTF-32"       => "\0\0\xFE\xFF\0\0\xFE\xFF",
          "UTF-32BE"     => "\0\0\xFE\xFF",
          "UTF-32LE"     => "\xFF\xFE\0\0",
          "UTF-7"        => "+/v8-",
          "utf-8-strict" => "\xEF\xBB\xBF",
          "utf8"         => "\xEF\xBB\xBF",
        }
        Your suggestion worked perfectly for a few days. Then the length of the block of binary data grew, and the BOM character was no longer in the same place. When I seek to the old location, the read fails. Sure, I can fix the seek position and it will work for a while again. I still think that I need Choroba's original approach: read the file in binary, extract the XML with a regular expression, convert the XML string to ASCII, and then parse. I am not able to do the conversion step.
        Bill
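
The extract-first order asked about above (read raw bytes, match the block with a regular expression, then convert only that block) might look roughly like the sketch below. It assumes the XML is UTF-16LE, as the BOM dump in this thread suggests, and that the root element is literally <Root>...</Root> with no attributes. The helper name extract_root_xml and the sample data are made up for illustration. Because the match works on bytes, it does not depend on any seek position or on where the BOM sits.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: find a UTF-16LE <Root>...</Root> block inside a
# binary blob and return it as a Perl character string (undef if absent).
sub extract_root_xml {
    my ($bytes) = @_;

    # The tags as they appear on disk: each ASCII char followed by a NUL.
    my $open  = encode('UTF-16LE', '<Root>');
    my $close = encode('UTF-16LE', '</Root>');

    # Non-greedy byte match from the first opening tag to the next closing tag.
    my ($xml_bytes) = $bytes =~ /(\Q$open\E.*?\Q$close\E)/s
        or return;

    # Decode only the extracted block; the surrounding binary is never decoded.
    return decode('UTF-16LE', $xml_bytes);
}

# Made-up sample: an odd number of junk bytes, a BOM, the XML, more junk.
my $blob = "\x01\x02\x03"
         . encode('UTF-16LE', "\x{FEFF}<Root><item>42</item></Root>")
         . "\xFF\x00\x7F";

my $xml = extract_root_xml($blob);
print encode('UTF-8', $xml), "\n";   # <Root><item>42</item></Root>
```

For a real file, the blob would come from a handle opened with "<:raw" and slurped whole. The decoded string, or its UTF-8 encoding, could then be handed to whichever parser is preferred. Whether a particular parser wants characters or bytes is worth checking in its documentation.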