donp has asked for the wisdom of the Perl Monks concerning the following question:

Hallo PerlMonks,

I just found perlmonks.org when googling for a solution to a special problem. I'm not an experienced Perl programmer and English is not my native language, so please have mercy on me...

My problem: I've got large XML files (about 10MB each) which I'd like to split into smaller chunks and later merge back into big ones.

xml_split from the XML::Twig module does a good job when used with option -c (condition), but due to the structure of some of my XML files this sometimes gives me thousands of small chunks, which is far too many.

Options -s (chunk size) or -g (group certain tags) seem to be the best choice, but unfortunately all text nodes are lost once xml_split has processed my data. Only the XML tags are left.

The reason for that seems to be that xml_split does not use XML::Twig when options -s and -g are active. Instead it uses XML::Parser directly. In this case, XML::Parser neither calls text handlers nor default handlers, but why? It seems to just delete or skip all text for some strange reason.
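One way to check that claim outside of xml_split is a small standalone test along these lines. This is only a sketch: the sample XML string and the variable names are made up for illustration, and it does not reproduce what xml_split does internally. It registers a Char handler on one parser and only a Default handler on another, then reports how much text each one saw.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document; any small file with text nodes would do.
my $xml = '<doc><item>first text</item><item>second text</item></doc>';

my @char_data;          # pieces of text passed to the Char handler
my $default_out = '';   # everything collected via the Default handler

# Parser 1: a Char handler only. It may be called more than once per
# text node, so the pieces are collected and joined afterwards.
my $char_parser = XML::Parser->new(
    Handlers => {
        Char => sub {
            my ($expat, $string) = @_;
            push @char_data, $string;
        },
    },
);
$char_parser->parse($xml);

# Parser 2: a Default handler only. Events without a handler of their
# own are passed here; original_string asks expat for the verbatim
# source text of the current event.
my $default_parser = XML::Parser->new(
    Handlers => {
        Default => sub {
            my ($expat, $string) = @_;
            my $orig = $expat->original_string;
            $default_out .= $orig if defined $orig;
        },
    },
);
$default_parser->parse($xml);

my $text = join '', @char_data;
print "Char handler saw:    '$text'\n";
print "Default handler saw: '$default_out'\n";
print "Input document was:  '$xml'\n";
```

If the second line comes out empty or without the text between the tags while the first line shows it, the problem would be in what original_string returns rather than in the handlers not being called.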

I can hardly believe that no one else has stumbled upon or even solved this problem so far.

So: Does anyone know about a solution for this or perhaps about some different XML splitting tool/module I could use? I've been searching for quite a while but only found that "xml_split".

Thanks for any help,
donp

Replies are listed 'Best First'.
Re: Tool "xml_split" from XML::Twig removes all text
by mirod (Canon) on May 30, 2008 at 17:36 UTC

    Hum, that's weird.

    Could you post a description of the problem (this post is fine) and a (small) XML file that shows the problem to the RT queue for the module? That's the best way to report problems, so I can't forget about them.

    Thanks

Re: Tool "xml_split" from XML::Twig removes all text
by Zen (Deacon) on May 30, 2008 at 21:50 UTC
    Your English is great, by the way.
Re: Tool "xml_split" from XML::Twig removes all text
by jurple (Initiate) on Sep 08, 2010 at 10:06 UTC
    It seems this is still the case. Bother. I don't see any mention of this in RT, so I'll post it there.
    At first glance, it appears XML::Parser::original_string does not return anything for text nodes. Or the Default handler doesn't get called for text nodes. Or I'm wrong.

    Cheers,
    Jurgen
      I can confirm it still strips text nodes when using those switches.

        Could you tell me the version of expat that you are using? I couldn't really follow up on the bug with the original poster on RT as they didn't provide me with this information. Thanks