Wappel has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to split an XML file with the help of processing instructions. Each output file should contain everything between two <?split ?> PIs. The file name should contain the contents of the first and the last <no> element. Example:
<?split ?> <h1>text</h1> <text> Textinhalt <no>4</no> </text> text ... text <no>18</no> <h6>text</h6> <?split ?> <text>...</text>
In the above example the file name should be test-nr4to18.xml and the file should contain:
<h1>text</h1> <text> Textinhalt <no>4</no> </text> text ... text <no>18</no> <h6>text</h6>
I would like to use an XML module (e.g. XML::Twig), but how do I get the contents between the PIs into a node set for further processing? Can anybody help me? Thanks, Wolfgang

edited: Wed Jun 23 14:37:14 2004 by jeffa - title change s/xmlf/XML/

Re: Splitting XML file on Processing Instructions
by BrowserUk (Patriarch) on Jun 23, 2004 at 16:16 UTC

    Setting $/ = '<?split ?>'; will allow you to read the file in chunks, one per <?split ?>-delimited block. Then it's a matter of deciding what each filename should be, opening that file and writing the chunk out.
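
A minimal sketch of the $/ approach described above, using only core Perl. The sample string and the in-memory filehandle are assumptions made for illustration; a real script would open the XML file instead.

```perl
use strict;
use warnings;

# Sample input held in a string (an assumption for this sketch).
my $xml = '<?split ?><h1>text</h1><no>4</no><no>18</no>'
        . '<?split ?><text>...</text>';

# An in-memory filehandle stands in for the file on disk.
open my $in, '<', \$xml or die "Cannot open in-memory handle: $!";

my @chunks;
{
    local $/ = '<?split ?>';          # each "line" now ends at a split PI
    while (my $chunk = <$in>) {
        chomp $chunk;                 # chomp removes the current value of $/
        next unless $chunk =~ /\S/;   # skip the empty chunk before the first PI
        push @chunks, $chunk;
    }
}
close $in;

print "$_\n" for @chunks;
```

Note that read this way, whatever follows the last <?split ?> also comes back as a chunk, so the caller has to decide whether that trailing piece deserves a file of its own.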


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Splitting XML file on Processing Instructions
by iburrell (Chaplain) on Jun 23, 2004 at 16:33 UTC
    It is quite possible that a block split by processing instructions is not a node set. The PIs could occur anywhere and split elements.

    You should be able to split the XML file. I would think using handlers instead of twigs would make the most sense. Have the processing instruction handler open a new split file, and have the other handlers write to the split file.

Re: Splitting XML file on Processing Instructions
by pbeckingham (Parson) on Jun 23, 2004 at 17:56 UTC

    You would need to add file I/O error handling, and perhaps handle cases of missing tags. I assumed an input file named data. I assumed that the value inside <h1>...</h1> was used in the file name, because there is no reference to 'test' in the file, and perhaps you meant 'text' instead. But despite the caveats, this does something like what you wanted:

    #! /usr/bin/perl -w
    use strict;

    my $text;
    if (open INPUT, '<data') {
        local $/;
        $text = <INPUT>;
        close INPUT;
    }

    while ($text =~ /<\?split \?>(.*?)(?=<\?split \?>)/sg) {
        my $fragment = $1;
        my ($h1) = $fragment =~ /<h1>(.*?)<\/h1>/is;
        my ($from, $to) = $fragment =~ /<no>(.*?)<\/no>/isg;
        if (open OUTPUT, ">${h1}-nr${from}to${to}.xml") {
            print OUTPUT $fragment;
            close OUTPUT;
        }
    }

    exit 0;

      Thanks for your advice! Your code splits the file correctly, but it only picks up the first two <no> elements for the file name. If I add a third (or more) <no> element, $from still holds the contents of the first <no> element and $to the contents of the second, whereas I need the contents of the first and the last <no> element. For the following example the file name should be text-nr4to20.xml:
      <?split ?> <h1>text</h1> <text> Textinhalt <no>4</no> </text> text ... text <no>18</no> <no>19</no> <no>20</no> <h6>text</h6> <?split ?>
      Maybe it's possible to write the contents of the <no> elements to an array and get the first with $a[0] and the last with $a[$#a], but I didn't manage to change the code accordingly. Maybe you can help me once more?

        No problem. This just requires that the contents of all the <no>...</no> elements are captured into an array. The last element is then $numbers[-1], which is a shorter way of writing $numbers[$#numbers].

        #! /usr/bin/perl -w
        use strict;

        # Slurp the whole input file into $text.
        my $text;
        if (open INPUT, '<data') {
            local $/;
            $text = <INPUT>;
            close INPUT;
        }

        # Each match is the text between one <?split ?> and the next.
        while ($text =~ /<\?split \?>(.*?)(?=<\?split \?>)/sg) {
            my $fragment = $1;
            my ($h1) = $fragment =~ /<h1>(.*?)<\/h1>/is;
            # List context with /g collects the contents of every <no> element.
            my @numbers = $fragment =~ /<no>(.*?)<\/no>/isg;
            if (open OUTPUT, ">${h1}-nr$numbers[0]to$numbers[-1].xml") {
                print OUTPUT $fragment;
                close OUTPUT;
            }
        }

        exit 0;
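
The first/last lookup can be tried on its own. The sketch below runs the same list-context match against the four-<no> fragment from the follow-up question. The hard-coded "text" prefix and the fallback name for a fragment without any <no> elements are assumptions added for illustration, not part of the code above.

```perl
use strict;
use warnings;

# The fragment from the follow-up question, held in a string.
my $fragment = '<h1>text</h1> <text> Textinhalt <no>4</no> </text> '
             . 'text ... text <no>18</no> <no>19</no> <no>20</no> <h6>text</h6>';

# In list context, a /g match returns every capture, in document order.
my @numbers = $fragment =~ /<no>(.*?)<\/no>/isg;

# Index 0 is the first capture, index -1 the last.
# The fallback name is hypothetical; it only avoids undef warnings.
my $name = @numbers
    ? "text-nr$numbers[0]to$numbers[-1].xml"
    : 'text-unnumbered.xml';

print "$name\n";   # prints text-nr4to20.xml
```

With a single <no> element, $numbers[0] and $numbers[-1] refer to the same value, so the name would read for example text-nr4to4.xml.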