Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have a situation where I've concatenated a few thousand xml files together and I need to remove the <root> tags from between the parts, so that the new file is well-formed xml.

I'm using this sort of approach:

perl -pi.bak -e 's/<\/root>\s*<\?xml.*?>\s*<root>//sig' file.xml

However, that does not work because each of those three xml tags above is on a different line in the file. It occurred to me that the /s flag isn't doing its thing because the file is being read one line at a time.

Naturally, my next thought was to undef $/ , but this file is too big to live entirely in memory. Anyone got any ideas about how to remove these repeating chunks from my file:

</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>

Thanks,
Monk

Replies are listed 'Best First'.
Re: Replacing Lines in 100 Gig file
by bgreenlee (Friar) on Dec 17, 2004 at 18:02 UTC

    Ah! A use for perl's flip-flop operator:

    #!/usr/bin/perl -w
    while (<DATA>) {
        next if m|</root>|..m|<root>|;
        print;
    }
    print "</root>\n";
    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <some_xml></some_xml>
    </root>
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <some_xml></some_xml>
    </root>
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <some_xml></some_xml>
    </root>

    -b

      Neat. I think I know what it's doing but I'm not sure why. Could you point me in the direction of the appropriate perldoc that explains the .. foo?

        perldoc perlop. Search for "flip-flop".

        In a nutshell, though, it evaluates to false until the first condition is satisfied, and then evaluates to true up to and including the line on which the second condition is satisfied, after which it goes back to false.
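
        A small sketch of that behaviour, for anyone who wants to watch the operator switch on and off. The sub name drop_between and the sample lines are made up for illustration; only the .. operator itself comes from the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that fall outside every
# </root> ... <root> range, using the scalar-context range operator.
sub drop_between {
    my @kept;
    for my $line (@_) {
        # False until </root> matches, then true up to and including
        # the next line that matches <root>, then false again.
        next if $line =~ m{</root>} .. $line =~ m{<root>};
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample: one document boundary between <a/> and <b/>.
my @sample = (
    '<root>', '<a/>', '</root>',
    '<?xml version="1.0"?>', '<root>', '<b/>',
);
print "$_\n" for drop_between(@sample);
```

        The operator remembers its state from one line to the next, so a final </root> with no <root> after it leaves it switched on until the end of the input. That is why the script above prints the last </root> by hand.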

        -b

Re: Replacing Lines in 100 Gig file
by dragonchild (Archbishop) on Dec 17, 2004 at 17:49 UTC
    Why not make copies of the thousand XML files, remove the root nodes from the copies, then concatenate the files?

    Plus, what use is a 100G file anyways??

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      100G files! Hmm, some of your batch processing guys don't like you ...
Re: Replacing Lines in 100 Gig file
by BrowserUk (Patriarch) on Dec 17, 2004 at 18:08 UTC

    Resisting the temptation to say "Silly boy!", change quotes to suit.

    perl -ne "if (m{^</root>} and not eof) { scalar <> for 1..2 } else { print }" file.xml >new.xml

    Examine what is said, not who speaks.        The end of an era!
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Hi Guys,
      you've all been most helpful. The solution I went with was the flip-flop operator. It did exactly what I needed. It is now tucked away in the bag-o-tricks.

      This board comes through every time I need something!
Re: Replacing Lines in 100 Gig file
by amw1 (Friar) on Dec 17, 2004 at 17:44 UTC
    You could try using Tie::File. It ties an array to a file without loading the whole thing into memory.

    From there it should be pretty straightforward to loop through the array and strip out the stuff you want to strip. Changes to the array modify the file.

    The perldoc for it should tell you what you need to know.
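
    A rough sketch of the Tie::File idea, assuming every boundary is exactly the three-line run shown in the question. The helper name strip_boundaries and the file name in the usage comment are hypothetical. As far as I know, Tie::File has to rewrite the rest of the file whenever lines are removed, so on a file this size a streaming filter will probably be far quicker.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: remove every </root>, <?xml ...?>, <root> run
# of three consecutive lines from the file, editing it in place.
sub strip_boundaries {
    my ($path) = @_;
    tie my @lines, 'Tie::File', $path
        or die "Cannot tie $path: $!";

    # Walk backwards so a splice never shifts lines still to be checked.
    for (my $i = $#lines - 2; $i >= 0; $i--) {
        next if $i + 2 > $#lines;    # the array may have shrunk
        next unless $lines[$i]     =~ m{^\s*</root>}
                and $lines[$i + 1] =~ m{^\s*<\?xml}
                and $lines[$i + 2] =~ m{^\s*<root>};
        splice @lines, $i, 3;        # delete the whole boundary
    }
    untie @lines;
    return;
}

# Usage (file name is an assumption): strip_boundaries('file.xml');
```

    The final </root> is never followed by a declaration and a <root>, so it is left alone.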

    UPDATE: it's the line-by-line looping that wasn't working for you. Perhaps try looping with something like this:

    mmm pseudo code

    my $cut = 0;
    while (<FILE>) {
        if (/<\/root>/) {
            $cut = 1;               # start cutting at a closing tag
        }
        elsif ($cut && /<root>/) {
            $cut = 0;               # stop cutting
            next;                   # but skip this <root> line as well
        }
        print OUTFILE $_ unless $cut;
    }
    print OUTFILE "</root>\n";      # put back the final closing tag
Re: Replacing Lines in 100 Gig file
by osunderdog (Deacon) on Dec 17, 2004 at 19:45 UTC

    I had a similar problem, but my approach was to just surround the catenated xml files with <root></root> tags.

    I did the rest with XSLT.


    "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.


    OSUnderdog
Re: Replacing Lines in 100 Gig file
by Anonymous Monk on Dec 17, 2004 at 23:30 UTC
    Just delete all <root>, </root> and <?xml...> after the second line and add a </root> at the end.
    perl -pi.bak -e 'next unless $. > 2; s{</?root>\s*}{}; s{<\?xml.*\?>\s*}{};' foo.xml
    echo '</root>' >> foo.xml
    can't quite figure out how to leave that last </root> alone or how to add it at the end of processing...
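
    One possible way around that, sketched under the assumption that each tag sits on a line of its own and that the final </root> is the very last line of the file: Perl's eof reports when the line just read was the last one, so the filter can spare it. The sub name filter_stream and the sample text are made up for the example. It reads from one handle and writes to another, so only the current line is held in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, dropping every <root>, </root>
# and <?xml ...?> line except in the first two lines and the last line.
sub filter_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        my $is_marker = $line =~ m{^\s*(?:</?root>|<\?xml\b.*\?>)\s*$};
        # $. is the line number of the handle last read from;
        # eof($in) is true once the final line has been read.
        next if $is_marker && $. > 2 && !eof($in);
        print {$out} $line;
    }
    return;
}

# Demonstration on an in-memory sample instead of a real file.
my $text = join "\n",
    '<?xml version="1.0" encoding="UTF-8"?>', '<root>', '<a/>', '</root>',
    '<?xml version="1.0" encoding="UTF-8"?>', '<root>', '<b/>', '</root>', '';
open my $in,  '<', \$text      or die "open: $!";
open my $out, '>', \my $result or die "open: $!";
filter_stream($in, $out);
close $out;
print $result;
```

    For the real file the same sub could be handed \*STDIN and \*STDOUT and run as a filter.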
Re: Replacing Lines in 100 Gig file
by iburrell (Chaplain) on Dec 20, 2004 at 21:05 UTC
    Is '<root>' a valid element in the rest of the file? If not, then just remove all '<?xml' and '<root>' and '</root>' lines. And write the initial and final ones manually.
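
    A minimal sketch of that suggestion, assuming the answer is "no" (a bare <root> or </root> line never occurs inside the documents themselves). The sub name rewrap is hypothetical; the declaration it writes is the one quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: write one declaration and one <root>, copy every
# line that is not a declaration or a root tag, then close the root.
sub rewrap {
    my ($in, $out) = @_;
    print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n};
    while (my $line = <$in>) {
        # Skip <?xml ...?> lines and lines holding only <root> or </root>.
        next if $line =~ m{^\s*(?:<\?xml\b|</?root>\s*$)};
        print {$out} $line;
    }
    print {$out} "</root>\n";
    return;
}

# Demonstration on an in-memory sample instead of a real file.
my $joined = join "\n",
    '<?xml version="1.0" encoding="UTF-8"?>', '<root>', '<a/>', '</root>',
    '<?xml version="1.0" encoding="UTF-8"?>', '<root>', '<b/>', '</root>', '';
open my $src, '<', \$joined     or die "open: $!";
open my $dst, '>', \my $wrapped or die "open: $!";
rewrap($src, $dst);
close $dst;
print $wrapped;
```

    Because the first and last tags are written by hand, the loop needs no line counting and no end-of-file test.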