If you make up a tag name to use as the single container for all your existing xml files, the job becomes pretty simple, and probably doesn't need any xml parsing at all. You just need to make sure that the new tag name does not already occur as a tag in any of the existing xml files.
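As a rough sketch of that check (the sub names here are made up for illustration), you could scan the existing files for the candidate tag name before committing to it. This is a plain-text scan rather than a parse, so it can report a false positive when the name shows up inside a comment or CDATA section, which errs on the safe side:

```perl
use strict;
use warnings;
use File::Find;

# Return 1 if $tag looks like it is used as an element name in $text.
# Plain-text scan, not a parse: matches <tag>, </tag>, <tag attr=...>, <tag/>.
sub tag_in_text {
    my ( $tag, $text ) = @_;
    return $text =~ m{</?\Q$tag\E(?=[\s/>])} ? 1 : 0;
}

# Scan every *.xml file under @dirs; return the paths that already use $tag.
sub files_using_tag {
    my ( $tag, @dirs ) = @_;
    my @found;
    find(
        sub {
            return unless /[.]xml\z/i and -f;
            open my $in, '<', $_ or return;
            local $/;    # slurp the whole file
            my $text = <$in>;
            push @found, $File::Find::name
                if defined $text and tag_in_text( $tag, $text );
        },
        @dirs
    );
    return @found;
}

# Only runs when directories are given on the command line.
if (@ARGV) {
    my @found = files_using_tag( 'all_xml_contents', @ARGV );
    print @found
        ? "tag already used in:\n" . join( "\n", @found ) . "\n"
        : "tag name is free\n";
}
```

If any file turns up, just pick a different container name and run it again.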

It's good that you already solved the part about finding all the files -- I'll use the OP code as a starting point (thanks for that), and reduce it down to just the essentials:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp;
    use File::Find;
    use File::Spec::Functions qw( canonpath );

    if ( @ARGV == 0 ) {
        push @ARGV, "C:/file/dir";
        warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
    }

    # open an output file whose name won't be found by File::Find
    open( my $allxml, '>', "all_xml_contents.combined" )
        or die "can't open output xml file for writing: $!\n";

    # one xml declaration for the whole file, then open the container element
    print $allxml '<?xml version="1.0" encoding="UTF-8"?>', "\n<all_xml_contents>\n";

    find(
        sub {
            # only plain files whose names end in ".xml"
            return unless ( /[.]xml\z/i and -f );
            extract_information();
            return;
        },
        @ARGV
    );

    # close the container element
    print $allxml "</all_xml_contents>\n";

    sub extract_information {
        my $path = $_;
        if ( open my $xmlin, '<', $path ) {
            # copy the first line only if it is not an xml declaration
            local $_ = <$xmlin>;
            print $allxml $_ unless ( /<\?xml/ );

            # copy the rest of the file unchanged
            while ( <$xmlin> ) {
                print $allxml $_;
            }
        }
        return;
    }

The point is that, since each input xml file is a fully self-contained element, and you probably don't want to disrupt that structure, all you need is to create a novel tag that won't get confused with any existing content, and use that as the one element that will contain everything else being put into the new file. Just drop the initial <?xml...?> line from each input file. (I've seen a lot of "xml" files that don't start with that, so I think it's worthwhile to check.) Note that the code only looks at the first line of each file and drops that whole line, so it assumes the declaration sits on a line of its own.
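If some of your files put the declaration on the same line as the root element, dropping the first line would throw away real content. Here is a minimal sketch of a more tolerant approach, assuming the files are small enough to slurp and are read as raw bytes; strip_xml_declaration is a name I made up for illustration:

```perl
use strict;
use warnings;

# Remove a leading UTF-8 byte-order mark and xml declaration from a
# document held in a string. Text without a declaration is returned
# unchanged. Assumes $xml holds raw bytes (no decoding layer on the read).
sub strip_xml_declaration {
    my ($xml) = @_;
    $xml =~ s/\A\xEF\xBB\xBF//;              # drop a BOM if there is one
    $xml =~ s/\A\s*<\?xml\s.*?\?>\s*//s;     # drop "<?xml ... ?>" only
    return $xml;
}

# Only runs when file names are given on the command line.
if (@ARGV) {
    for my $path (@ARGV) {
        open my $in, '<', $path or do { warn "skipping $path: $!\n"; next };
        local $/;    # slurp mode
        my $xml = <$in>;
        print strip_xml_declaration($xml) if defined $xml;
    }
}
```

The pattern requires whitespace right after "xml", so other processing instructions such as <?xml-stylesheet ...?> are left alone.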

Other things I changed in the code were:

Now, this doesn't handle the problem of removing duplicate xml content, but that's something that will be a lot easier to do after you've written the one big xml file. That's where a good parsing module (like XML::LibXML) will come in very handy.

If your duplication problem is really just a matter of the (exact) same xml content showing up in multiple files (e.g. "foo1.xml" is a copy of "foo2.xml", or "blah1/foo.xml" is a copy of "blah2/foo.xml"), you can simply get md5 signatures of all the files first, sort by md5 values, and look for duplicates that way (files with identical content will have identical md5 values).
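A minimal sketch of the whole-file check, using the core Digest::MD5 and File::Find modules. Instead of sorting a list of md5 values, it groups the paths by digest in a hash, which gets you the same answer; the sub names are made up for illustration:

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Find;

# Return the hex md5 digest of a file's raw bytes.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't read $path: $!\n";
    binmode $fh;
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

# Group the *.xml files under @dirs by digest. Returns a hash of
# digest => [paths], keeping only digests shared by two or more files.
sub duplicate_groups {
    my (@dirs) = @_;
    my %by_md5;
    find(
        sub {
            return unless /[.]xml\z/i and -f;
            push @{ $by_md5{ md5_of_file($_) } }, $File::Find::name;
        },
        @dirs
    );
    my %dups;
    for my $md5 ( keys %by_md5 ) {
        $dups{$md5} = $by_md5{$md5} if @{ $by_md5{$md5} } > 1;
    }
    return %dups;
}

# Only runs when directories are given on the command line.
if (@ARGV) {
    my %dups = duplicate_groups(@ARGV);
    for my $md5 ( sort keys %dups ) {
        print "identical content ($md5):\n";
        print "  $_\n" for sort @{ $dups{$md5} };
    }
}
```

Keep in mind this only catches byte-for-byte copies; two files that differ by a single space or line ending will get different md5 values.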

But if the duplication problem involves elements that make up parts of files, then a parser is the only way to go, and you'll need to know enough about the data to figure out which elements need to be checked for duplicate content. If you know which tags to look at, running a parser on the "all-combined" xml will make it easy to find and remove the duplicates.
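Here is a rough sketch of that last step with XML::LibXML (a CPAN module, so it has to be installed). It assumes you can write an XPath expression that selects the elements to compare, and it treats two elements as duplicates only when their canonical (C14N) serializations match exactly; remove_duplicate_elements and the "rec" tag in the usage line are made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Remove later copies of the elements matched by $xpath whose canonical
# serialization is identical to an earlier match. Returns the number of
# elements removed. $xpath should not select elements nested in each other.
sub remove_duplicate_elements {
    my ( $doc, $xpath ) = @_;
    my %seen;
    my $removed = 0;
    for my $node ( $doc->findnodes($xpath) ) {
        my $key = $node->toStringC14N();
        next unless $seen{$key}++;    # first occurrence is kept
        $node->unbindNode;            # later identical copies are dropped
        $removed++;
    }
    return $removed;
}

# Usage: script.pl all_xml_contents.combined '/all_xml_contents/*/rec'
if ( @ARGV == 2 ) {
    my ( $file, $xpath ) = @ARGV;
    my $doc = XML::LibXML->load_xml( location => $file );
    my $n   = remove_duplicate_elements( $doc, $xpath );
    warn "removed $n duplicate element(s)\n";
    print $doc->toString;
}
```

Canonical form smooths over things like attribute order, but whitespace inside an element still counts, so elements that differ only in indentation will not be treated as duplicates. If that matters for your data, you would need to build the comparison key from the specific child values instead.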


In reply to Re: Multiple XML files from Directory to One XML file using perl. by graff
in thread Multiple XML files from Directory to One XML file using perl. by jyo
