in reply to Re^2: Multiple XML files from Directory to One XML file using perl.
in thread Multiple XML files from Directory to One XML file using perl.

First, I don't understand why the <?xml ...?> lines from all the input files are being included in the single output file -- when I use my code as posted, it removes those from each input. Either you're running something different from what I posted, or else there's something odd about the <?xml... lines in your data files.
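
One possible kind of "odd" is a byte-order mark or stray whitespace sitting in front of the declaration, so that a pattern anchored at the start of the file no longer matches. That is only a guess; nothing in this thread confirms it. A minimal sketch of a more tolerant removal, where strip_xml_decl is a made-up helper name and not part of the code posted earlier:

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading XML declaration from a string,
# tolerating an optional BOM (decoded or as raw UTF-8 bytes) and any
# whitespace in front of it. Text without a declaration is returned as-is.
sub strip_xml_decl {
    my ($text) = @_;
    $text =~ s/\A(?:\x{FEFF}|\xEF\xBB\xBF)?\s*<\?xml\b[^>]*\?>\s*//;
    return $text;
}

# Sample input with a UTF-8 BOM in front of the declaration (assumed data).
my $sample = qq{\xEF\xBB\xBF<?xml version="1.0" encoding="UTF-8"?>\n<shiporder>\n</shiporder>\n};
print strip_xml_decl($sample);
```

If the declarations still survive with something like this, dumping the first few bytes of one input file (for example with a hex viewer) should show what is actually there.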

As for what you really want, which is one <shiporder ...> element containing all the content of all the files (that is, combining the "shipto" elements from all the input files into one "shiporder"), that's a different plan from what I was suggesting, and it would be best to use a parser for that.

In fact, it seems like the OP code is really pretty close to what you want. Here's my version, with Digest::MD5 thrown in to eliminate duplicate "shipto" content:

#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
use Digest::MD5 'md5';

if ( @ARGV == 0 ) {
    push @ARGV, "C:/file/dir";
    warn "Using default path $ARGV[0]\n  Usage: $0 path ...\n";
}

# open an output file whose name won't be found by File::Find
open( my $allxml, '>', "all_shiporders.xml.combined" )
    or die "can't open output xml file for writing: $!\n";
print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
    "\n<shiporder xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";

# digests of the "shipto" elements already written
my %shipto_md5;
find(
    sub {
        return unless ( /[.]xml\z/i and -f );
        extract_information();
        return;
    },
    @ARGV
);
print $allxml "</shiporder>\n";

sub extract_information {
    my $path = $_;
    if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
        while ( $reader->nextElement( 'shipto' )) {
            my $elem = $reader->readOuterXml();
            my $md5 = md5( $elem );
            # write each distinct "shipto" element only once
            print $allxml $elem unless ( $shipto_md5{$md5}++ );
        }
    }
    return;
}

That seems to work on a set of files such as the following, leaving out "j4.xml" because it's identical to "j2.xml":

==> j1.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>johan</name>
<address>Langgt 23</address>
</shipto>
</shiporder>

==> j2.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>benny</name>
<address>galve 23</address>
</shipto>
</shiporder>

==> j3.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>kent</name>
<address>vadrss 25</address>
</shipto>
</shiporder>

==> j4.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>benny</name>
<address>galve 23</address>
</shipto>
</shiporder>

==> j5.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>stewart</name>
<address>vadrss 25</address>
</shipto>
</shiporder>

Here's the output -- the only difference between this and what you wanted is the absence vs. presence of extra line-feeds around the "shipto" tags, which is just a matter of cosmetics:

<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<shipto>
<name>johan</name>
<address>Langgt 23</address>
</shipto><shipto>
<name>benny</name>
<address>galve 23</address>
</shipto><shipto>
<name>kent</name>
<address>vadrss 25</address>
</shipto><shipto>
<name>stewart</name>
<address>vadrss 25</address>
</shipto></shiporder>
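
If the line-feeds do matter, one small change is to print a newline after each element that gets written. A minimal sketch of that idea on plain strings, using the same Digest::MD5 check as the script; unique_fragments is a hypothetical helper name, and the fragments are hard-coded here instead of being read with XML::LibXML::Reader:

```perl
use strict;
use warnings;
use Digest::MD5 'md5';

# Hypothetical helper: keep the first occurrence of each fragment,
# comparing fragments by the MD5 digest of their exact text.
sub unique_fragments {
    my %seen;
    return grep { !$seen{ md5($_) }++ } @_;
}

# Stand-in data (assumed); in the script these strings would come
# from readOuterXml().
my @fragments = (
    "<shipto><name>johan</name></shipto>",
    "<shipto><name>benny</name></shipto>",
    "<shipto><name>benny</name></shipto>",    # exact duplicate, dropped
);

# One element per line instead of "</shipto><shipto>" run together.
print "$_\n" for unique_fragments(@fragments);
```

Because the digest is taken over the exact text, two elements that differ only in whitespace still count as different.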

Re^4: Multiple XML files from Directory to One XML file using perl.
by jyo (Initiate) on Nov 21, 2011 at 14:49 UTC

    Hi, In the script MD5 is no use because I dont have repeated nodes with same content, I have repeated nodes with one tag element, can you help me with that how to remove node information by searching that tag element.please help me with this problem. I am not able to implement logic.

      I am not able to implement logic.

      I don't know what that is supposed to mean (unless it means that you really don't know how to write program code, and you're unwilling to try).

      In the script MD5 is no use because I dont have repeated nodes with same content, I have repeated nodes with one tag element

      I don't know what that is supposed to mean either. Are you able to show an exact example of what "repeated nodes with one tag element" would look like? Can you at least describe how you would recognize the kind of content that needs to be removed?

      If you know how to describe the task clearly, that's just the first part of "implementing logic"... If you can't do that, then I guess we're done, and you should find something else to do (and find someone else to do this job instead of you).