in reply to Multiple XML files from Directory to One XML file using perl.
It's good that you already solved the part about finding all the files -- I'll use the OP code as a starting point (thanks for that), and reduce it down to just the essentials:
The point is that each input xml file is a fully self-contained element, and you probably don't want to disrupt that structure. So all you need is to invent a novel tag that won't get confused with any existing content, and use that as the one root element that contains everything else being put into the new file. Then just drop the initial <?xml...?> declaration from each input file. (I've seen a lot of "xml" files that don't start with that line, so I think it's worthwhile to check for it rather than blindly skip the first line.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp;
    use File::Find;
    use File::Spec::Functions qw( canonpath );

    # fall back to a hard-coded path if no args are given on the command line
    if ( @ARGV == 0 ) {
        push @ARGV, "C:/file/dir";
        warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
    }

    # open an output file whose name won't be found by File::Find
    open( my $allxml, '>', "all_xml_contents.combined" )
        or die "can't open output xml file for writing: $!\n";

    print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
        "\n<all_xml_contents>\n";

    # walk the given path(s), handling every *.xml file found
    find( sub {
            return unless ( /[.]xml\z/i and -f );
            extract_information();
            return;
        }, @ARGV );

    print $allxml "</all_xml_contents>\n";

    # append one input file to the output, minus any leading <?xml...?> line
    sub extract_information {
        my $path = $_;
        if ( open my $xmlin, '<', $path ) {
            local $_ = <$xmlin>;
            print $allxml $_ unless ( /<\?xml/ );
            while ( <$xmlin> ) {
                print $allxml $_;
            }
        }
        return;
    }
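In case it helps, running it would look something like this (the script name here is just for illustration):

    perl combine_xml.pl C:/some/xml/dir

and the combined output lands in all_xml_contents.combined in whatever directory you ran it from.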
Other things I changed in the code were:
  - it takes the directory path(s) as command-line arguments, falling back to a hard-coded default (with a usage warning) when none are given;
  - the output file name deliberately doesn't end in ".xml", so File::Find won't feed the output back into the scan;
  - the first line of each input file is skipped only if it really is an <?xml...?> declaration.
If your duplication problem is really just a matter of the (exact) same xml content showing up in multiple files (e.g. "foo1.xml" is a copy of "foo2.xml", or "blah1/foo.xml" is a copy of "blah2/foo.xml"), you can simply get md5 signatures of all the files first, sort by md5 values, and look for duplicates that way (files with identical content will have identical md5 values).
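Here's a minimal sketch of that idea -- not part of the script above; it re-uses the same kind of File::Find filter along with the Digest::MD5 module, and it groups paths by digest in a hash rather than sorting, which amounts to the same check:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    @ARGV or die "Usage: $0 path ...\n";

    my %files_by_md5;

    find( sub {
            return unless ( /[.]xml\z/i and -f );
            open( my $fh, '<', $_ ) or return;
            binmode $fh;
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            push @{ $files_by_md5{$digest} }, $File::Find::name;
        }, @ARGV );

    # any digest that maps to more than one path is a set of exact duplicates
    for my $digest ( sort keys %files_by_md5 ) {
        my @paths = @{ $files_by_md5{$digest} };
        print "same content: @paths\n" if @paths > 1;
    }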
But if the duplication problem involves elements that make up parts of files, then a parser is the only way to go, and you'll need to know enough about the data to figure out which elements need to be checked for duplicate content. If you know which tags to look at, running a parser on the "all-combined" xml will make it easy to find and remove the duplicates.
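As a rough sketch of what that might look like -- this uses XML::Twig (just one parser choice among several), and the element name "record" is a made-up placeholder for whatever tag you decide actually needs checking:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new();
    $twig->parsefile('all_xml_contents.combined');

    # serialize each candidate element and keep only the first occurrence
    my %seen;
    for my $elt ( $twig->root->descendants('record') ) {
        my $content = $elt->sprint;
        $elt->delete if $seen{$content}++;
    }

    $twig->set_pretty_print('indented');
    $twig->print;    # de-duplicated xml goes to stdout

Using the serialized element as the identity test is deliberately blunt -- if your "duplicates" differ in whitespace or attribute order, you'd need a smarter comparison, but that's the general shape of it.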
Replies are listed 'Best First'.
  Re^2: Multiple XML files from Directory to One XML file using perl.
      by jyo (Initiate) on Nov 21, 2011 at 09:34 UTC
  by graff (Chancellor) on Nov 21, 2011 at 11:58 UTC
  by jyo (Initiate) on Nov 21, 2011 at 14:49 UTC
  by graff (Chancellor) on Nov 27, 2011 at 04:24 UTC