
When I first saw the mention of XML, I was tempted to suggest using something like XML::Simple to parse the data. However, I wasn't sure whether your "sample data" included all of the possible XML tags from your real data files, which got me thinking about parsing the data by hand.

Anyways, I decided to challenge myself to see if I could come up with working code that does the job without an XML parsing module. It may not be the "best" way, but the code below appears to work: it reports every host-alias that file 1 lists under a host ID but that file 2 does not list under the same ID. Hopefully this rough bit of code is good enough to give you some ideas on how to do your file comparison. Enjoy!

Sample File 1 - data1.txt

<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.dj.com</host-alias>
</host>

<host id="bobsmith" root-directory=".">
  <host-alias>www.abc.com</host-alias>
  <host-alias>www.def.com</host-alias>
  <host-alias>www.ghij.com</host-alias>
</host>

<host id="pauljones" root-directory=".">
  <host-alias>www.zyx.com</host-alias>
  <host-alias>www.wvut.com</host-alias>
  <host-alias>www.srqpon.com</host-alias>
</host>

Sample File 2 - data2.txt

<host id="mikebrown" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.dj.com</host-alias>
</host>

<host id="bobjones" root-directory=".">
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.dj.com</host-alias>
  <host-alias>www.music.com</host-alias>
</host>

<host id="bobsmith" root-directory=".">
  <host-alias>www.abc.com</host-alias>
  <host-alias>www.good.com</host-alias>
  <host-alias>www.def.com</host-alias>
  <host-alias>www.ghij.com</host-alias>
</host>

<host id="pauljones" root-directory=".">
  <host-alias>www.bad.com</host-alias>
  <host-alias>www.zyx.com</host-alias>
  <host-alias>www.wvut.com</host-alias>
  <host-alias>www.srqpon.com</host-alias>
</host>

Code:

use strict;
use warnings;

my $file1 = "data1.txt";
my $file2 = "data2.txt";

my $raw_data1 = Slurp_File($file1);
my $raw_data2 = Slurp_File($file2);

my (@sections1) = ($raw_data1 =~ m/(<host .+?\/host>)/sig);
my (@sections2) = ($raw_data2 =~ m/(<host .+?\/host>)/sig);

my %parsed_file;
foreach my $section (@sections2) {
  my ($id,@parsed_data) = Parse_Section($section);
  foreach my $alias (@parsed_data) {
    $parsed_file{$id}{$alias}++;
  }
}

foreach my $section (@sections1) {
  my ($id,@parsed_data) = Parse_Section($section);
  foreach my $alias (@parsed_data) {
    if (!$parsed_file{$id}{$alias}) {
      print "HostID: $id, Host-Alias: $alias was missing from file '$file2'\n";
    }
  }
}

############

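sub Slurp_File {
  my $file = shift;
  # Use a lexical filehandle; the bareword DATA is Perl's special __DATA__ handle.
  open(my $fh, "<", $file) or die "Unable to open file '$file':  $!\n";
  # Undefine the input record separator locally so <$fh> reads the whole file.
  local $/;
  my $data = <$fh>;
  close($fh);
  return $data;
}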

sub Parse_Section {
  my $data = shift;
  my ($id) = ($data =~ m/id=\"(.+?)\"/i);
  my (@alias) = ($data =~ m/host-alias>(.+?)</ig);
  my (@list) = ($id,@alias);
  return @list;
}

Output:

HostID: bobjones, Host-Alias: www.foo.com was missing from file 'data2.txt'
[download]
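
If you also need to know what was added in the second file, the same idea can be run in both directions. Here is a minimal sketch of that, not a drop-in replacement: parse_hosts and missing_from are helper names I made up for illustration, the data is inlined with heredocs instead of being read from files, and it reuses the same regexes as above, so it makes the same assumptions about the data layout.

```perl
use strict;
use warnings;

# Hypothetical helper: build { host id => { alias => 1 } } from raw text,
# using the same regexes as the script above.
sub parse_hosts {
    my ($raw) = @_;
    my %hosts;
    for my $section ($raw =~ m/(<host .+?\/host>)/sig) {
        my ($id) = $section =~ m/id="(.+?)"/i;
        next unless defined $id;
        $hosts{$id}{$_} = 1 for $section =~ m/host-alias>(.+?)</ig;
    }
    return \%hosts;
}

# Hypothetical helper: list "id alias" pairs found in $have but not in $want.
sub missing_from {
    my ($have, $want) = @_;
    my @missing;
    for my $id (sort keys %$have) {
        for my $alias (sort keys %{ $have->{$id} }) {
            # Check the id first so the lookup does not autovivify $want->{$id}.
            push @missing, "$id $alias"
                unless $want->{$id} && $want->{$id}{$alias};
        }
    }
    return @missing;
}

# Cut-down versions of the bobjones entries from the two sample files.
my $old = parse_hosts(<<'END');
<host id="bobjones" root-directory=".">
  <host-alias>www.foo.com</host-alias>
  <host-alias>www.bar.com</host-alias>
</host>
END

my $new = parse_hosts(<<'END');
<host id="bobjones" root-directory=".">
  <host-alias>www.bar.com</host-alias>
  <host-alias>www.music.com</host-alias>
</host>
END

print "only in first:  $_\n" for missing_from($old, $new);
print "only in second: $_\n" for missing_from($new, $old);
```

Because the lookup goes through the host ID, a host that exists in only one file (like mikebrown in data2.txt) would show up with all of its aliases listed as "only in" that file.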

In reply to Re: Difficulty Mapping Data by dasgar
in thread Difficulty Mapping Data by walkingthecow
