jccunning has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for insight and help mapping out pseudocode for an XML comparison with informative output.

Problem: I am comparing two XML files that represent a documented API. I use doxygen to generate the API as Perl module output of hashes and arrays, and I wrote a script that uses XML::Simple to convert that to XML, which does a great job. Now I would like a Perl script to compare the XML files and report what has changed, such as:

  New Class SomeClass::BitVector has been added to API.
  Public method parameter: captureSettings has been added to public method: serialize in the Class: SomeClass::Core
  Class SomeClass::ConfigException has been removed from API.

All information for each class is contained within a parent element, for example <classes name="SomeClass::BitVector">. Child elements are <private_members>, etc.

What approach should I take? Maybe take each classes element and all of its subelements, assign them to an object, and compare objects; is that possible? Or pull everything for each class into an array and compare arrays? Any ideas on the best approach, and sample code, would be much appreciated.
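
A minimal sketch of the "one structure per class, then compare" idea, assuming both files have already been loaded into hashes keyed by class name. The layout (class name, then section, then member name) and the section name public_methods are hypothetical; the real shape depends on how the XML is parsed.

```perl
use strict;
use warnings;

# Hypothetical layout: class name => { section => { member name => 1 } }.
# Returns a list of human-readable change messages.
sub diff_api {
    my ($old, $new) = @_;
    my @report;
    my %classes = map { $_ => 1 } keys %$old, keys %$new;
    for my $class (sort keys %classes) {
        if (!exists $old->{$class}) {
            push @report, "New class $class has been added to API.";
            next;
        }
        if (!exists $new->{$class}) {
            push @report, "Class $class has been removed from API.";
            next;
        }
        # The class exists in both files: compare its members section by section.
        my %sections = map { $_ => 1 }
            keys %{ $old->{$class} }, keys %{ $new->{$class} };
        for my $section (sort keys %sections) {
            my $o = $old->{$class}{$section} || {};
            my $n = $new->{$class}{$section} || {};
            push @report, "$section member $_ has been added to class $class."
                for grep { !exists $o->{$_} } sort keys %$n;
            push @report, "$section member $_ has been removed from class $class."
                for grep { !exists $n->{$_} } sort keys %$o;
        }
    }
    return @report;
}

# Made-up sample data, for illustration only.
my %old_api = (
    'SomeClass::Core'            => { public_methods => { serialize => 1 } },
    'SomeClass::ConfigException' => { public_methods => { what => 1 } },
);
my %new_api = (
    'SomeClass::Core'      => { public_methods => { serialize => 1, reset => 1 } },
    'SomeClass::BitVector' => { public_methods => { set => 1 } },
);

print "$_\n" for diff_api(\%old_api, \%new_api);
```

Keying by name means the order of classes and members in the file does not matter. Going one level deeper (method parameters, for instance) would repeat the same added/removed comparison inside each member.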


Replies are listed 'Best First'.
Re: comparing xml and producing informative output
by tobyink (Canon) on Jul 20, 2012 at 23:16 UTC

    Start with XML::SemanticDiff.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      XML::SemanticDiff works pretty well, except I discovered that a child element can be reported as a new rogue element in the new file if it is not in the same position as in the old file. For example, if the old file contains:
      <classes name="Zanoply::AccessLogic">
        <all_members name="accessLogic"/>
        <all_members name="DBUS"/>
      </classes>
      And the new file contains:
      <classes name="Zanoply::AccessLogic">
        <all_members name="accessLogic"/>
        <all_members name="SPR"/>
        <all_members name="DBUS"/>
      </classes>
      Then DBUS is reported both as a change in attribute value and as a new rogue element.
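
      One way around position-sensitive reports like this is to compare the sibling elements by their name attribute instead of by position. A minimal sketch, assuming the name values have already been extracted into plain lists (the helper compare_by_name is hypothetical, not part of XML::SemanticDiff):

```perl
use strict;
use warnings;

# Compare two lists of member names as sets, ignoring position.
# Returns (\@added, \@removed), each sorted alphabetically.
sub compare_by_name {
    my ($old_names, $new_names) = @_;
    my %in_old = map { $_ => 1 } @$old_names;
    my %in_new = map { $_ => 1 } @$new_names;
    my @added   = grep { !$in_old{$_} } sort keys %in_new;
    my @removed = grep { !$in_new{$_} } sort keys %in_old;
    return (\@added, \@removed);
}

# name attributes of the <all_members> elements in the two examples above
my @old = qw(accessLogic DBUS);
my @new = qw(accessLogic SPR DBUS);

my ($added, $removed) = compare_by_name(\@old, \@new);
print "added: @$added\n";
print "removed: @$removed\n";
```

      With the data above only SPR shows up as added; DBUS is found in both lists regardless of where it sits, so it is not reported at all.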
Re: comparing xml and producing informative output
by Anonymous Monk on Jul 21, 2012 at 00:14 UTC
    The best approach is to not compare XML at all. Since XML is basically a tree structure, create a real directory tree on the filesystem and use diff -ruN as usual.

      Just because two things have tree-like structures, that doesn't mean it's always trivial to convert between them.

      Example 1: some XML dialects (such as XHTML) attach significance to the order in which the elements occur; others (such as RDF/XML) do not; others (like Atom) treat order as significant in some places but not others. Generally speaking, filesystems do not assign any significance to the order elements occur in. A directory containing foo.txt and bar.txt, and another directory containing bar.txt and foo.txt are effectively the same thing. (And diff will certainly treat them the same.)

      Example 2: Many XML dialects allow sibling elements to have the same tag name. Given the following XHTML:

      <html> <head /> <body /> </html>

      ... it might seem obvious to create a top-level directory called html with subdirectories head and body. But then let's expand upon that XHTML...

      <html> <head /> <body> <p>Foo</p> <p>Bar</p> </body> </html>

      ... and then the body directory needs to contain two subdirectories, each called p. I don't know of any filesystems that would allow this.
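
      A small sketch of that collision, assuming a hypothetical in-memory form where each element is an array reference of the tag followed by its children. Turning tag names directly into path components makes the two p elements collide; an added position index keeps them apart, but it also makes the mapping sensitive to element order.

```perl
use strict;
use warnings;

# Hypothetical tree form: [ tag, child, ... ]; plain strings are text nodes.
my $doc = [ 'html', [ 'head' ], [ 'body', [ 'p', 'Foo' ], [ 'p', 'Bar' ] ] ];

# Return one path per element, starting from $path for the node itself.
# With $indexed true, each step carries its position among same-named siblings.
sub element_paths {
    my ($node, $path, $indexed) = @_;
    my (undef, @children) = @$node;
    my @paths = ($path);
    my %seen;
    for my $child (grep { ref $_ } @children) {
        my $name = $child->[0];
        my $n    = ++$seen{$name};
        my $step = $indexed ? sprintf('%s[%d]', $name, $n) : $name;
        push @paths, element_paths($child, "$path/$step", $indexed);
    }
    return @paths;
}

my @plain   = element_paths($doc, 'html', 0);
my @indexed = element_paths($doc, 'html', 1);

print "plain:   $_\n" for @plain;
print "indexed: $_\n" for @indexed;
```

      In the plain listing html/body/p appears twice, which is exactly the pair of directories a filesystem would refuse to create side by side.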

      Example 3: filesystems typically have two types of entities: directories and files. XML trees have more: elements, attributes, text nodes, comments, and (admittedly rarely) processing instructions. So if you decide to represent elements as directories and attributes as files, how do you represent the rest?

      OK, so these are problems which are not insurmountable, but mapping XML's structure to the filesystem is not a trivial exercise at all.

      The original poster had a fairly simple problem: how to compare two XML files and get a list of only the differences they care about (e.g. whitespace differences might not be important).

      Going the filesystem route, they've swapped one problem for two - now they have to perform a potentially complex mapping from XML to the filesystem, and then find a way to compare two directories and get a list of only the differences they care about.
