Cleaning Files

vek has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

I just had to whip up a quick XML filtering program to clean up some nastily formatted XML files. Here's a quick sample of the input:

<?xml version="1.0"?>
<!DOCTYPE VersionMaint PUBLIC  "- EN"  "http://path-to/the-dtd">
<VersionMaint>
   <MaintLevelInfo>
      <MaintId>
      </MaintId>
   </MaintLevelInfo>
   <MaintRequest>
      <AppId>987624</AppId>
      <AppName>TriggerVar</AppName>
   </MaintRequest>
</VersionMaint>

The goal is to clean up any empty blocks (in this example the MaintLevelInfo and MaintId blocks, but there are more, sometimes 3 or 4 levels of empty blocks). The following code works but it just feels bad; surely there has to be a better way:

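sub filterXML {
   my $xml = shift;
   my @xmlPieces = split (/\n/, $xml);

   # Keep filtering until a pass removes nothing, so nested empty blocks collapse
   my ($tmpFilter, $flag) = filter (\@xmlPieces);

   while ($flag) {
      ($tmpFilter, $flag) = filter ($tmpFilter);
   }
   my $filtered = join "\n", @$tmpFilter;
   return $filtered;
}

# One pass: drop each opening-tag line that is immediately followed by a closing-tag line
sub filter {
   my $items = shift;
   my (@new, $haveFiltered);
   my $prev = '';   # avoids an "uninitialized value" warning on the first line
   for my $xmlLine (@$items) {
      my $test = $xmlLine;
      $test =~ s/^\s+//;   # strip leading indentation before testing
      if ($test =~ /^<\// && $prev =~ /^<\w/) {
         # Leave lines like <AppId>987624</AppId> alone; they carry content
         unless ($prev =~ /<(.*)>(.*)<(.*)>/) {
            pop @new;
            $prev = $test;
            $haveFiltered++;
            next;
         }
      }
      push (@new, $xmlLine);
      $prev = $test;
   }
   return (\@new, $haveFiltered);
}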

Needless to say there's nothing I can do about the way the XML is originally formatted, I'm just the lucky fellow who has to fix it.

Re: Cleaning Files by theguvnor (Chaplain) on Jan 24, 2002 at 06:45 UTC
It has been said by Monks wiser than myself that just using regular expressions is doomed to failure. I suggest reading XML::Parser Tutorial for a good discussion of the reasons. Or you could do some research on your own by looking at the XML modules in CPAN starting with XML::Parser. Because your goal (cleaning up empty tag blocks) is fairly straightforward, you might even be able to get away with XML::Simple which I have found to be just that, very simple to use. Good luck.
Re: Re: Cleaning Files by vek (Prior) on Jan 24, 2002 at 10:25 UTC
I really must learn to clarify my posts. I'm not trying to parse the XML, just clean it before it is picked up on our FTP server by another department. I'm the middle man here - neither the generator of said XML nor the intended recipient. Thanks anyway for your suggestion.

UPDATE - Thanks theguvnor, you're right of course - I am obviously parsing here. I also agree that regex parsing is a complete no-no for any form of long-term XML parsing solution (I use XML::Parser very frequently actually). The program I whipped up was a quick hack of a "fix" program that would be used on XML that I could guarantee would not change format, hence regex parsing is not as scary (perhaps). The thought was that "proper" XML parsing with a reputable parser (i.e. XML::Parser) and then rewriting the XML out was overkill. Then again, perhaps it was a mistake to even post my code (grin) (it does work after all) as I should have known I'd be taken out back and beaten with a stick for even mentioning XML and regex in the same breath (grin). Thanks again mate, I do appreciate your answers as it was obviously a dodgy post judging from the lack of overall response ;)
Re: Re: Re: Cleaning Files by theguvnor (Chaplain) on Jan 24, 2002 at 22:15 UTC
Ah, but in order to clean it, you must do some kind of parsing. (That's what the process of reading a text stream in order to perform operations that will output a transformed text stream is.) My point is that regular expressions alone are prone to failure. But if you're looking to do a regex-only solution, and you're ONLY looking to remove EMPTY tags, then something involving a match on >< with a look-behind for < and look-ahead for > might do... Good luck in any event!

Update: I'm not claiming to have a dictionary definition of parsing, I was just trying to explain how I see the situation: reading the file, separating tag from text, and throwing away the empty tags... in short, parsing :)

Update: no problem vek, I didn't think it was a dodgy post in the least, and I wasn't trying to beat you down at all - in fact it sounds like you have way more experience with XML parsing than I do! Code on, brother...
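
For what it's worth, here is a minimal sketch of what a regex-only pass along those lines might look like. It is not code from either poster: strip_empty_elements is a hypothetical name, and it uses a backreference to pair each opening tag with its own closing tag rather than the look-behind/look-ahead on >< described above. It assumes, as in the sample, that every tag starts on its own line and that an "empty" block contains nothing but whitespace. It only removes an open/close pair when nothing sits between them, so the substitution is repeated until nothing changes in order to collapse nested empty blocks.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): repeatedly delete elements whose
# opening tag is followed only by whitespace and then its own closing tag.
sub strip_empty_elements {
    my ($xml) = @_;

    # Each successful pass may expose a new empty parent, so loop until
    # the substitution no longer matches anything.
    1 while $xml =~ s{
        ^[ \t]*                 # indentation at the start of a line
        <([A-Za-z_][\w.:-]*)    # opening tag name, remembered for later
        (?:\s[^<>]*)?           # optional attributes
        (?<!/)>                 # end of the opening tag, not a self-closing one
        \s*                     # only whitespace (including newlines) inside
        </\1\s*>                # closing tag with the same name
        [ \t]*\n?               # trailing blanks and the line break
    }{}xmg;

    return $xml;
}
```

Calling strip_empty_elements($xml) on the sample from the original post should leave only the MaintRequest block inside VersionMaint. The usual caveat still applies: comments, CDATA sections, or tags split across lines will trip this up, which is where a real parser earns its keep.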