Re^3: XML cleanup - regex or ?

Many of the downloads are 2+Gb long and I get memory errors if I do too much in RAM.

Well, that's a constraint you didn't share initially. Had I been aware of it, I would not have proposed slurping the file(s) into memory.

Now that I have a better understanding of the constraints, I would probably do something like the untested code below. For each file that needs 'cleaning', run the script with perl -i.bak and the file name as an argument. The -i.bak switch edits the file in place and first saves the original under the same name with a .bak extension. (With a bare -i, Perl just overwrites the file with no backup.) Note that -i only applies to files named on the command line and read through the <> operator, so the script has to read its input that way rather than through its own open().
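As a side note, the same in-place behaviour is available from inside a script by localizing $^I and @ARGV, which may be handy if you want to loop over many files yourself. The following is only a minimal sketch: the temporary file, its contents, and the 'noise' pattern are made up for illustration and are not part of your data.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical demo file; name and contents are invented for this sketch.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.xml";
open(my $out, '>', $file) or die "Unable to create '$file': $!\n";
print {$out} qq{<cat tail="long" meow="loud"/>\n}, "noise\n";
close($out);

{
    # Equivalent to running with -i.bak on $file: <> reads the file,
    # print writes to its replacement, and the original is kept
    # as "$file.bak".
    local ($^I, @ARGV) = ('.bak', $file);
    while (my $line = <>) {
        # Drop the made-up 'noise' line, keep everything else.
        print $line unless $line =~ /^noise$/;
    }
}
```

Because only one line is held at a time, memory use should stay flat regardless of file size.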

Basically, the code below checks a file line by line against each tag/attribute pairing specified. If a line contains one of the tags but is missing a required attribute, that line is 'deleted' from the file (it is simply not printed back). It assumes each tag sits on a single line. This might not be exactly what you want to do, but it should give you a framework to use for your own 'noise' handling operations.

use strict;
use warnings;

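# Tag name => list of attributes that must be present on that tag.
my %pairings;
Initialize_Pairings();

# Read via <> so that perl -i.bak edits the files named on the command
# line in place; under -i, print writes to the file being rewritten.
while (my $line = <>) {
    my $check = 0;
    foreach my $key (keys %pairings) {
        if (!(Check_Line($key,$line))) {
            $check++;
            last;
        }
    }
    # Keep the line only if no pairing check failed.
    if ($check == 0) {print $line;}
}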

sub Initialize_Pairings {
    push @{$pairings{cat}},"tail","meow";
    push @{$pairings{dog}},"tail","bark";
}

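sub Check_Line {
    my ($tag, $line) = @_;
    # A line that does not contain this tag has nothing to check; keep it.
    return 1 unless $line =~ m/<\Q$tag\E\b/i;
    # $#{...} is the last index of the attribute list for this tag.
    foreach my $i (0 .. $#{$pairings{$tag}}) {
        my $attrib = $pairings{$tag}[$i];
        # Require the attribute name followed by '=' inside the same tag.
        if ($line !~ m/<\Q$tag\E\s[^>]*\b\Q$attrib\E\s*=/i) {
            return 0;
        }
    }
    return 1;
}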
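
To make the intended keep/delete behaviour concrete, here is a small self-contained sketch that applies the same kind of check to a few in-memory lines instead of a file. The names %required and keep_line, and all of the sample lines, are hypothetical and exist only for this illustration; it also assumes that a line without any of the listed tags should be kept.

```perl
use strict;
use warnings;

# Hypothetical pairings, mirroring the cat/dog example.
my %required = (
    cat => [qw(tail meow)],
    dog => [qw(tail bark)],
);

# Returns true if the line should be kept: every listed tag that
# appears on the line carries all of its required attributes.
sub keep_line {
    my ($line) = @_;
    foreach my $tag (keys %required) {
        # Tag not on this line: nothing to check for it.
        next unless $line =~ m/<\Q$tag\E\b/i;
        foreach my $attrib (@{ $required{$tag} }) {
            return 0 unless $line =~ m/<\Q$tag\E\s[^>]*\b\Q$attrib\E\s*=/i;
        }
    }
    return 1;
}

# Made-up sample lines.
my @sample = (
    qq{<cat tail="long" meow="loud"/>\n},     # complete: kept
    qq{<cat tail="short"/>\n},                # missing meow: dropped
    qq{<dog bark="deep" tail="wagging"/>\n},  # complete, other order: kept
    qq{<bird wings="2"/>\n},                  # tag not listed: kept
);

my @kept = grep { keep_line($_) } @sample;
print @kept;
```

Only the second sample line is dropped. A regex check like this is deliberately crude; if tags can span several lines or attributes can appear inside comments or CDATA, a streaming XML parser would be the safer route.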