
Many of the downloads are 2+Gb long and I get memory errors if I do too much in RAM.

Well, that's a constraint you didn't share initially. Had I been aware of it, I would not have proposed slurping the file(s) into memory.

Now that I have a better understanding of the constraints, I would probably do something like the untested code below. For each file that needs 'cleaning', run the script with perl -i.bak (for example, perl -i.bak script.pl file.xml, where script.pl is whatever you name the script). The -i switch opens the file for in-place editing, and the .bak suffix makes Perl save a backup copy with the .bak extension first. (Without the .bak, Perl just overwrites the file with no backup.)

Basically, the code below checks the file line by line against each tag/attribute pairing specified. If a line contains one of the tags but is missing one of that tag's attributes, that line is 'deleted' from the file. It assumes a tag and its attributes sit on a single line. This might not be exactly what you want to do, but it should give you a framework for your own 'noise' handling operations.

use strict;
use warnings;

# Run as: perl -i.bak script.pl file.xml
my %pairings;
Initialize_Pairings();

# With -i, <> reads each file named on the command line one line at a
# time, and print writes to the edited copy of that file.
while (my $line = <>) {
    my $check = 0;
    foreach my $key (keys %pairings) {
        if (!(Check_Line($key,$line))) {
            $check++;
            last;
        }
    }
    if ($check == 0) {print $line;}
}

# Required attributes for each tag.
sub Initialize_Pairings {
    push @{$pairings{cat}},"tail","meow";
    push @{$pairings{dog}},"tail","bark";
}

# Returns 0 if the line has the tag but lacks a required attribute.
sub Check_Line {
    my $tag = shift;
    my $line = shift;
    return 1 if ($line !~ m/<\Q$tag\E\s/i);   # tag not on this line
    foreach my $attrib (@{$pairings{$tag}}) {
        if ($line !~ m/<\Q$tag\E\s[^>]*\b\Q$attrib\E\s*=/i) {
            return 0;
        }
    }
    return 1;
}
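
Before turning this loose on a 2+Gb download, it may be worth trying the matching idea on a handful of lines held in an array. The following is only a sketch: %required, keep_line and the sample lines are made-up names and data for illustration, and it assumes each tag and its attributes fit on one line.

```perl
use strict;
use warnings;

# Hypothetical sample data: attributes each tag is required to carry.
my %required = (
    cat => [ "tail", "meow" ],
    dog => [ "tail", "bark" ],
);

# Returns 0 if the line contains a listed tag that lacks one of its
# required attributes; returns 1 for every other line.
sub keep_line {
    my ($line) = @_;
    foreach my $tag (keys %required) {
        next if $line !~ m/<\Q$tag\E\s/i;    # tag not on this line
        foreach my $attrib (@{ $required{$tag} }) {
            return 0 if $line !~ m/<\Q$tag\E\s[^>]*\b\Q$attrib\E\s*=/i;
        }
    }
    return 1;
}

# Made-up test lines; the second one is missing the meow attribute.
my @sample = (
    qq{<cat tail="long" meow="loud" />\n},
    qq{<cat tail="long" />\n},
    qq{<dog tail="short" bark="deep" />\n},
    qq{<bird wings="2" />\n},
);

# Print only the lines that pass the check.
print grep { keep_line($_) } @sample;
```

Here the second sample line should be dropped, while the other three (including the bird line, which has no listed tag) pass through unchanged.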

In reply to Re^3: XML cleanup - regex or ? by dasgar
in thread XML cleanup - regex or ? by ethrbunny


