Re: XML cleanup - regex or ?

by dasgar (Priest)
on Sep 21, 2010 at 14:46 UTC


in reply to XML cleanup - regex or ?

Here's one approach, if you want to do it by hand.

  • Slurp the file into a variable.
  • Grab all of the 'cat' tags and put them in an array.
    my @cats = $file =~ m/(<cat.+?\/>)/ig;
  • Then find the ones that have the missing attributes.
    foreach my $cat (@cats) {
        if ($cat !~ m/meow=.+?/i) {
            # do action here
        }
    }

The process should work. However, if you have hundreds of tag/attribute combinations, you probably wouldn't want to hard-code them. Instead, you might prefer to write a subroutine and pass in the tag and attribute combination.
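
A minimal sketch of that subroutine idea, assuming the tags are self-closing as in the regex above and that attribute values never contain a '>' character. The name find_missing and the sample cat/meow data are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every self-closing <$tag ... /> element in
# $text that lacks the attribute $attr.
sub find_missing {
    my ($text, $tag, $attr) = @_;

    # \b keeps 'cat' from also matching tags such as <category .../>.
    my @elements = $text =~ m/(<\Q$tag\E\b[^>]*\/>)/ig;

    # Keep only the elements that have no "attr=" inside them.
    return grep { !/\b\Q$attr\E\s*=/i } @elements;
}

# Made-up sample data standing in for the slurped file.
my $file = '<cat meow="loud"/><cat purr="soft"/><dog bark="yes"/>';
my @bad  = find_missing($file, 'cat', 'meow');
print "$_\n" for @bad;
```

Each tag/attribute pair then becomes one call, so a long list of combinations can live in a hash or a config file rather than in hand-written regexes.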

Hope this helps.

Re^2: XML cleanup - regex or ?
by ethrbunny (Monk) on Sep 21, 2010 at 15:11 UTC
    I have to clean the file line by line. Many of the downloads are 2+ GB long and I get memory errors if I do too much in RAM.

    I've been considering a 'cascade' of regexes to toss out the noise: something like testing for the cat, dog, etc., then looking for the param list. Lots of nested ifs. It seems messy, but it might be the only avenue. I was hoping there was a slick regex process to do this instead.

         Many of the downloads are 2+ GB long and I get memory errors if I do too much in RAM.

      Well, that's a constraint that you didn't share initially. Had I been aware of it, I would not have proposed slurping the file(s) into memory.

      Now that I have a better understanding of the constraints, I would probably do something like the untested code below. For each file that needs 'cleaning', run the script with perl -i.bak, which edits the file in place after first saving a backup copy with the .bak file extension. (With a bare -i and no extension, Perl just overwrites the file with no backup.)

      Basically, the code below will check a file line by line for each tag/attribute pair specified. If an attribute is missing for a tag, that line is 'deleted' from the file. This might not be exactly what you want to do, but it should give you a framework to use for your own 'noise' handling operations.
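
A minimal sketch of that line-by-line approach (not the original poster's script), assuming each element sits entirely on one line and that attribute values never contain a '>' character. The %required table, the keep_line name, and the file names in the usage note are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tag => required-attribute pairs; replace with your own.
my %required = (
    cat => 'meow',
    dog => 'bark',
);

# Return true if the line should be kept: for every listed tag that
# appears on the line, its required attribute must be present too.
sub keep_line {
    my ($line) = @_;
    for my $tag (keys %required) {
        my $attr = $required{$tag};
        # Walk over every opening or self-closing <$tag ...> on the line.
        while ($line =~ m/<\Q$tag\E\b([^>]*)>/ig) {
            my $attrs = $1;
            return 0 unless $attrs =~ m/\b\Q$attr\E\s*=/i;
        }
    }
    return 1;
}

# Process only files named on the command line, one line at a time,
# so memory use stays flat even for multi-gigabyte inputs.
if (@ARGV) {
    while (my $line = <>) {
        print $line if keep_line($line);
    }
}
```

Saved as, say, clean.pl, it could be run as perl -i.bak clean.pl download.xml; under -i the print inside the loop writes to the edited file, and lines that fail the check are simply not written back.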
