Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Regex problem while parsing tagged, hierarchical data

by sivaramanm (Acolyte)
on Sep 12, 2006 at 14:30 UTC ( [id://572538]=perlquestion: print w/replies, xml ) Need Help??

sivaramanm has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

My Input is

<level1 id="L1_0001"> <level2> <level3> <level4> <level3> <level4> <level2> <level2> <level1 id="L1_0002"> <level1 id="L1_0003"> <level2> <level3>

I need id for level2 is id of immediate above level1's

<level1 id="L1_0001"> <level2 id="L1_0001"> <level3> <level4> <level3> <level4> <level2 id="L1_0001"> <level2 id="L1_0001"> <level1 id="L1_0002"> <level1 id="L1_0003"> <level2 id="L1_0003"> <level3>

I tried the following

$content =~s#(<level1 id=\"([^"]*)\"(?:(?!<level1).)*?)<level2>#$1<lev +el2 id="$2">#gsi;

I got (didn't get for every level2)

<level1 id="L1_0001"> <level2 id="L1_0001"> <level3> <level4> <level3> <level4> <level2> <level2> <level1 id="L1_0002"> <level1 id="L1_0003"> <level2 id="L1_0003"> <level3>

Pls help me to add id for missing level2

Thanks

Siva

2006-09-14 Retitled by planetscape, as per Monastery guidelines

( keep:0 edit:23 reap:0 )

Original title: 'Regex problem'

Replies are listed 'Best First'.
Re: Regex problem while parsing tagged, hierarchical data
by jdtoronto (Prior) on Sep 12, 2006 at 14:37 UTC
    It kinda looks like your parsing XML, so why not use an XML parser, such as XML::Simple? That way you are presented with a Perl data structure you can traverse rather than trying to craft a regex which is quite possibly slower and has far more potential for errors.

    jdtoronto

      More like some bizarre SGML variant than XML. Valid XML can't omit closing tags; the sample data does that everywhere. But yes, this is a job for a real parser not regexen.

        Yeah, they're right that you're probably best off with a parser. However, if you really wanted to stick with regex, you could do the job more slowly and less elegantly with a line-by-line solution like this (make sure to improve before actual use):
        while (<INPUT>) { if (m#<level1 id=\"([^"]*)\"#) { $id=$1; print; } elsif (m#(\s*)<level2>#) { print "$1<level2 id=\"$id\">\n"; } else { print; } }
        Your regex doesn't work because you don't capture the id from the level1 tag when you get to the 2nd level2 tag below that tag. This works by capturing the last level1 id into a persistent variable, and replacing level2 tags when it finds them.

        Hays

Re: Regex problem while parsing tagged, hierarchical data
by reasonablekeith (Deacon) on Sep 12, 2006 at 14:58 UTC
    as said above, if you have any control over the format of this data you should get it changed to either an easier to parse proprietary format or get it into valid xml. As it stands a full xml parser wouldn't touch it with a barge poll (and rightly so)

    having said that, you could approach your regex from the other end and work backwards. I think this does what you want.

    while ( $content =~ s/(.*<level1 id="([^"]*).*?)<level2>/$1<level2 id= +\"$2\">/gsi ) { }
    I feel dirty just posting that though, as it's so flaky a single white space would break it. Think of it as an example as to why you should fix your file format :)

    UPDATE

    I couldn't shake the feeling inflicted on myself by the above post. I seek to redeem myself with a full xml version :)

    #!/usr/bin/perl use XML::DOM; use warnings; use strict; my $xml = q|<root> <level1 id="L1_0001"> <level2> <level3> <level4/> </level3> <level3> <level4/> </level3> </level2> <level2/> </level1> <level1 id="L1_0002"> </level1> <level1 id="L1_0003"> <level2> <level3/> </level2> </level1> </root>|; my $parser = new XML::DOM::Parser; my $doc = $parser->parse($xml); foreach my $l1_node ($doc->getElementsByTagName ('level1') ) { my $current_id = $l1_node->getAttribute('id'); foreach my $l2_node ($l1_node->getElementsByTagName ('level2') ) { $l2_node->setAttribute('id', $current_id); } } print $doc->toString; exit();
    ---
    my name's not Keith, and I'm not reasonable.
Re: Regex problem while parsing tagged, hierarchical data
by mk. (Friar) on Sep 12, 2006 at 15:10 UTC
    hope this helps:
    perl -i -pe 'if (/<level1 id=\"(.*)\">/) {$currentid = $1;} elsif (/<level2>/) {s/<level2>/<level2 id="$currentid">/g}' input

    edit: just deleting the redundant $_.


    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "one who asks a question is a fool for five minutes; one who does not ask a question remains a fool forever."

    mk at rio dot pm dot org
      perl -pi -e 'if(/<level1(\sid=\".*\")>/){$a=$1};if(/<level2>/){$_=~s/( +<level2)>/$1$a>/g}' input or perl -pi -e '$a=m/(<level([12]))(\sid=\".*\")*>/;if($a){$2==1?$b=$3:$_ +=~s/($1)/$1$b/g;}' input or perl -pi -e '$a=m/(<level(1|2))(\sid=\".*\")*>/;$a?$2==1?$b=$3:$_=~s/( +$1)/$1$b/g:next;' input or perl -pi -e '$a=m;(<level(1|2))(\sid=\".*\")*>;;$a?$2==1?$b=$3:$_=~s;( +$1);$1$b;g:1;' input
      or golf by monsieur_champs
      perl -pi -e 'm;(<level(1|2))(\sid=\".*\")*>;;$&?$2==1?$b=$3:$_=~s;($1) +;$1$b;g:1;' input
      sorry =) -1 char
      perl -pi -e 'm;(<level(1|2))(\sid=\".*\")*>;;$&?$2==1?$b=$3:$_=~s;($1) +;$1$b;g:1' input
      insane !!! 33 chars
      perl -pi -e '/l1(\sid=.*)>/?$a=$1:s;l2>;l2$a>;' input

      =)


      Mago
      mago@rio.pm.org


        perl -pi -e '/(<level(1|2))(\sid=\".*\")*>/;$&?$2==1?$b=$3:$_=~s;($1);$1$b;g:1;' input
Re: Regex problem while parsing tagged, hierarchical data
by prasadbabu (Prior) on Sep 12, 2006 at 15:03 UTC

    Sivaraman,

    I can see couple of mistakes in your code. I don't know why you are using '*?' when you are already using negative look behind. Also you are matching only the first <level2> from <level1> in your coding, so it is not matching.

    Also it is good to use Parser for these kind of works like jdtoronto and Fletch suggested.

    Though not efficient, here is my try in regex to match all <level2> inside <level1>

    $content =~s#(<level1 id=\"([^"]*)\"(?:(?!(?:<level1|$)).)*)(<level1)?#my $id = $2; my $level1 = $1; my $rest = $3; ($level1 =~ s|<level2>|<level2 id="$id">|g);$level1.$rest#egsi;

    update: sivaraman, immediately, i added '?' in (<level1)? after i posted. Sorry for not mentioning the updation in the node.

    Prasad

    A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://572538]
Approved by jdtoronto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (3)
As of 2024-04-20 02:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found