PerlMonks  

Parsing Table

by tej (Scribe)
on Aug 17, 2010 at 09:26 UTC ( [id://855441]=perlquestion )

tej has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse tables and want to collect all the cells in an array. I am trying to do it with pattern matching, but I am unable to write a correct pattern.

The input looks like this:
    <Tr><Tc>PA Group (N <26> 23)<Tc>COM Group (N <26> 24)<Tc>
    <Tr>Gender<Tc><Tc><Tc><124><sup>2<reset> test <26> 0.216, <mdit>df<med> <26> 1, <mdit>P<med> <26> 0.642
    <Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc>
    <Tr><ems>Female (%)<Tc>9<ths>(39.1)<Tc>11<ths>(45.8)<Tc>
    <Tr>Ethnicity<Tc><Tc><Tc><124><sup>2<reset> test <26> 24.99, <mdit>df<med> <26> 4, <mdit>P<med> <178> 0.001
    <Tr><ems>African American (%)<Tc>5<ths>(21.7)<Tc>2<ths>(8.3)<Tc>
    <Tr><ems>European American (%)<Tc>5<ths>(21.7)<Tc>17<ths>(70.8)<Tc>
    <Tr><ems>Asian American (%)<Tc>0<Tc>4<ths>(16.7)<Tc>
    <Tr><ems>Hispanic American (%)<Tc>0<Tc>1<ths>(4.2)<Tc>
    <Tr><ems>Other (%)<Tc>9<ths>(39.1)<Tc>0<Tc>
    <Tr>Age, yr (SD)<Tc>46.05<ths>(6.13)<Tc>30.35<ths>(10.85)<Tc><mdit>t<med> <26> 5.94, <mdit>P<med> <178> 0.001
    <Tr>Education, yr (SD)<Tc>11.37<ths>(2.31)<Tc>15.85<ths>(1.75)<Tc><mdit>t<med> <26> 7.41, <mdit>P<med> <178> 0.001
    <Tr>Pain Threshold, <28>C (SD)<Tc>48.75<ths>(2.44)<Tc>47.33<ths>(3.24)<Tc>U* <26> 158.0, <mdit>P<med> <26> 0.012
    <Tr;;4><ems>Males (SD) <26> 47.77 (3.35)&dagger;
    <Tr;;4><ems>Females (SD) <26> 48.26 (2.36)&dagger;
    <Tr;;4><ems>U&Dagger; <26> 244, <mdit>P<med> <26> 0.582<endtab>

I am using the pattern /(<Tr(?:\;)*(?:\d)*>(.*?)<Tc>)+/

but with this I am unable to capture all the cells of a table.

Please help me with the correct pattern. Thank you.

Replies are listed 'Best First'.
Re: Parsing Table
by marto (Cardinal) on Aug 17, 2010 at 09:37 UTC
Re: Parsing Table
by Anonymous Monk on Aug 17, 2010 at 09:32 UTC
    No. Use a parser.
Re: Parsing Table
by Your Mother (Archbishop) on Aug 17, 2010 at 18:28 UTC

    There are some parsing idiosyncrasies (like tags are all considered lowercase within the parser) but the HTML::TokeParser family is usually quite good for anything that is SGMLish.

    Try this. It should be close to what you want already and pretty obvious how to adapt. See also: HTML::TokeParser::Simple. (Update: pulled YAML from sample code, it wasn't there for any reason.)

        use warnings;
        use strict;
        use HTML::TokeParser::Simple;

        my $p = HTML::TokeParser::Simple->new(\*DATA);

        while ( my $token = $p->get_tag() ) {
            if ( $token->get_tag =~ /\Atr(;+\d+)?\z/ ) {
                while ( $p->peek and $p->peek !~ /<(Tr\b|endtab)/ ) {
                    my $token = $p->get_token or next;
                    print $token->as_is, " + ";
                }
                print "\n";
            }
        }

        __DATA__
        <Tr><Tc>PA Group (N <26> 23)<Tc>COM Group (N <26> 24)<Tc>
        <Tr>Gender<Tc><Tc><Tc><124><sup>2<reset> test <26> 0.216, <mdit>df<med> <26> 1, <mdit>P<med> <26> 0.642
        <Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc>
        <Tr><ems>Female (%)<Tc>9<ths>(39.1)<Tc>11<ths>(45.8)<Tc>
        <Tr>Ethnicity<Tc><Tc><Tc><124><sup>2<reset> test <26> 24.99, <mdit>df<med> <26> 4, <mdit>P<med> <178> 0.001
        <Tr><ems>African American (%)<Tc>5<ths>(21.7)<Tc>2<ths>(8.3)<Tc>
        <Tr><ems>European American (%)<Tc>5<ths>(21.7)<Tc>17<ths>(70.8)<Tc>
        <Tr><ems>Asian American (%)<Tc>0<Tc>4<ths>(16.7)<Tc>
        <Tr><ems>Hispanic American (%)<Tc>0<Tc>1<ths>(4.2)<Tc>
        <Tr><ems>Other (%)<Tc>9<ths>(39.1)<Tc>0<Tc>
        <Tr>Age, yr (SD)<Tc>46.05<ths>(6.13)<Tc>30.35<ths>(10.85)<Tc><mdit>t<med> <26> 5.94, <mdit>P<med> <178> 0.001
        <Tr>Education, yr (SD)<Tc>11.37<ths>(2.31)<Tc>15.85<ths>(1.75)<Tc><mdit>t<med> <26> 7.41, <mdit>P<med> <178> 0.001
        <Tr>Pain Threshold, <28>C (SD)<Tc>48.75<ths>(2.44)<Tc>47.33<ths>(3.24)<Tc>U* <26> 158.0, <mdit>P<med> <26> 0.012
        <Tr;;4><ems>Males (SD) <26> 47.77 (3.35)&dagger;
        <Tr;;4><ems>Females (SD) <26> 48.26 (2.36)&dagger;
        <Tr;;4><ems>U&Dagger; <26> 244, <mdit>P<med> <26> 0.582<endtab>
Re: Parsing Table
by dasgar (Priest) on Aug 17, 2010 at 15:06 UTC

    I admit that I'm rusty on my HTML code, but I'm wondering if this is correct HTML formatting. If it's not, then the HTML parsers might not be able to parse this data correctly.

    The reason that I'm questioning the formatting is:

    • I don't see the table HTML tags. (I would have suggested keying in on the table tags, but I don't see them in the sample data that you provided.)
    • I see the open TR tags, but not the corresponding close TR tags.
    • I see some tags that I don't recognize as being HTML (Tc, ems, ths, mdit, ...).
    As I said, I'm rusty on my HTML code, so I could be totally wrong about the formatting. I apologize if I am wrong about that.

Re: Parsing Table
by prasadbabu (Prior) on Aug 17, 2010 at 09:42 UTC

    Hi tej,

    As marto said, it is always better to use a parser to get a clean solution. However, you have not shown the exact output you want, so the code below may satisfy your requirement.

        $str = '<Tr><Tc>PA Group (N <26> 23)<Tc>COM Group (N <26> 24)<Tc>
        <Tr>Gender<Tc><Tc><Tc><124><sup>2<reset> test <26> 0.216, <mdit>df<med> <26> 1, <mdit>P<med> <26> 0.642
        <Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc>
        <Tr><ems>Female (%)<Tc>9<ths>(39.1)<Tc>11<ths>(45.8)<Tc>
        <Tr>Ethnicity<Tc><Tc><Tc><124><sup>2<reset> test <26> 24.99, <mdit>df<med> <26> 4, <mdit>P<med> <178> 0.001
        <Tr><ems>African American (%)<Tc>5<ths>(21.7)<Tc>2<ths>(8.3)<Tc>
        <Tr><ems>European American (%)<Tc>5<ths>(21.7)<Tc>17<ths>(70.8)<Tc>
        <Tr><ems>Asian American (%)<Tc>0<Tc>4<ths>(16.7)<Tc>
        <Tr><ems>Hispanic American (%)<Tc>0<Tc>1<ths>(4.2)<Tc>
        <Tr><ems>Other (%)<Tc>9<ths>(39.1)<Tc>0<Tc>
        <Tr>Age, yr (SD)<Tc>46.05<ths>(6.13)<Tc>30.35<ths>(10.85)<Tc><mdit>t<med> <26> 5.94, <mdit>P<med> <178> 0.001
        <Tr>Education, yr (SD)<Tc>11.37<ths>(2.31)<Tc>15.85<ths>(1.75)<Tc><mdit>t<med> <26> 7.41, <mdit>P<med> <178> 0.001
        <Tr>Pain Threshold, <28>C (SD)<Tc>48.75<ths>(2.44)<Tc>47.33<ths>(3.24)<Tc>U* <26> 158.0, <mdit>P<med> <26> 0.012
        <Tr;;4><ems>Males (SD) <26> 47.77 (3.35)&dagger;
        <Tr;;4><ems>Females (SD) <26> 48.26 (2.36)&dagger;
        <Tr;;4><ems>U&Dagger; <26> 244, <mdit>P<med> <26> 0.582<endtab>';

        $str = qr{$str};

        while ($str =~ m|((<Tr(?:\;)*(?:\d)*>)([^\n]*)<Tc>)|g){
            push (@cells, $1)
        }

        print @cells;

    Prasad
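
For comparison, here is a minimal sketch of a split-based alternative that uses only core Perl. It rests on assumptions not confirmed in the thread: that every row starts at `<Tr>` or a variant such as `<Tr;;4>`, that `<Tc>` separates cells within a row, and that `<endtab>` ends the table. The name `parse_cells` is hypothetical. It also sidesteps one likely reason the original pattern falls short: in Perl a quantified capturing group such as `(...)+` keeps only its last repetition, so it cannot collect every cell by itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of rows,
# each row being an array reference of cell strings.
sub parse_cells {
    my ($table) = @_;

    # Assumption: <endtab> marks the end of the table.
    $table =~ s/<endtab>\s*\z//;

    my @rows;

    # Assumption: a row starts at <Tr> or a variant such as <Tr;;4>.
    for my $row ( split /<Tr[;\d]*>/, $table ) {
        next unless $row =~ /\S/;    # skip the empty piece before the first <Tr>

        # Assumption: <Tc> separates cells. The limit -1 keeps trailing empty cells.
        my @cells = split /<Tc>/, $row, -1;

        s/^\s+|\s+$//g for @cells;   # trim whitespace around each cell
        push @rows, \@cells;
    }
    return @rows;
}

my $sample = '<Tr><Tc>PA Group<Tc>COM Group<Tc> '
           . '<Tr><ems>Male (%)<Tc>14<ths>(60.9)<Tc>13<ths>(54.2)<Tc> '
           . '<Tr;;4><ems>U&Dagger; <26> 244<endtab>';

for my $row ( parse_cells($sample) ) {
    print join( ' | ', map { "[$_]" } @$row ), "\n";
}
```

Inline tags such as `<ems>` and `<ths>` are left inside the cell text here; whether to strip or translate them depends on what the output should look like.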

Re: Parsing Table
by tej (Scribe) on Aug 17, 2010 at 17:05 UTC

    Though the table markup looks like HTML, the input file is neither completely HTML nor XML. Which parser should I use in that case?
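
If no off-the-shelf parser fits, one option is a small hand-written tokenizer. The sketch below is only an illustration under stated assumptions: every tag is a simple `<...>` run with no nested angle brackets, and anything else is text. The name `tokenize` is hypothetical and does not come from any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split the input into [type, value] pairs,
# where type is 'tag' or 'text'.
sub tokenize {
    my ($input) = @_;
    my @tokens;

    pos($input) = 0;
    while ( pos($input) < length($input) ) {
        if ( $input =~ /\G<([^<>]*)>/gc ) {
            # A tag: everything between '<' and the next '>'.
            push @tokens, [ tag => $1 ];
        }
        elsif ( $input =~ /\G([^<]+)/gc ) {
            # A run of text up to the next '<'.
            push @tokens, [ text => $1 ];
        }
        else {
            # A stray '<' that never closes: treat it as one character of text.
            $input =~ /\G(.)/gcs;
            push @tokens, [ text => $1 ];
        }
    }
    return @tokens;
}

# Print one token per line: its type, then its value.
for my $token ( tokenize('<Tr><ems>Male (%)<Tc>14<ths>(60.9)') ) {
    printf "%-4s %s\n", @$token;
}
```

With the tokens in hand, rows and cells can be assembled by starting a new row on each `Tr` tag and a new cell on each `Tc` tag, while deciding case by case what to do with the remaining inline tags.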
