
embedded table remover

by BigJoe (Curate)
on May 26, 2000 at 11:15 UTC (#14932, sourcecode)
Category: HTML Utilities
Author/Contact Info Big Joe
Description: Run this script on an HTML document to remove all embedded (nested) tables from it, assuming the tables in the document are written correctly. By default it removes every embedded table and leaves the main table, but you can also set how many embedded tables are allowed by changing the numofTables variable.
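#!/usr/bin/perl -w
use strict;

# Assumption: the file names come from the command line; the listing
# never shows where $inputfile and $outputfile are set.
my ($inputfile, $outputfile) = @ARGV;
die "usage: $0 inputfile outputfile\n" unless defined $outputfile;

open(INFILE, '<', $inputfile) or die "no file $inputfile: $!";
my $filesize = -s INFILE;
read(INFILE, my $thispage, $filesize);
close(INFILE);

# this removes any line breaks
$thispage =~ s/<BR>/ /g;
$thispage =~ s/<\/BR>/ /g;

# split on whitespace (the original "\s" in double quotes is just the letter s)
my @myarray = split(/\s+/, $thispage);
open(OUTFILE, '>', $outputfile) or die "cannot write $outputfile: $!";

# this is not too clean, but because of the ASP that wrote the HTML,
# put the table tags and script tags on their own line
foreach (@myarray) {
    if (($_ =~ m/<TABLE/) || ($_ =~ m/<SCRIPT/)) {
        # Assumption: the body of this test is missing from the listing;
        # starting a new line matches the comment above.
        print OUTFILE "\n";
    }
    print OUTFILE "$_ ";
    if ($_ =~ m/<\/TABLE>/) {
        print OUTFILE "</TR><TR>\n<TD>";
    } elsif ($_ =~ m/<\/SCRIPT>/) {
        # the rest of the script is missing from this listing
    }
}
close(OUTFILE);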
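
The listing above stops before the part that removes nested tables. As a rough illustration of the idea in the description, here is a minimal sketch that counts table nesting depth and drops everything deeper than the allowed level. The function name `remove_embedded_tables` is hypothetical, and treating numofTables as a nesting depth is an assumption; none of this is taken from the original script. Like the description, it assumes well-formed tables, and it also assumes no `>` appears inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# $numofTables is read here as the deepest table level to keep
# (1 = keep only the main table); that reading is an assumption.
sub remove_embedded_tables {
    my ($html, $numofTables) = @_;
    $numofTables = 1 unless defined $numofTables;

    my $depth = 0;    # how many <TABLE> tags are currently open
    my $out   = '';

    # Split into tags and the text between them; the capturing
    # group makes split return the tags as well.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /^<TABLE\b/i) {
            $depth++;
            $out .= $piece if $depth <= $numofTables;
        }
        elsif ($piece =~ m{^</TABLE\s*>}i) {
            # Decide whether to keep the closing tag before leaving the level.
            $out .= $piece if $depth <= $numofTables;
            $depth-- if $depth > 0;
        }
        else {
            $out .= $piece if $depth <= $numofTables;
        }
    }
    return $out;
}
```

Calling `remove_embedded_tables($thispage)` would keep the main table only, while `remove_embedded_tables($thispage, 2)` would also keep tables nested one level inside it.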

Replies are listed 'Best First'.
RE: embedded table remover
by merlyn (Sage) on May 27, 2000 at 00:01 UTC
    Perhaps a more robust (and shorter) solution can be created on top of HTML::Table, part of LWP. Amazing how much reinvention happens (creating more fragile solutions) when you don't check the CPAN first. :)

    -- Randal L. Schwartz, Perl hacker

      HTML::Table is used for creating tables, rather than reading them. I suspect you meant HTML::TableExtract?

      Again, however, I suspect that that won't really work either as it discards all information that it doesn't need.

      You probably just want to build a handler onto HTML::Parser:

      #!/usr/bin/perl -w
      use strict;
      use HTML::Parser;

      my $in_table = 0;

      my $p = HTML::Parser->new(
          default_h => [ sub { print shift unless $in_table }, 'text'],
          start_h   => [ sub { shift eq 'table' ? $in_table++ : $in_table || print shift }, 'tagname, text'],
          end_h     => [ sub { shift eq 'table' ? $in_table-- : $in_table || print shift }, 'tagname, text'],
      );

      $p->parse_file(shift || die "Need a file") || die $!;
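
The handler above drops every table, including the outermost one. A minimal sketch of a variant that keeps the main table and drops only the tables nested inside it might look like the following. The function name `strip_nested_tables` and the `$max_depth` parameter are hypothetical and introduced only for illustration; the `HTML::Parser` calls used (`new` with `start_h`, `end_h` and `default_h`, then `parse` and `eof`) are the module's version 3 interface.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns $html with every table nested deeper
# than $max_depth removed (default 1 = keep only the outermost table).
sub strip_nested_tables {
    my ($html, $max_depth) = @_;
    $max_depth = 1 unless defined $max_depth;

    my $depth = 0;    # current table nesting level
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $tag eq 'table';
            $out .= $text if $depth <= $max_depth;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            # Check the depth before leaving the table so that the
            # closing tag is treated the same as its opening tag.
            $out .= $text if $depth <= $max_depth;
            $depth-- if $tag eq 'table' && $depth > 0;
        }, 'tagname, text' ],
        # Text, comments and declarations all arrive here.
        default_h => [ sub { $out .= shift if $depth <= $max_depth }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}
```

Because `tagname` is reported in lower case, this should match `<TABLE>` and `<table>` alike. The result is returned as a string rather than printed, so it can be written to whatever output file is wanted.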


      I read up on that and really didn't understand it. It showed how to access the data but I wanted to just remove all the embedded tables.