http://qs321.pair.com?node_id=14932
Category: HTML Utilities
Author/Contact Info Big Joe Big_Joe1008@linuxstart.com
Description: Run this script on an HTML document to remove all embedded (nested) tables from it. It assumes the tables are correctly nested in the document. By default it removes every embedded table and leaves the main table, but you can change how many levels of nested tables are kept by changing the numofTables variable (1 keeps only the outermost table).
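#!/usr/bin/perl -w
use strict;

my $inputfile   = "test.htm";
my $outputfile  = "outfile2.html";
my $numofTables = 1;    # how many levels of table nesting to keep

# read the whole input file into $thispage
open(INFILE, '<', $inputfile) or die("no file $inputfile: $!");
my $thispage = do { local $/; <INFILE> };
close(INFILE);

# this removes any line breaks (<BR> tags)
$thispage =~ s/<BR>/ /g;
$thispage =~ s/<\/BR>/ /g;

# split on whitespace (a double-quoted "\s" is just the letter "s")
my @myarray = split(' ', $thispage);
open(OUTFILE, '>', $outputfile) or die("cannot write $outputfile: $!");

my $start = 0;    # current nesting depth of tables and scripts
foreach (@myarray) {
    # this is not too clean, but the ASP that wrote the HTML
    # put the table tags and script tags on their own line
    if (($_ =~ m/<TABLE/) || ($_ =~ m/<SCRIPT/)) {
        $start++;
    }
    # only print content that is not nested too deeply
    if ($start <= $numofTables) {
        print OUTFILE "$_ ";
    }
    if ($_ =~ m/<\/TABLE>/) {
        $start--;
        # start a fresh row and cell where the table ended
        print OUTFILE "</TR><TR>\n<TD>";
    } elsif ($_ =~ m/<\/SCRIPT>/) {
        $start--;
    }
}

close(OUTFILE);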
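
The depth-counting idea can also be tried on a string instead of a file. The following is only a minimal sketch: strip_nested_tables is a hypothetical helper, not part of the script above. It splits on tags rather than whitespace and matches tags case-insensitively, and it leaves out the script's SCRIPT handling and the extra </TR><TR><TD> it prints after each closing table tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): keep tables up to
# $max_depth levels deep and drop everything nested deeper.
sub strip_nested_tables {
    my ($html, $max_depth) = @_;
    my $depth = 0;     # current table nesting depth
    my $out   = '';

    # Split into tags and text; the capturing group keeps the tags in the list.
    foreach my $chunk (split /(<[^>]*>)/, $html) {
        next if $chunk eq '';
        $depth++ if $chunk =~ /^<table\b/i;
        $out .= $chunk if $depth <= $max_depth;
        $depth-- if $chunk =~ /^<\/table\s*>/i && $depth > 0;
    }
    return $out;
}

my $html = '<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>';
print strip_nested_tables($html, 1), "\n";
```

With a depth of 1 this prints <table><tr><td>outer</td></tr></table>; with a depth of 2 the sample string comes back unchanged.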
RE: embedded table remover
by merlyn (Sage) on May 27, 2000 at 00:01 UTC
    Perhaps a more robust (and shorter) solution can be created on top of HTML::Table, part of LWP. Amazing how much reinvention happens (creating more fragile solutions) when you don't check the CPAN first. :)

    -- Randal L. Schwartz, Perl hacker

      HTML::Table is used for creating tables, rather than reading them. I suspect you meant HTML::TableExtract?

      Again, however, I suspect that that won't really work either as it discards all information that it doesn't need.

      You probably just want to build a handler onto HTML::Parser:

      #!/usr/bin/perl -w
      use strict;
      use HTML::Parser;

      my $in_table = 0;

      my $p = HTML::Parser->new(
          default_h => [ sub { print shift unless $in_table }, 'text'],
          start_h   => [ sub { shift eq 'table' ? $in_table++ : $in_table || print shift }, 'tagname, text'],
          end_h     => [ sub { shift eq 'table' ? $in_table-- : $in_table || print shift }, 'tagname, text'],
      );

      $p->parse_file(shift || die "Need a file") || die $!;

      Tony

      I read up on that and really didn't understand it. It showed how to access the data but I wanted to just remove all the embedded tables.
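
      For what it's worth, the handler approach can be bent to remove only the embedded tables by counting depth instead of using a simple in/out flag. This is a rough sketch, assuming HTML::Parser is installed from CPAN; remove_nested_tables and its $max_depth argument are made-up names for illustration, not something from the module or from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return $html with tables nested deeper than
# $max_depth removed (1 keeps only the outermost table).
sub remove_nested_tables {
    my ($html, $max_depth) = @_;
    my $depth = 0;     # current table nesting depth
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Opening tags: count the table first, then decide whether to keep it.
        start_h => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $tag eq 'table';
            $out .= $text if $depth <= $max_depth;
        }, 'tagname, text' ],
        # Closing tags: decide first, then leave the table.
        end_h => [ sub {
            my ($tag, $text) = @_;
            $out .= $text if $depth <= $max_depth;
            $depth-- if $tag eq 'table' && $depth > 0;
        }, 'tagname, text' ],
        # Everything else (text, comments, declarations).
        default_h => [ sub {
            my ($text) = @_;
            $out .= $text if defined $text && $depth <= $max_depth;
        }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

my $html = '<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>';
print remove_nested_tables($html, 1), "\n";
```

      The only real difference from the snippet above is that each handler compares a depth counter against a limit, so the outer table survives while anything nested inside it is skipped.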