table within table

Replies are listed 'Best First'.
Re: table within table by CountZero (Bishop) on Feb 10, 2010 at 13:29 UTC
Have a look at HTML::TableExtract. The description of this module says:

"Depth and Count are more specific ways to specify tables in relation to one another. Depth represents how deeply a table resides in other tables. The depth of a top-level table in the document is 0. A table within a top-level table has a depth of 1, and so on. Each depth can be thought of as a layer; tables sharing the same depth are on the same layer. Within each of these layers, Count represents the order in which a table was seen at that depth, starting with 0. Providing both a depth and a count will uniquely specify a table within a document."

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
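To make the depth and count coordinates from the HTML::TableExtract description a bit more concrete, here is a minimal sketch. It assumes the module is installed from CPAN and that calling new() without constraints matches every table; the sample HTML string and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: one outer table holding two inner tables,
# the second of which holds a third-level table.
my $html = <<'HTML';
<table><tr>
  <td><table><tr><td>inner A</td></tr></table></td>
  <td><table><tr><td>inner B</td></tr>
             <tr><td><table><tr><td>deep</td></tr></table></td></tr>
      </table></td>
</tr></table>
HTML

# No constraints: collect every table so we can list its coordinates.
my $all = HTML::TableExtract->new;
$all->parse($html);
$all->eof;
my @coords = sort map { join ',', $_->coords } $all->tables;
print "found table at depth,count = $_\n" for @coords;

# Constrain by depth and count: the second table seen at depth 1.
my $te = HTML::TableExtract->new( depth => 1, count => 1 );
$te->parse($html);
$te->eof;
for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        # Empty cells may come back undefined, so map them to ''.
        print join( ' | ', map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}
```

Listing the coordinates first is a handy way to find out which depth and count to ask for before extracting a specific table.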
Re: table within table by Anonymous Monk on Feb 10, 2010 at 10:44 UTC
Do you feel lucky? extract table, extract table site:perlmonks.org, extract table site:cpan.org
Re: table within table by quester (Vicar) on Feb 10, 2010 at 10:55 UTC
Umm... you need some more explanation here. What determines the order of the tables in the output? (We might be able to guess if we knew which of the two test3's in your output was intended to have been test4 instead...)
Re: table within table by Utilitarian (Vicar) on Feb 10, 2010 at 11:59 UTC
Hi and good day to you, sir.

Programming 101 - look at the problem slowly and describe what needs to be done in simple steps:

Find a table tag preceded by a td tag
Store that tag and everything up to the end of the td tag
Repeat this globally for the html body

Take a look at perlre, try to implement a solution and come back to us with code that shows any problem you are having.

Anon below is correct, rolling your own regex for a pre-tokenised format is the wrong approach.

`print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
Re^2: table within table by Anonymous Monk on Feb 10, 2010 at 12:03 UTC
perlre is not for html
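A small sketch of why a plain regex struggles with nested tables. It uses only core Perl; the input string is a made-up example, and the pattern is a generic non-greedy `<table>...</table>` match, not anyone's production code.

```perl
use strict;
use warnings;

# Hypothetical input: a table nested inside another table.
my $html = '<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>';

# Non-greedy match from a <table> to the nearest </table>.
my ($match) = $html =~ m{(<table>.*?</table>)}s;

# The match begins at the OUTER <table> but ends at the INNER </table>,
# so it is neither of the two real tables.
my $opens  = () = $match =~ /<table>/g;
my $closes = () = $match =~ m{</table>}g;

print "matched: $match\n";
print "open tags: $opens, close tags: $closes\n";
```

The captured text has two opening tags and only one closing tag, which is the kind of unbalanced fragment a tokenising parser avoids by tracking depth.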
Re^2: table within table by johncute (Initiate) on Feb 11, 2010 at 09:35 UTC
Here is my code:

    $ctr=0;
    while(/(<table>[^\000]*?<\/table>)/){
      $text=$1;
      while($text=~/<table>/){
        $tag=$&;
        $ctr=$ctr+1;
        $tag=~s/(<table)/\1$ctr/;
        $text=~s/<table>/$tag/;
      }
      $text=~s/(<table)$ctr>/\1_level$ctr>/g;
      $text=~s/(<\/table)>/\1_level$ctr>/g;
      $ctr=0;
      $text=~s/(<table)[0-9]+>/\1>/g;
      $text=~s/(<\/?)(thead|tbody)([^>]*)?>//g;
      $text=~s/(<\/?)(th)([^>]*)?>/$1td>/g;
      while($text =~ /<a href="([^"]*)">[^\000]*?<\/a>/){
        $href = $1;
        $class = "";
        if($href =~ /^http/i){  $class = "http";}
        if($href =~ /^www/i){   $class = "nohttp";}
        if($href =~ /^mailto/i){$class = "mailto";}
        if($href =~ /^ftp/i){   $class = "ftp";}
        if($class eq ""){
          $text =~ s/<a href="([^"]*)">([^\000]*?)<\/a>/\2/;
        }else{
          $text =~ s/<a href="([^"]*)">([^\000]*?)<\/a>/<remotelink hrefclass="$class" href="\1" >\2<\/remotelink>/;
        }
      }
      s/<table>[^\000]*?<\/table>/$text/;
    }

    # Remove table and img tags inside table if an <img /> tag was encountered
    while (/<table_level(2|3)>[^\000]*?<\/table_level\1>/) {
      $table2=$&;
      if ($table2 =~ /<img /) {
        # Remove all table tags including <img /> tag
        $table2=~s/<\/?(table_level(2|3)|tr|td)(\s+[^>]*)?>|<img\s+[^>]*\/>//g;
        s/<table_level(2|3)>[^\000]*?<\/table_level\1>/$table2/;
      }
      else {
        $table2=~s/(<\/?table_level\d)/$1_temp/g;
        s/<table_level(2|3)>[^\000]*?<\/table_level\1>/$table2/;
      }
    }
    s/(<\/?table_level\d)_temp/$1/g;

    # Extract table inside table if no <img /> tag was encountered
    # inside the inner table.
    while (/(<table_level1>[^\000]*?<\/table_level1>)/) {
      $table1=$1;
      #$table2="";
      $table="";
      while ($table1=~ /<table_level2>([^\000]*?)<\/table_level2>/) {
        $table2=$1;
        $table=$&;

        # Extract inner table and place it after the second level table
        $extracted_table3="";
        while ($table2 =~ s/(<table_level3>[^\000]*?<\/table_level3>)//) {
          $extracted_table3="$extracted_table3\n$1";
        }

        $table2=~s/<table_level2>([^\000]*?)<\/table_level2>/$table$extracted_table3/g;
        #$table2=~s/(<table_level2>[^\000]*?<\/table_level2>)/$1$extracted_table3/g;
        s/(<table_level2>[^\000]*?<\/table_level2>)//;
        #$table2=~s/(<\/?table)_level2/$1_2/g;
      }
      $table1=~s/<table_level2>([^\000]*?)<\/table_level2>//;
      $table1=~s/<table_2>([^\000]*?)<\/table_2>//;
      $table1=~s/(<\/?table)_level1/$1/g;
      s/<table_level1>[^\000]*?<\/table_level1>/$table2$table1/;
    }
    s/(<\/?table)_(level\d|\d)/$1/g;

And here is my sample data:

    <table>
        <tr>
            <td>
                <table>
                    <thead>
                        <tr> <th>Vill</th> <th>Hi</th> <th>Au</th> </tr>
                    </thead>
                    <tbody>
                        <tr> <td>Aix</td> <td>40</td> <td>27</td> </tr>
                        <tr> <td>Freib</td> <td>30</td> <td></td> </tr>
                        <tr> <td>Gdan</td> <td>20</td> <td>13</td> </tr>
                        <tr> <td>Gd</td> <td>44</td> <td>14</td> </tr>
                        <tr> <td>Gren</td> <td>33</td> <td>22</td> </tr>
                        <tr> <td>Karl</td> <td>26</td> <td></td> </tr>
                        <tr> <td>La</td> <td>31</td> <td>18</td> </tr>
                        <tr> <td></td> <td>30</td> <td>20</td> </tr>
                        <tr> <td>Lyon</td> <td>41</td> <td>19</td> </tr>
                        <tr> <td>Man</td> <td>22</td> <td></td> </tr>
                        <tr> <td>Mar</td> <td>32</td> <td>18</td> </tr>
                        <tr> <td>Mar</td> <td>17</td> <td>13</td> </tr>
                        <tr> <td>Mon</td> <td>36</td> <td>26</td> </tr>
                        <tr> <td>Mul</td> <td>30</td> <td>45</td> </tr>
                        <tr> <td>Mun</td> <td>28</td> <td>23</td> </tr>
                        <tr> <td>Nice</td> <td>41</td> <td>17</td> </tr>
                        <tr> <td>Nims</td> <td>34</td> <td>25</td> </tr>
                        <tr> <td>Nio</td> <td>29</td> <td>21</td> </tr>
                        <tr> <td>Orleans</td> <td>32</td> <td>17</td> </tr>
                        <tr> <td>Pad</td> <td>36</td> <td>20</td> </tr>
                        <tr> <td>Paris</td> <td>24</td> <td>29</td> </tr>
                        <tr> <td>Perk</td> <td>38</td> <td>29</td> </tr>
                        <tr> <td>Poit</td> <td>27</td> <td>24</td> </tr>
                        <tr> <td>Prag</td> <td>26</td> <td>16</td> </tr>
                        <tr> <td></td> <td>23</td> <td>14</td> </tr>
                        <tr> <td>Ren</td> <td>30</td> <td>18</td> </tr>
                        <tr> <td>Rot</td> <td>36</td> <td>27</td> </tr>
                        <tr> <td>Rou</td> <td>45</td> <td>22</td> </tr>
                        <tr> <td>Saint</td> <td>33</td> <td>20</td> </tr>
                        <tr> <td>Salon</td> <td>33</td> <td>18</td> </tr>
                        <tr> <td>Sev</td> <td>63</td> <td>29</td> </tr>
                        <tr> <td>Sop</td> <td>19</td> <td>8</td> </tr>
                        <tr> <td>Stra</td> <td>28</td> <td>26</td> </tr>
                        <tr> <td>Stut</td> <td>26</td> <td></td> </tr>
                        <tr> <td>logne</td> <td>22</td> <td>11</td> </tr>
                        <tr> <td>lon</td> <td>31</td> <td>22</td> </tr>
                        <tr> <td>use</td> <td>28</td> <td>17</td> </tr>
                        <tr> <td>Ts</td> <td>29</td> <td>22</td> </tr>
                        <tr> <td>Val</td> <td>36</td> <td>23</td> </tr>
                        <tr> <td>Zur</td> <td>29</td> <td>22</td> </tr>
                    </tbody>
                </table>
            </td>
            <td>
                <table>
                    <tr> <td><span><strong>Legend</strong></span></td> </tr>
                    <tr>
                        <td>
                            <table>
                                <thead>
                                    <tr> <th>head1</th> <th>head2</th> </tr>
                                </thead>
                                <tbody>
                                    <tr> <td>bon</td> <td></td> <td>0 / 25</td> </tr>
                                    <tr> <td>Ton</td> <td></td> <td>25 / 50</td> </tr>
                                    <tr> <td>Don</td> <td></td> <td>50 / 75</td> </tr>
                                    <tr> <td>Con</td> <td></td> <td>75 / 100</td> </tr>
                                    <tr> <td>Trs</td> <td></td> <td> 100</td> </tr>
                                </tbody>
                            </table>
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
        <tr> <td colspan="2"></td> </tr>
        <tr> <td colspan="2">This is a sample content</td> </tr>
        <tr> <td colspan="2"></td> </tr>
        <tr> <td colspan="2">Site : <a href="http://www.yahoo.com" target="_blank">www.yahoo.com</a></td> </tr>
    </table>

The output should be like this: if a table consists of 3 levels, level 2 should be placed above level 1, and level 3 should be placed below level 2.

The output will be:

level2
level3
level1

If there is a table deeper than the 3rd level, down to the deepest level, it will be output below the 3rd level.

Example:

level2
level3
level4
level5
level6
level7
level8
level1

Hope I explained it well.
Re^3: table within table by Utilitarian (Vicar) on Feb 11, 2010 at 10:38 UTC
Seriously, this is better achieved using one of the HTML::Parser modules. For example, take a look at HTML::TokeParser:

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new("file.html")    # your source file
        || die "Can't open: $!";

    my $depth = 0;
    while ( my $token = $p->get_token ) {
        my ( $type, $tag ) = @{$token};
        # Only start ("S") and end ("E") tokens carry a tag name in slot 1
        next unless ( $type eq "S" || $type eq "E" ) && lc($tag) eq "table";
        $depth++ if $type eq "S";
        $depth-- if $type eq "E";
        print "$depth\n";
    }

Try out the code above and see where it takes you.

`print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
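Taking the depth counter one step further, the following is a rough sketch of how the requested ordering (inner tables first, outermost table last) might be produced with HTML::TokeParser. The helper names raw_text and flatten_tables are hypothetical, the module is assumed to be installed from CPAN, and the sketch assumes well-formed input: tables left unclosed are silently dropped, as is any text outside a table.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the raw source text of a token, whatever its type.
sub raw_text {
    my ($t) = @_;
    return $t->[4] if $t->[0] eq 'S';     # start tag
    return $t->[2] if $t->[0] eq 'E';     # end tag
    return $t->[2] if $t->[0] eq 'PI';    # processing instruction
    return $t->[1];                       # text, comment, declaration
}

# Hypothetical helper: split nested tables apart and return them as a list,
# inner tables first (in the order they open), outermost table last.
sub flatten_tables {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Can't parse input";
    my ( @open, @closed );
    my $seq = 0;
    while ( my $t = $p->get_token ) {
        my $raw = raw_text($t);
        if ( $t->[0] eq 'S' && $t->[1] eq 'table' ) {
            # depth is 1 for a top-level table, 2 for a table inside it, ...
            push @open,
              { seq => $seq++, depth => scalar(@open) + 1, html => $raw };
        }
        elsif ( $t->[0] eq 'E' && $t->[1] eq 'table' && @open ) {
            my $tbl = pop @open;
            $tbl->{html} .= $raw;
            $closed[ $tbl->{seq} ] = $tbl;
        }
        elsif (@open) {
            # Content belongs to the innermost table that is still open.
            $open[-1]{html} .= $raw;
        }
    }
    my ( @out, $root );
    for my $tbl ( grep { defined } @closed ) {
        if ( $tbl->{depth} == 1 ) {
            push @out, $root->{html} if $root;    # flush previous outer table
            $root = $tbl;
        }
        else {
            push @out, $tbl->{html};
        }
    }
    push @out, $root->{html} if $root;
    return @out;
}

my @tables = flatten_tables(
        '<table><tr><td>L1<table><tr><td>L2<table><tr><td>L3</td></tr></table>'
      . '</td></tr></table></td></tr></table>' );
print "$_\n" for @tables;
```

Each nested table is cut out of its parent cell, so the parent keeps an empty cell where the inner table used to be. When one outer table holds several second-level tables, this sketch emits them in the order they open, which is only one possible reading of the requested layout.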

