http://qs321.pair.com?node_id=822386

johncute has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: table within table
by CountZero (Bishop) on Feb 10, 2010 at 13:29 UTC
    Have a look at HTML::TableExtract.

    The description of this module says:

    Depth and Count are more specific ways to specify tables in relation to one another. Depth represents how deeply a table resides in other tables. The depth of a top-level table in the document is 0. A table within a top-level table has a depth of 1, and so on. Each depth can be thought of as a layer; tables sharing the same depth are on the same layer. Within each of these layers, Count represents the order in which a table was seen at that depth, starting with 0. Providing both a depth and a count will uniquely specify a table within a document.
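    As a rough illustration of the depth/count addressing described above, here is a minimal sketch. The sample HTML and the variable names are made up for this example; only the `depth` and `count` constructor options and the `tables`, `coords` and `rows` methods come from HTML::TableExtract itself.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample: one outer table holding two sibling inner tables
my $html = join '',
    '<table><tr><td>outer',
    '<table><tr><td>inner A</td></tr></table>',
    '<table><tr><td>inner B</td></tr></table>',
    '</td></tr></table>';

# With no depth/count/headers constraints, every table is a match
my $te = HTML::TableExtract->new();
$te->parse($html);
$te->eof;

# Record each table's (depth, count) coordinates
my @coords;
for my $ts ($te->tables) {
    my ($depth, $count) = $ts->coords;
    push @coords, "$depth,$count";
    print "depth=$depth count=$count\n";
}

# Ask only for the second table (count 1) on the nested layer (depth 1)
my $picked = HTML::TableExtract->new(depth => 1, count => 1);
$picked->parse($html);
$picked->eof;
my ($inner) = $picked->tables;
my @rows = $inner ? $inner->rows : ();
print join(',', map { defined $_ ? $_ : '' } @$_), "\n" for @rows;
```

    The first pass only reports where each table sits; the second pass shows that a depth plus a count is enough to single out one table.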

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: table within table
by Anonymous Monk on Feb 10, 2010 at 10:44 UTC
Re: table within table
by quester (Vicar) on Feb 10, 2010 at 10:55 UTC
    Umm... you need some more explanation here. What determines the order of the tables in the output? (We might be able to guess if we knew which of the two test3's in your output was intended to have been test4 instead...)
Re: table within table
by Utilitarian (Vicar) on Feb 10, 2010 at 11:59 UTC
    Hi and good day to you, sir.

    Programming 101 - look at the problem slowly and describe what needs to be done in simple steps

    • Find a table tag preceded by a td tag
    • Store that tag and everything up to the end of td tag
    • Repeat this globally for the html body
    Take a look at perlre, try to implement a solution and come back to us with code that shows any problem you are having.

    Anon below is correct: rolling your own regex for a pre-tokenised format is the wrong approach.

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
      perlre is not for html

      here is my code:

      $ctr=0;
      while(/(<table>[^\000]*?<\/table>)/){
          $text=$1;
          while($text=~/<table>/){
              $tag=$&;
              $ctr=$ctr+1;
              $tag=~s/(<table)/\1$ctr/;
              $text=~s/<table>/$tag/;
          }
          $text=~s/(<table)$ctr>/\1_level$ctr>/g;
          $text=~s/(<\/table)>/\1_level$ctr>/g;
          $ctr=0;
          $text=~s/(<table)[0-9]+>/\1>/g;
          $text=~s/(<\/?)(thead|tbody)([^>]*)?>//g;
          $text=~s/(<\/?)(th)([^>]*)?>/$1td>/g;
          while($text =~ /<a href="([^"]*)">[^\000]*?<\/a>/){
              $href = $1;
              $class = "";
              if($href =~ /^http/i){ $class = "http";}
              if($href =~ /^www/i){ $class = "nohttp";}
              if($href =~ /^mailto/i){$class = "mailto";}
              if($href =~ /^ftp/i){ $class = "ftp";}
              if($class eq ""){
                  $text =~ s/<a href="([^"]*)">([^\000]*?)<\/a>/\2/;
              }else{
                  $text =~ s/<a href="([^"]*)">([^\000]*?)<\/a>/<remotelink hrefclass="$class" href="\1" >\2<\/remotelink>/;
              }
          }
          s/<table>[^\000]*?<\/table>/$text/;
      }

      # Remove table and img tags inside table if an <img /> tag was encountered
      while (/<table_level(2|3)>[^\000]*?<\/table_level\1>/) {
          $table2=$&;
          if ($table2 =~ /<img /) {
              # Remove all table tags including <img /> tag
              $table2=~s/<\/?(table_level(2|3)|tr|td)(\s+[^>]*)?>|<img\s+[^>]*\/>//g;
              s/<table_level(2|3)>[^\000]*?<\/table_level\1>/$table2/;
          } else {
              $table2=~s/(<\/?table_level\d)/$1_temp/g;
              s/<table_level(2|3)>[^\000]*?<\/table_level\1>/$table2/;
          }
      }
      s/(<\/?table_level\d)_temp/$1/g;

      # Extract table inside table if no <img /> tag was encountered
      # inside the inner table.
      while (/(<table_level1>[^\000]*?<\/table_level1>)/) {
          $table1=$1;
          #$table2="";
          $table="";
          while ($table1=~ /<table_level2>([^\000]*?)<\/table_level2>/) {
              $table2=$1;
              $table=$&;
              # Extract inner table and place it after the second level table
              $extracted_table3="";
              while ($table2 =~ s/(<table_level3>[^\000]*?<\/table_level3>)//) {
                  $extracted_table3="$extracted_table3\n$1";
              }
              $table2=~s/<table_level2>([^\000]*?)<\/table_level2>/$table$extracted_table3/g;
              #$table2=~s/(<table_level2>[^\000]*?<\/table_level2>)/$1$extracted_table3/g;
              s/(<table_level2>[^\000]*?<\/table_level2>)//;
              #$table2=~s/(<\/?table)_level2/$1_2/g;
          }
          $table1=~s/<table_level2>([^\000]*?)<\/table_level2>//;
          $table1=~s/<table_2>([^\000]*?)<\/table_2>//;
          $table1=~s/(<\/?table)_level1/$1/g;
          s/<table_level1>[^\000]*?<\/table_level1>/$table2$table1/;
      }
      s/(<\/?table)_(level\d|\d)/$1/g;

      And here is my sample data

      <table> <tr> <td> <table> <thead> <tr> <th>Vill</th> <th>Hi</th> <th>Au</th> </tr> </thead> <tbody> <tr> <td>Aix</td> <td>40</td> <td>27</td> </tr> <tr> <td>Freib</td> <td>30</td> <td></td> </tr> <tr> <td>Gdan</td> <td>20</td> <td>13</td> </tr> <tr> <td>Gd</td> <td>44</td> <td>14</td> </tr> <tr> <td>Gren</td> <td>33</td> <td>22</td> </tr> <tr> <td>Karl</td> <td>26</td> <td></td> </tr> <tr> <td>La</td> <td>31</td> <td>18</td> </tr> <tr> <td></td> <td>30</td> <td>20</td> </tr> <tr> <td>Lyon</td> <td>41</td> <td>19</td> </tr> <tr> <td>Man</td> <td>22</td> <td></td> </tr> <tr> <td>Mar</td> <td>32</td> <td>18</td> </tr> <tr> <td>Mar</td> <td>17</td> <td>13</td> </tr> <tr> <td>Mon</td> <td>36</td> <td>26</td> </tr> <tr> <td>Mul</td> <td>30</td> <td>45</td> </tr> <tr> <td>Mun</td> <td>28</td> <td>23</td> </tr> <tr> <td>Nice</td> <td>41</td> <td>17</td> </tr> <tr> <td>Nims</td> <td>34</td> <td>25</td> </tr> <tr> <td>Nio</td> <td>29</td> <td>21</td> </tr> <tr> <td>Orleans</td> <td>32</td> <td>17</td> </tr> <tr> <td>Pad</td> <td>36</td> <td>20</td> </tr> <tr> <td>Paris</td> <td>24</td> <td>29</td> </tr> <tr> <td>Perk</td> <td>38</td> <td>29</td> </tr> <tr> <td>Poit</td> <td>27</td> <td>24</td> </tr> <tr> <td>Prag</td> <td>26</td> <td>16</td> </tr> <tr> <td></td> <td>23</td> <td>14</td> </tr> <tr> <td>Ren</td> <td>30</td> <td>18</td> </tr> <tr> <td>Rot</td> <td>36</td> <td>27</td> </tr> <tr> <td>Rou</td> <td>45</td> <td>22</td> </tr> <tr> <td>Saint</td> <td>33</td> <td>20</td> </tr> <tr> <td>Salon</td> <td>33</td> <td>18</td> </tr> <tr> <td>Sev</td> <td>63</td> <td>29</td> </tr> <tr> <td>Sop</td> <td>19</td> <td>8</td> </tr> <tr> <td>Stra</td> <td>28</td> <td>26</td> </tr> <tr> <td>Stut</td> <td>26</td> <td></td> </tr> <tr> <td>logne</td> <td>22</td> <td>11</td> </tr> <tr> <td>lon</td> <td>31</td> <td>22</td> </tr> <tr> <td>use</td> <td>28</td> <td>17</td> </tr> <tr> <td>Ts</td> <td>29</td> <td>22</td> </tr> <tr> <td>Val</td> <td>36</td> <td>23</td> </tr> <tr> 
<td>Zur</td> <td>29</td> <td>22</td> </tr> </tbody> </table> </td> <td> <table> <tr> <td><span><strong>Legend</strong></span></td> </tr> <tr> <td> <table> <thead> <tr> <th>head1</th> <th>head2</th> </tr> </thead> <tbody> <tr> <td>bon</td> <td></td> <td>0 / 25</td> </tr> <tr> <td>Ton</td> <td></td> <td>25 / 50</td> </tr> <tr> <td>Don</td> <td></td> <td>50 / 75</td> </tr> <tr> <td>Con</td> <td></td> <td>75 / 100</td> </tr> <tr> <td>Trs</td> <td></td> <td> 100</td> </tr> </tbody> </table> </td> </tr> </table> </td> </tr> <tr> <td colspan="2"></td> </tr> <tr> <td colspan="2">This is a sample content</td> </tr> <tr> <td colspan="2"></td> </tr> <tr> <td colspan="2">Site : <a href="http://www.yahoo.com" target="_blank">www.yahoo.com</a></td> </tr> </table>

      The output should be as follows: if a table consists of 3 levels, level 2 should be placed above level 1, and level 3 should be placed directly below level 2.

      The output will be:

      level2

      level3

      level1

      If a table is nested deeper than the 3rd level, it should be output below the 3rd level, in order of depth down to the deepest level.

      Example

      level2

      level3

      level4

      level5

      level6

      level7

      level8

      level1

      Hope I explained it well.

        Seriously, this is better achieved using one of the HTML::Parser modules. For example, take a look at HTML::TokeParser:

        use strict;
        use warnings;
        use HTML::TokeParser;

        # "file.html" stands in for your source file
        my $p = HTML::TokeParser->new("file.html")
            or die "Can't open: $!";

        my $depth = 0;
        while (my $token = $p->get_token) {
            my ($type, $tag) = @$token;
            # Only start ("S") and end ("E") tokens for <table> change the depth
            next unless ($type eq "S" || $type eq "E") && lc($tag) eq "table";
            $depth++ if $type eq "S";
            $depth-- if $type eq "E";
            print "$depth\n";
        }
        Try out the code above and see where it takes you
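        One possible direction from there, as a sketch only: keep a stack of buffers instead of a bare counter, so each table's markup is collected on its own and can be emitted in the order asked for (level 2, then deeper levels, then level 1). The name flatten_tables is hypothetical, and two choices here are assumptions rather than anything stated in the thread: tables at the same depth stay in document order, and an outer table is emitted without the tables that were nested inside it. Text outside any table and unclosed tables are simply dropped.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns one string per table, ordered
# level 2, level 3, ... deepest level, then level 1.
sub flatten_tables {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse input";

    my @open;   # one buffer per currently open table, outermost first
    my %done;   # depth (1 = outermost) => finished tables at that depth
    my @out;

    while (my $token = $p->get_token) {
        my $type     = $token->[0];
        my $is_table = ($type eq 'S' || $type eq 'E') && $token->[1] eq 'table';
        # The original source text is the last element, except for text tokens
        my $raw      = $type eq 'T' ? $token->[1] : $token->[-1];

        push @open, '' if $is_table && $type eq 'S';
        next unless @open;              # ignore anything outside a table
        $open[-1] .= $raw;              # goes to the innermost open table only
        next unless $is_table && $type eq 'E';

        my $depth = @open;
        push @{ $done{$depth} }, pop @open;
        next if @open;                  # an outer table is still open

        # Outermost table just closed: deeper levels first, level 1 last
        my @deeper = sort { $a <=> $b } grep { $_ > 1 } keys %done;
        push @out, @{ $done{$_} } for @deeper, 1;
        %done = ();
    }
    return @out;
}

# Made-up three-level sample
my $sample = '<table><tr><td>L1<table><tr><td>L2'
           . '<table><tr><td>L3</td></tr></table>'
           . '</td></tr></table></td></tr></table>';
print "$_\n" for flatten_tables($sample);
```

        Because nested markup is only ever appended to the innermost open buffer, the outer tables come out already stripped of their inner tables, which avoids the renaming of tags to table_level1, table_level2 and so on.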

        print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."