Finding data in html file

by Griffler (Novice)
on Sep 21, 2007 at 19:37 UTC ( #640430=perlquestion )

Griffler has asked for the wisdom of the Perl Monks concerning the following question:

I have this file that looks like this:
<table id="a" border="1" bordercolor="#333366" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td width="33%" class="clsTableBody" valign="top" id="firstCol">
<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>
<a href="pdf\2aae5e89-4370-4b31-bbbf-c6c39d5761bd.pdf" target="_blank">Addison, Cheryl</a><br/><span>1958-05-26</span><br/>
<a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>1953-06-15</span><br/>
<a href="pdf\67ac0d08-f295-4e37-bdf3-9e2c22a6f579.pdf" target="_blank">Albert, Carole</a><br/><span>1968-01-30</span><br/>
<a href="pdf\1947e19b-0073-4b9a-94c4-176e4b15b4bb.pdf" target="_blank">Albert, Franklin</a><br/><span>1957-10-22</span><br/>
<a href="pdf\b3a1661d-6e51-49e4-835f-385a8880307a.pdf" target="_blank">Albert, Gaston</a><br/><span>1951-12-18</span><br/>
<a href="pdf\8a32f738-acb9-446a-aab6-e352887cbb25.pdf" target="_blank">Albert, Kim</a><br/><span>1967-08-08</span><br/>
<a href="pdf\4e3d1d45-da1b-4af9-9450-3bfd4df96f1c.pdf" target="_blank">Alchorn, James A.</a><br/><span>1961-10-30</span><br/>
<a href="pdf\855dee39-5573-4610-b26e-2567211ca01a.pdf" target="_blank">Allaby, Sue Ellen</a><br/><span>1964-09-13</span><br/>
<a href="pdf\447ebebb-8c68-4670-83ad-ebcc43e0eccc.pdf" target="_blank">Allain, Jack</a><br/><span>1944-11-20</span><br/>
<a href="pdf\e7fcbace-d956-4e3a-b78a-d0c737c21cf5.pdf" target="_blank">Allain, John J.</a><br/><span>1965-09-02</span><br/>
<a href="pdf\6864afc6-1d68-453c-940e-8d6c4500e15d.pdf" target="_blank">Allain, Rheal</a><br/><span>1954-08-01</span><br/>
<a href="pdf\0c345afa-5bd4-43ba-b2fd-7b93398c58e4.pdf" target="_blank">Allain, Rosario</a><br/><span>1951-10-27</span><br/>
<a href="pdf\690239be-3d99-4f42-b4b0-669ef2e5df85.pdf" target="_blank">Allen, Graham</a><br/><span>1965-02-20</span><br/>
<a href="pdf\9c52b0f8-8342-4658-ab20-e85c2c1f1ee9.pdf" target="_blank">Allen, John W.</a><br/><span>1950-11-15</span><br/>
<a href="pdf\a87ea9c5-ad8f-4c81-a07c-f5db87142754.pdf" target="_blank">Allen, Kevin</a><br/><span>1975-12-23</span><br/>
<a href="pdf\952ad3be-4381-455a-a5bd-0fccc1608a66.pdf" target="_blank">Allen, Patricia</a><br/><span>1952-03-30</span><br/>
<a href="pdf\1c268749-7d37-4055-9498-73d9f1d4d7bb.pdf" target="_blank">Allen, Paulette S.</a><br/><span>1951-12-27</span><br/>
<a href="pdf\7893d592-92ec-4c62-95eb-34e28f361fa5.pdf" target="_blank">Allen, W. Doug</a><br/><span>1946-09-28</span><br/>
<a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">Allison, Gary Owen</a><br/><span>1958-10-02</span><br/>
<a href="pdf\deebbfaf-3e1d-4342-8bde-7aa3128d7142.pdf" target="_blank">Alward, Haldean</a><br/><span>1950-11-04</span><br/>
<a href="pdf\eb5ec017-c6a5-4120-bb5c-bb1b2248cf34.pdf" target="_blank">Amos, Alfred</a><br/><span>1950-05-31</span><br/>
</td>
<td width="34%" class="clsTableBodyClear" valign="top" id="secondCol">
<a href="pdf\fe74c873-d3ed-4053-960b-e735538167b6.pdf" target="_blank">Amos, Allen R.</a><br/><span>1950-05-31</span><br/>
<a href="pdf\575c388c-42b0-400e-80ca-b3cec73df9cc.pdf" target="_blank">Anderson, C. Frederick</a><br/><span>1951-11-26</span><br/>
<a href="pdf\ea944df7-9da1-4953-b200-5f06876160a8.pdf" target="_blank">Anderson, Jason</a><br/><span>1970-01-27</span><br/>
<a href="pdf\61226f6e-838a-437c-9cb9-fa13f805684a.pdf" target="_blank">Anderson, Joan</a><br/><span>1948-10-02</span><br/>
<a href="pdf\31c3c9a3-79db-4e2d-b092-346e23a6f857.pdf" target="_blank">Anthony, Elizabeth</a><br/><span>1961-02-23</span><br/>
<a href="pdf\70c758ce-2d38-4af1-9512-7382def72ac1.pdf" target="_blank">Anthony, Stephen</a><br/><span>1965-05-12</span><br/>
<a href="pdf\2aee840d-acb9-46c0-ad8c-4dec2fca67e5.pdf" target="_blank">Arbeau, Bev</a><br/><span>1948-07-09</span><br/>
<a href="pdf\4f054b87-fc9d-4d1e-b3b3-d51e4bfac9a0.pdf" target="_blank">Arbeau, Marshall</a><br/><span>1949-09-05</span><br/>
<a href="pdf\794a7e41-71c4-4736-a184-3a048d85a283.pdf" target="_blank">Arbeau, Ricky C.B.</a><br/><span>1960-02-08</span><br/>
<a href="pdf\f6180b5e-17d2-48c9-a4a8-aee3ef874c77.pdf" target="_blank">Armstrong, David J.</a><br/><span>1956-09-09</span><br/>
<a href="pdf\5fd4d40b-53e4-4330-8dca-f21469b2d2e5.pdf" target="_blank">Armstrong, Jo-Ann</a><br/><span>1955-08-08</span><br/>
<a href="pdf\ac6fdde6-e68e-4c81-ae10-cd935fe18ee9.pdf" target="_blank">Armstrong, Richard</a><br/><span>1954-12-26</span><br/>
<a href="pdf\efcec029-990e-4d52-8d40-d67434f4c251.pdf" target="_blank">Arpin, Pierre</a><br/><span>1954-11-20</span><br/>
<a href="pdf\0d5af644-0b4d-4609-93a7-eac1a4ae2b9d.pdf" target="_blank">Arsenault, Antonio J.</a><br/><span>1965-08-17</span><br/>
<a href="pdf\03aa785d-c848-49aa-a5b2-830e15f97028.pdf" target="_blank">Arsenault, Charles</a><br/><span>1958-12-27</span><br/>
<a href="pdf\146f090b-faff-4721-8b13-0242e1ec50bf.pdf" target="_blank">Arsenault, David J.</a><br/><span>1947-06-10</span><br/>
<a href="pdf\dbe2e5d1-74f5-4b47-bb18-bb40da265a66.pdf" target="_blank">Arsenault, Donald</a><br/><span>1959-11-05</span><br/>
<a href="pdf\2a03629f-e723-43dd-b10c-4a55dbdb7974.pdf" target="_blank">Arsenault, Earl</a><br/><span>1949-03-23</span><br/>
<a href="pdf\b05ec9e4-2142-457f-9c24-d842157b57da.pdf" target="_blank">Arsenault, Jacqueline J.</a><br/><span>1954-12-21</span><br/>
<a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">Arsenault, Michael</a><br/><span>1953-02-25</span><br/>
<a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">Arsenault, Normand A.</a><br/><span>1957-02-10</span><br/>
<a href="pdf\d964a7e1-4e8d-459c-812b-2124a6896438.pdf" target="_blank">Arsenault, Rheal</a><br/><span>1952-12-08</span><br/>
</td>
<td width="33%" class="clsTableBody" valign="top" id="thirdCol">
<a href="pdf\0d5ad8f8-98da-4abb-b69d-36bb98d7f6c4.pdf" target="_blank">Arsenault, Thaddée</a><br/><span>1958-08-01</span><br/>
<a href="pdf\f582adad-82c8-44c2-ab38-18edfcf1e366.pdf" target="_blank">Arseneau, Bruno J.</a><br/><span>1963-06-21</span><br/>
<a href="pdf\21e4463d-d4ed-4ca6-9342-01c011df1683.pdf" target="_blank">Arseneau, Diane</a><br/><span>1959-01-18</span><br/>
<a href="pdf\f563075f-e3c9-4efa-8baa-ad467ed5f179.pdf" target="_blank">Arseneau, Paolo</a><br/><span>1970-12-17</span><br/>
<a href="pdf\7e46cdc1-db26-4f87-8dd0-c25e3ac8dd3c.pdf" target="_blank">Arseneau, Reynald</a><br/><span>1954-04-17</span><br/>
<a href="pdf\24a035cc-b1e6-4d62-8104-6d57e498ed6c.pdf" target="_blank">Arseneau, Romeo</a><br/><span>1950-01-15</span><br/>
<a href="pdf\98a24d60-a90c-459c-b904-e4fd30ff2bee.pdf" target="_blank">Arseneau, Steve</a><br/><span>1978-02-09</span><br/>
<a href="pdf\bb27003b-fbda-4336-b37c-00e7133eb5da.pdf" target="_blank">Arseneault, Albert J.</a><br/><span>1944-07-16</span><br/>
<a href="pdf\8f2ea821-40c0-433b-88a3-760d7d5244b6.pdf" target="_blank">Arseneault, France</a><br/><span>1949-07-13</span><br/>
<a href="pdf\9b6f871f-bed9-44d4-99f9-1136609d04ed.pdf" target="_blank">Arseneault, Gilles R.</a><br/><span>1953-11-10</span><br/>
<a href="pdf\bbf2e79b-d345-46a2-9576-933ad710af0f.pdf" target="_blank">Asoyuf, Charles</a><br/><span>1958-05-13</span><br/>
<a href="pdf\3dfde6e6-89fb-49f1-a5c8-da3b594d1361.pdf" target="_blank">Atkinson, Michael</a><br/><span>1964-09-07</span><br/>
<a href="pdf\2eba7345-a583-4453-9639-c7a62e97c2f3.pdf" target="_blank">Aube, Marilyn</a><br/><span>1960-12-14</span><br/>
<a href="pdf\19a9a977-65df-4d7e-8e4d-8c71592ebea1.pdf" target="_blank">Aube, Ronald</a><br/><span>1955-10-08</span><br/>
<a href="pdf\fb4c734e-5d3e-48bf-a13b-9623bf850851.pdf" target="_blank">Aubie, Kenneth J.</a><br/><span>1943-03-22</span><br/>
<a href="pdf\6dbc479e-1177-49a4-ae0f-8386ac383a46.pdf" target="_blank">Aubin, Bertrand</a><br/><span>1954-07-07</span><br/>
<a href="pdf\73635a37-2292-4ba6-a9db-126ca159d75c.pdf" target="_blank">Aubin, Weeda</a><br/><span>1954-04-06</span><br/>
<a href="pdf\1a35deb4-15c9-4ace-b81e-e1a73ab038cb.pdf" target="_blank">Aubut, Guy</a><br/><span>1959-05-28</span><br/>
<a href="pdf\bc2e6eec-2861-41e4-a812-ce2f36f97738.pdf" target="_blank">Austin, Roger</a><br/><span>1946-01-12</span><br/>
<a href="pdf\4487ff07-6115-448e-b340-4c2c4a3d0ec4.pdf" target="_blank">Avery, Karin</a><br/><span>1962-03-17</span><br/>
</td>
</tr>
</table>
I also have this piece of code:
#!/usr/bin/perl -w
use Env;
my $re = qr/<a[^>]*          # an anchor tag
    href=                    # the Href in the anchor
    (["'])((?:(?!\1).)*)\1   # the value in the href
    [^>]*>                   # anything to the end of the anchor
    [^<>]*                   # the content in the anchor tag
    <\/a>                    # the end of the anchor
    (?:\s|<[^>]*>)+          # any whitespace or html tags
    (\d{4}-\d{2}-\d{2})      # the date, YYYY-MM-DD
    /isxm ;
open (DATA,$ARGV[0]) ;
my $string = do{local $/; <DATA>};
while($string =~ m/$re/g){
    my $name = $1 ;
    my $STMT_FILE = $2 ;
    my $DOB = $3 ;
    print "Date of Birth is: $DOB -- Statement File: $STMT_FILE\n";
}
My problem is that I also need the name from the anchor tag. If anyone can tell me how to get the name, it would be very much appreciated.

Replies are listed 'Best First'.
Re: Finding data in html file
by perlfan (Vicar) on Sep 21, 2007 at 20:01 UTC
    HTML::Parser? ... if not, there are many others that might fit the bill.
Re: Finding data in html file
by un-chomp (Scribe) on Sep 21, 2007 at 21:21 UTC
    One way to do it with HTML::TokeParser:
    use HTML::TokeParser;

    my $doc = do { local $/; <DATA> };
    my $p = HTML::TokeParser->new( \$doc );

    while ( my $token = $p->get_tag("a") ) {
        my $url  = $token->[1]{href};
        my $text = $p->get_trimmed_text("/a");
        print "Link is: $url\n";
        print "Name is: $text\n\n";
    }
Re: Finding data in html file
by ww (Archbishop) on Sep 21, 2007 at 21:58 UTC
    Please consider posting a minimal case, next time you have a question.

    The html contains a far larger data sample than we need to help with your perl question.

Re: Finding data in html file
by Skeeve (Parson) on Sep 21, 2007 at 20:03 UTC
    You mean the text of the anchor?
    my $re = qr/<a[^>]*          # an anchor tag
        href=                    # the Href in the anchor
        (["'])((?:(?!\1).)*)\1   # the value in the href
        [^>]*>                   # anything to the end of the anchor
        ([^<>]*)                 # Set brackets around it and get it as capture 3 (1 is the quote, 2 the href)
        <\/a>                    # the end of the anchor
        (?:\s|<[^>]*>)+?         # I THINK you need the ? here, otherwise you would slurp everything up to the last date
        (\d{4}-\d{2}-\d{2})      # the date, YYYY-MM-DD
        /isxm ;

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
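
    A note on the numbering, since it is easy to trip over: Perl numbers capture groups by the position of their opening parenthesis, and (?:...) groups are not counted. The quote character around the href value is therefore group 1, the href value group 2, the anchor text group 3, and the date group 4. Here is a minimal sketch on a single record copied from the sample; the helper name parse_record is made up for this illustration and is not part of anyone's posted code.

```perl
use strict;
use warnings;

# One record from the sample data in the question.
my $record = '<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">'
           . 'Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>';

# parse_record is a hypothetical helper: it returns the four captures in order.
sub parse_record {
    my ($html) = @_;
    my @groups = $html =~ m{
        <a[^>]*
        href=
        (["'])               # group 1: the opening quote character
        ((?:(?!\1).)*)       # group 2: the href value
        \1
        [^>]*>
        ([^<>]*)             # group 3: the anchor text, i.e. the name
        </a>
        (?:\s|<[^>]*>)+      # non-capturing, so it gets no number
        (\d{4}-\d{2}-\d{2})  # group 4: the date
    }isx;
    return @groups;
}

my ($quote, $file, $name, $dob) = parse_record($record);
print "quote=$quote file=$file name=$name dob=$dob\n";
```

    In list context the match returns the captures in group order, which makes the mapping easy to check by eye.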
Re: Finding data in html file
by scorpio17 (Canon) on Sep 21, 2007 at 21:29 UTC
    Try this version:
    use Env;
    my $re = qr/<a[^>]*          # an anchor tag
        href=                    # the Href in the anchor
        (["'])((?:(?!\1).)*)\1   # the value in the href
        [^>]*>                   # anything to the end of the anchor
        ([^<>]*)                 # <- ADDED () HERE!
        <\/a>                    # the end of the anchor
        (?:\s|<[^>]*>)+          # any whitespace or html tags
        (\d{4}-\d{2}-\d{2})      # the date, YYYY-MM-DD
        /isxm ;
    open (DATA,$ARGV[0]) ;
    my $string = do{local $/; <DATA>};
    while($string =~ m/$re/g){
        my $name = $3 ;            # ADDED THIS
        my $STMT_FILE = $2 ;
        $STMT_FILE =~ s/\s+//g;
        my $DOB = $4 ;             # ADDED THIS
        print "name: $name\n";
        print "Date of Birth is: $DOB\n";
        print "Statement File: $STMT_FILE\n";
        print "\n";
    }
      Thanks, this works great.
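
For anyone who would rather not count parentheses at all, named captures (available since Perl 5.10) let each field be fetched by name through %+. The following is a variation on the pattern posted in this thread, not a replacement anyone here proposed. The helper name extract_records and the capture names are illustrative assumptions, and the sketch reads from a string instead of $ARGV[0] so that it runs without an input file.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

my $re = qr{
    <a\b[^>]*?
    \bhref=
    (?<q>["'])                    # opening quote
    (?<file>(?:(?!\k<q>).)*)      # href value, up to the matching quote
    \k<q>
    [^>]*>
    (?<name>[^<>]*)               # anchor text
    </a>
    (?:\s|<[^>]*>)+               # whitespace or tags between link and date
    (?<dob>\d{4}-\d{2}-\d{2})     # date as YYYY-MM-DD
}isx;

# extract_records is a hypothetical helper: one hash ref per matched record.
sub extract_records {
    my ($html) = @_;
    my @records;
    while ( $html =~ /$re/g ) {
        my %rec = %+;             # copy the named captures before any other match
        $rec{name} =~ s/\s+/ /g;  # collapse whitespace inside a wrapped name
        delete $rec{q};           # the quote character is of no further use
        push @records, \%rec;
    }
    return @records;
}

# Two records taken from the sample in the question.
my $sample = <<'HTML';
<td><a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>
<a href="pdf\2aae5e89-4370-4b31-bbbf-c6c39d5761bd.pdf" target="_blank">Addison, Cheryl</a><br/><span>1958-05-26</span><br/></td>
HTML

for my $rec ( extract_records($sample) ) {
    print "name: $rec->{name}\n";
    print "Date of Birth is: $rec->{dob}\n";
    print "Statement File: $rec->{file}\n\n";
}
```

Copying %+ into a plain hash first matters because the substitution on the name is itself a successful match and would overwrite the captures.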
