PerlMonks  

Finding data in html file

by Griffler (Novice)
on Sep 21, 2007 at 19:37 UTC ( [id://640430]=perlquestion )

Griffler has asked for the wisdom of the Perl Monks concerning the following question:

I have this file that looks like this:
<table id="a" border="1" bordercolor="#333366" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td width="33%" class="clsTableBody" valign="top" id="firstCol">
<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>
<a href="pdf\2aae5e89-4370-4b31-bbbf-c6c39d5761bd.pdf" target="_blank">Addison, Cheryl</a><br/><span>1958-05-26</span><br/>
<a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>1953-06-15</span><br/>
<a href="pdf\67ac0d08-f295-4e37-bdf3-9e2c22a6f579.pdf" target="_blank">Albert, Carole</a><br/><span>1968-01-30</span><br/>
<a href="pdf\1947e19b-0073-4b9a-94c4-176e4b15b4bb.pdf" target="_blank">Albert, Franklin</a><br/><span>1957-10-22</span><br/>
<a href="pdf\b3a1661d-6e51-49e4-835f-385a8880307a.pdf" target="_blank">Albert, Gaston</a><br/><span>1951-12-18</span><br/>
<a href="pdf\8a32f738-acb9-446a-aab6-e352887cbb25.pdf" target="_blank">Albert, Kim</a><br/><span>1967-08-08</span><br/>
<a href="pdf\4e3d1d45-da1b-4af9-9450-3bfd4df96f1c.pdf" target="_blank">Alchorn, James A.</a><br/><span>1961-10-30</span><br/>
<a href="pdf\855dee39-5573-4610-b26e-2567211ca01a.pdf" target="_blank">Allaby, Sue Ellen</a><br/><span>1964-09-13</span><br/>
<a href="pdf\447ebebb-8c68-4670-83ad-ebcc43e0eccc.pdf" target="_blank">Allain, Jack</a><br/><span>1944-11-20</span><br/>
<a href="pdf\e7fcbace-d956-4e3a-b78a-d0c737c21cf5.pdf" target="_blank">Allain, John J.</a><br/><span>1965-09-02</span><br/>
<a href="pdf\6864afc6-1d68-453c-940e-8d6c4500e15d.pdf" target="_blank">Allain, Rheal</a><br/><span>1954-08-01</span><br/>
<a href="pdf\0c345afa-5bd4-43ba-b2fd-7b93398c58e4.pdf" target="_blank">Allain, Rosario</a><br/><span>1951-10-27</span><br/>
<a href="pdf\690239be-3d99-4f42-b4b0-669ef2e5df85.pdf" target="_blank">Allen, Graham</a><br/><span>1965-02-20</span><br/>
<a href="pdf\9c52b0f8-8342-4658-ab20-e85c2c1f1ee9.pdf" target="_blank">Allen, John W.</a><br/><span>1950-11-15</span><br/>
<a href="pdf\a87ea9c5-ad8f-4c81-a07c-f5db87142754.pdf" target="_blank">Allen, Kevin</a><br/><span>1975-12-23</span><br/>
<a href="pdf\952ad3be-4381-455a-a5bd-0fccc1608a66.pdf" target="_blank">Allen, Patricia</a><br/><span>1952-03-30</span><br/>
<a href="pdf\1c268749-7d37-4055-9498-73d9f1d4d7bb.pdf" target="_blank">Allen, Paulette S.</a><br/><span>1951-12-27</span><br/>
<a href="pdf\7893d592-92ec-4c62-95eb-34e28f361fa5.pdf" target="_blank">Allen, W. Doug</a><br/><span>1946-09-28</span><br/>
<a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">Allison, Gary Owen</a><br/><span>1958-10-02</span><br/>
<a href="pdf\deebbfaf-3e1d-4342-8bde-7aa3128d7142.pdf" target="_blank">Alward, Haldean</a><br/><span>1950-11-04</span><br/>
<a href="pdf\eb5ec017-c6a5-4120-bb5c-bb1b2248cf34.pdf" target="_blank">Amos, Alfred</a><br/><span>1950-05-31</span><br/>
</td>
<td width="34%" class="clsTableBodyClear" valign="top" id="secondCol">
<a href="pdf\fe74c873-d3ed-4053-960b-e735538167b6.pdf" target="_blank">Amos, Allen R.</a><br/><span>1950-05-31</span><br/>
<a href="pdf\575c388c-42b0-400e-80ca-b3cec73df9cc.pdf" target="_blank">Anderson, C. Frederick</a><br/><span>1951-11-26</span><br/>
<a href="pdf\ea944df7-9da1-4953-b200-5f06876160a8.pdf" target="_blank">Anderson, Jason</a><br/><span>1970-01-27</span><br/>
<a href="pdf\61226f6e-838a-437c-9cb9-fa13f805684a.pdf" target="_blank">Anderson, Joan</a><br/><span>1948-10-02</span><br/>
<a href="pdf\31c3c9a3-79db-4e2d-b092-346e23a6f857.pdf" target="_blank">Anthony, Elizabeth</a><br/><span>1961-02-23</span><br/>
<a href="pdf\70c758ce-2d38-4af1-9512-7382def72ac1.pdf" target="_blank">Anthony, Stephen</a><br/><span>1965-05-12</span><br/>
<a href="pdf\2aee840d-acb9-46c0-ad8c-4dec2fca67e5.pdf" target="_blank">Arbeau, Bev</a><br/><span>1948-07-09</span><br/>
<a href="pdf\4f054b87-fc9d-4d1e-b3b3-d51e4bfac9a0.pdf" target="_blank">Arbeau, Marshall</a><br/><span>1949-09-05</span><br/>
<a href="pdf\794a7e41-71c4-4736-a184-3a048d85a283.pdf" target="_blank">Arbeau, Ricky C. B.</a><br/><span>1960-02-08</span><br/>
<a href="pdf\f6180b5e-17d2-48c9-a4a8-aee3ef874c77.pdf" target="_blank">Armstrong, David J.</a><br/><span>1956-09-09</span><br/>
<a href="pdf\5fd4d40b-53e4-4330-8dca-f21469b2d2e5.pdf" target="_blank">Armstrong, Jo-Ann</a><br/><span>1955-08-08</span><br/>
<a href="pdf\ac6fdde6-e68e-4c81-ae10-cd935fe18ee9.pdf" target="_blank">Armstrong, Richard</a><br/><span>1954-12-26</span><br/>
<a href="pdf\efcec029-990e-4d52-8d40-d67434f4c251.pdf" target="_blank">Arpin, Pierre</a><br/><span>1954-11-20</span><br/>
<a href="pdf\0d5af644-0b4d-4609-93a7-eac1a4ae2b9d.pdf" target="_blank">Arsenault, Antonio J.</a><br/><span>1965-08-17</span><br/>
<a href="pdf\03aa785d-c848-49aa-a5b2-830e15f97028.pdf" target="_blank">Arsenault, Charles</a><br/><span>1958-12-27</span><br/>
<a href="pdf\146f090b-faff-4721-8b13-0242e1ec50bf.pdf" target="_blank">Arsenault, David J.</a><br/><span>1947-06-10</span><br/>
<a href="pdf\dbe2e5d1-74f5-4b47-bb18-bb40da265a66.pdf" target="_blank">Arsenault, Donald</a><br/><span>1959-11-05</span><br/>
<a href="pdf\2a03629f-e723-43dd-b10c-4a55dbdb7974.pdf" target="_blank">Arsenault, Earl</a><br/><span>1949-03-23</span><br/>
<a href="pdf\b05ec9e4-2142-457f-9c24-d842157b57da.pdf" target="_blank">Arsenault, Jacqueline J.</a><br/><span>1954-12-21</span><br/>
<a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">Arsenault, Michael</a><br/><span>1953-02-25</span><br/>
<a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">Arsenault, Normand A.</a><br/><span>1957-02-10</span><br/>
<a href="pdf\d964a7e1-4e8d-459c-812b-2124a6896438.pdf" target="_blank">Arsenault, Rheal</a><br/><span>1952-12-08</span><br/>
</td>
<td width="33%" class="clsTableBody" valign="top" id="thirdCol">
<a href="pdf\0d5ad8f8-98da-4abb-b69d-36bb98d7f6c4.pdf" target="_blank">Arsenault, Thaddée</a><br/><span>1958-08-01</span><br/>
<a href="pdf\f582adad-82c8-44c2-ab38-18edfcf1e366.pdf" target="_blank">Arseneau, Bruno J.</a><br/><span>1963-06-21</span><br/>
<a href="pdf\21e4463d-d4ed-4ca6-9342-01c011df1683.pdf" target="_blank">Arseneau, Diane</a><br/><span>1959-01-18</span><br/>
<a href="pdf\f563075f-e3c9-4efa-8baa-ad467ed5f179.pdf" target="_blank">Arseneau, Paolo</a><br/><span>1970-12-17</span><br/>
<a href="pdf\7e46cdc1-db26-4f87-8dd0-c25e3ac8dd3c.pdf" target="_blank">Arseneau, Reynald</a><br/><span>1954-04-17</span><br/>
<a href="pdf\24a035cc-b1e6-4d62-8104-6d57e498ed6c.pdf" target="_blank">Arseneau, Romeo</a><br/><span>1950-01-15</span><br/>
<a href="pdf\98a24d60-a90c-459c-b904-e4fd30ff2bee.pdf" target="_blank">Arseneau, Steve</a><br/><span>1978-02-09</span><br/>
<a href="pdf\bb27003b-fbda-4336-b37c-00e7133eb5da.pdf" target="_blank">Arseneault, Albert J.</a><br/><span>1944-07-16</span><br/>
<a href="pdf\8f2ea821-40c0-433b-88a3-760d7d5244b6.pdf" target="_blank">Arseneault, France</a><br/><span>1949-07-13</span><br/>
<a href="pdf\9b6f871f-bed9-44d4-99f9-1136609d04ed.pdf" target="_blank">Arseneault, Gilles R.</a><br/><span>1953-11-10</span><br/>
<a href="pdf\bbf2e79b-d345-46a2-9576-933ad710af0f.pdf" target="_blank">Asoyuf, Charles</a><br/><span>1958-05-13</span><br/>
<a href="pdf\3dfde6e6-89fb-49f1-a5c8-da3b594d1361.pdf" target="_blank">Atkinson, Michael</a><br/><span>1964-09-07</span><br/>
<a href="pdf\2eba7345-a583-4453-9639-c7a62e97c2f3.pdf" target="_blank">Aube, Marilyn</a><br/><span>1960-12-14</span><br/>
<a href="pdf\19a9a977-65df-4d7e-8e4d-8c71592ebea1.pdf" target="_blank">Aube, Ronald</a><br/><span>1955-10-08</span><br/>
<a href="pdf\fb4c734e-5d3e-48bf-a13b-9623bf850851.pdf" target="_blank">Aubie, Kenneth J.</a><br/><span>1943-03-22</span><br/>
<a href="pdf\6dbc479e-1177-49a4-ae0f-8386ac383a46.pdf" target="_blank">Aubin, Bertrand</a><br/><span>1954-07-07</span><br/>
<a href="pdf\73635a37-2292-4ba6-a9db-126ca159d75c.pdf" target="_blank">Aubin, Weeda</a><br/><span>1954-04-06</span><br/>
<a href="pdf\1a35deb4-15c9-4ace-b81e-e1a73ab038cb.pdf" target="_blank">Aubut, Guy</a><br/><span>1959-05-28</span><br/>
<a href="pdf\bc2e6eec-2861-41e4-a812-ce2f36f97738.pdf" target="_blank">Austin, Roger</a><br/><span>1946-01-12</span><br/>
<a href="pdf\4487ff07-6115-448e-b340-4c2c4a3d0ec4.pdf" target="_blank">Avery, Karin</a><br/><span>1962-03-17</span><br/>
</td>
</tr>
</table>
I also have this piece of code
#!/usr/bin/perl -w
use Env;

my $re = qr/<a[^>]*          # an anchor tag
    href=                    # the Href in the anchor
    (["'])((?:(?!\1).)*)\1   # the value in the href
    [^>]*>                   # anything to the end of the anchor
    [^<>]*                   # the content in the anchor tag
    <\/a>                    # the end of the anchor
    (?:\s|<[^>]*>)+          # any whitespace or html tags
    (\d{4}-\d{2}-\d{2})      # the 9 digit number
    /isxm ;

open (DATA,$ARGV[0]) ;
my $string = do{local $/; <DATA>};

while($string =~ m/$re/g){
    my $name = $1 ;
    my $STMT_FILE = $2 ;
    my $DOB = $3 ;
    print "Date of Birth is: $DOB -- Statement File: $STMT_FILE\n";
}
My problem is that I also need the name from the anchor tag. If anyone can tell me how to get the name, it would be very much appreciated.

Replies are listed 'Best First'.
Re: Finding data in html file
by perlfan (Vicar) on Sep 21, 2007 at 20:01 UTC
    HTML::Parser? ... if not, there are many others that might fit the bill.
Re: Finding data in html file
by un-chomp (Scribe) on Sep 21, 2007 at 21:21 UTC
    One way to do it with HTML::TokeParser:
    use HTML::TokeParser;

    my $doc = do { local $/; <DATA> };
    my $p = HTML::TokeParser->new( \$doc );

    while ( my $token = $p->get_tag("a") ) {
        my $url  = $token->[1]{href};
        my $text = $p->get_trimmed_text("/a");
        print "Link is: $url\n";
        print "Name is: $text\n\n";
    }
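    The same loop can probably be stretched to pick up the date as well. What follows is only a sketch: it assumes the date always sits in the first <span> after each anchor, as it does in the posted sample, and it uses a two-entry sample string in place of the real file. The @rows array is a name made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Two entries in the same shape as the posted table (sample data only).
my $doc = <<'HTML';
<td><a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>
<a href="pdf\2aae5e89-4370-4b31-bbbf-c6c39d5761bd.pdf" target="_blank">Addison, Cheryl</a><br/><span>1958-05-26</span><br/></td>
HTML

my $p = HTML::TokeParser->new( \$doc );
my @rows;    # hypothetical container: one [ name, date, file ] triple per entry

while ( my $anchor = $p->get_tag("a") ) {
    my $file = $anchor->[1]{href};
    my $name = $p->get_trimmed_text("/a");

    # Assumption: the date is in the first <span> following the anchor.
    $p->get_tag("span") or last;
    my $dob = $p->get_trimmed_text("/span");

    push @rows, [ $name, $dob, $file ];
}

print join( " | ", @$_ ), "\n" for @rows;
```

    If an anchor were ever missing its <span>, this sketch would pair it with the next entry's date, so the real file may need a stricter check.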
Re: Finding data in html file
by ww (Archbishop) on Sep 21, 2007 at 21:58 UTC
    Please consider posting a minimal case, next time you have a question.

    The html contains a far larger data sample than we need to help with your perl question.

Re: Finding data in html file
by Skeeve (Parson) on Sep 21, 2007 at 20:03 UTC
    You mean the text of the anchor?
    my $re = qr/<a[^>]*          # an anchor tag
        href=                    # the Href in the anchor
        (["'])((?:(?!\1).)*)\1   # the value in the href
        [^>]*>                   # anything to the end of the anchor
        ([^<>]*)                 # Set brackets around it and get it as capture group 3
        <\/a>                    # the end of the anchor
        (?:\s|<[^>]*>)+?         # I THINK you need the ? here, otherwise you would slurp everything up to the last date
        (\d{4}-\d{2}-\d{2})      # the 9 digit number
        /isxm ;

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
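
    A quick way to check which number each pair of parentheses gets is to run the pattern against a single entry and print the groups in order. This is a minimal sketch using one entry copied from the posted table; the @groups array is just an illustrative name. Because the quote character is itself captured, the anchor text should come out as group 3 and the date as group 4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One entry in the same shape as the posted table (sample data, not the full file).
my $html = '<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">'
         . 'Abbott, Evelyn</a><br/><span>1953-05-28</span><br/>';

my $re = qr/<a[^>]*
    href=
    (["'])                 # group 1: the opening quote character
    ((?:(?!\1).)*)\1       # group 2: the href value
    [^>]*>
    ([^<>]*)               # group 3: the anchor text, i.e. the name
    <\/a>
    (?:\s|<[^>]*>)+
    (\d{4}-\d{2}-\d{2})    # group 4: the date
    /isx;

# In list context a successful match returns the captured groups in order.
my @groups = $html =~ $re;
printf "group %d: %s\n", $_ + 1, $groups[$_] for 0 .. $#groups;
```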
Re: Finding data in html file
by scorpio17 (Canon) on Sep 21, 2007 at 21:29 UTC
    Try this version:
    use Env;

    my $re = qr/<a[^>]*          # an anchor tag
        href=                    # the Href in the anchor
        (["'])((?:(?!\1).)*)\1   # the value in the href
        [^>]*>                   # anything to the end of the anchor
        ([^<>]*)                 # <- ADDED () HERE!
        <\/a>                    # the end of the anchor
        (?:\s|<[^>]*>)+          # any whitespace or html tags
        (\d{4}-\d{2}-\d{2})      # the 9 digit number
        /isxm ;

    open (DATA,$ARGV[0]) ;
    my $string = do{local $/; <DATA>};

    while($string =~ m/$re/g){
        my $name = $3 ;              # ADDED THIS
        my $STMT_FILE = $2 ;
        $STMT_FILE =~ s/\s+//g;
        my $DOB = $4 ;               # ADDED THIS
        print "name: $name\n";
        print "Date of Birth is: $DOB\n";
        print "Statement File: $STMT_FILE\n";
        print "\n";
    }
      Thanks this works great.
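
    Since the trouble in this thread came from counting parentheses, one alternative is to name the captures so the numbering stops mattering. The sketch below assumes Perl 5.10 or later (named captures and the %+ hash) and runs on a two-entry sample in place of the real file. The helper extract_records and the group names quote, file, name and dob are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Hypothetical helper: returns one hash per anchor and date pair found in $html.
sub extract_records {
    my ($html) = @_;
    my $re = qr/<a[^>]*
        href=
        (?<quote>["'])                   # opening quote
        (?<file>(?:(?!\k<quote>).)*)     # href value, up to the matching quote
        \k<quote>
        [^>]*>
        (?<name>[^<>]*)                  # anchor text
        <\/a>
        (?:\s|<[^>]*>)+                  # whitespace or tags before the date
        (?<dob>\d{4}-\d{2}-\d{2})        # date, YYYY-MM-DD
        /isx;

    my @records;
    while ( $html =~ /$re/g ) {
        push @records, { name => $+{name}, file => $+{file}, dob => $+{dob} };
    }
    return @records;
}

# Sample data only: two entries copied from the posted table.
my $sample = join '',
    '<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a>',
    '<br/><span>1953-05-28</span><br/>',
    '<a href="pdf\2aae5e89-4370-4b31-bbbf-c6c39d5761bd.pdf" target="_blank">Addison, Cheryl</a>',
    '<br/><span>1958-05-26</span><br/>';

my @records = extract_records($sample);
print "$_->{name} | $_->{dob} | $_->{file}\n" for @records;
```

    Adding or removing a group later then leaves the rest of the code untouched, which the numbered version cannot promise.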

Node Type: perlquestion [id://640430]
Approved by Joost