parsing html

qingxia has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: parsing html by tobyink (Canon) on Mar 21, 2013 at 21:24 UTC
McA's answer is technically correct, but going down the regexp route is likely to cause you more pain further down the road. For example, have you considered HTML where a greater-than sign legitimately occurs in an attribute? `<td title="n > 5">n greater than 5</td>` Are you aware that the `</td>` closing tag is optional (as per the HTML 3.2, HTML 4 and HTML 5 specs)? So the following is legitimate: `<tr> <td>1 <td>2 <td>3</td> </tr>` You're better off using one of the many HTML parsing modules on CPAN, which will already cover these corner cases. `package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name`
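To make the two corner cases above concrete, here is a minimal sketch using only core Perl. The pattern in `$naive` is an assumption: a non-greedy `<td>` matcher with the `/s` modifier, of the kind suggested elsewhere in this thread. It is not meant as anyone's recommended solution, only as a way to see where a regexp goes wrong.

```perl
use strict;
use warnings;

# Hypothetical naive pattern: non-greedy captures for attributes and content.
my $naive = qr{<td(.*?)>(.*?)</td>}s;

# Case 1: a '>' inside a quoted attribute value ends the "tag" too early.
my $html1 = '<td title="n > 5">n greater than 5</td>';
my ($attrs1, $text1) = $html1 =~ $naive;
print "attributes: [$attrs1]\n";   # the title attribute is cut off at '>'
print "content:    [$text1]\n";    # the rest of the attribute leaks into the content

# Case 2: omitted </td> end tags merge three cells into one match.
my $html2 = '<tr> <td>1 <td>2 <td>3</td> </tr>';
my @texts;
while ($html2 =~ /$naive/g) {
    push @texts, $2;
}
print scalar(@texts), " cell(s) found: [@texts]\n";
```

Both inputs are valid HTML, yet the first yields a truncated attribute and the second yields one cell instead of three. A real HTML parser tracks quoting and implied end tags, which is why it handles these inputs correctly.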
Re: parsing html by kennethk (Abbot) on Mar 21, 2013 at 21:31 UTC
McA's solution will fix your immediate question, but if you are parsing HTML in anything other than an educational or 1-off context, I would suggest you use a CPAN module rather than reinvent the wheel; perhaps HTML::Parser or Mojo::DOM would be helpful. HTML in the wild is notoriously hard to handle in a general way. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re^2: parsing html by qingxia (Novice) on Mar 21, 2013 at 23:06 UTC
And to tobyink and kennethk: it is actually a dataset which I need to prepare for the next stage of analysis. It comes in as several HTML files, and each of them contains a rather stable pattern like: `id xxx borrower xxx date xxx ...` I want to code them into some standard format which can be read by commercial statistical software like Stata, e.g. `id borrower date ... xxx xxxx xxxx` It is a little too time-consuming to do it in Excel, so I switched to Perl, as I would really like to learn it. Learning by doing would be more fun. You can say it is a kind of one-off project, because I will (I hope) not frequently parse HTML, but thank you anyway for the suggestion; I totally agree with you. Best regards, sh
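The reshaping described above (field/value pairs in, one row per record out) could be sketched as follows. Everything here is illustrative: the sample values, the helper name `pairs_to_rows`, and the assumption that every record starts with an `id` field are not from the original post, and the pairs are assumed to have been extracted from the HTML already.

```perl
use strict;
use warnings;

# Hypothetical input: (field, value) pairs in document order.
# Field names follow the post; the values are made-up placeholders.
my @pairs = (
    [ id => '101' ], [ borrower => 'A. Smith' ], [ date => '2013-03-21' ],
    [ id => '102' ], [ borrower => 'B. Jones' ], [ date => '2013-03-22' ],
);

# Turn the pairs into tab-delimited lines: one header line, then one
# line per record. Assumption: an "id" field marks the start of a record.
sub pairs_to_rows {
    my @input = @_;
    my (@columns, %seen, @records);
    for my $pair (@input) {
        my ($field, $value) = @$pair;
        push @columns, $field unless $seen{$field}++;   # keep first-seen column order
        push @records, {} if $field eq 'id' || !@records;
        $records[-1]{$field} = $value;
    }
    my @lines = (join "\t", @columns);
    for my $rec (@records) {
        # Missing fields become empty cells.
        push @lines, join "\t", map { defined $rec->{$_} ? $rec->{$_} : '' } @columns;
    }
    return @lines;
}

print "$_\n" for pairs_to_rows(@pairs);
```

Tab-delimited text is used here only because it is simple to produce and widely importable; a CSV module would be the safer choice if values can contain tabs or newlines.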
Re^3: parsing html by kennethk (Abbot) on Mar 22, 2013 at 14:20 UTC
When I said "1-off context", this is exactly what I meant; a quick script to process 1 set of data. I wholly support your choice of regex for this task. #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re: parsing html by McA (Priest) on Mar 21, 2013 at 20:15 UTC
The regex modifier 's' should do the trick: `$line =~ /<td(.*?)>(.*?)<\/td>/s` McA
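A small sketch of what the `s` modifier changes, assuming the quantifiers in the pattern are non-greedy (`.*?`) and using a made-up two-line cell as sample input. Without `/s`, `.` does not match a newline, so a cell whose content spans lines is not matched at all.

```perl
use strict;
use warnings;

# Hypothetical sample: a cell whose content spans two lines.
my $line = qq{<td class="x">first\nsecond</td>};

# Without /s, "." stops at the newline, so the match fails.
my $matched_without_s = ($line =~ /<td(.*?)>(.*?)<\/td>/) ? 1 : 0;

# With /s, "." also matches "\n", so both captures succeed.
my ($attrs, $content) = $line =~ /<td(.*?)>(.*?)<\/td>/s;

print "matched without /s: $matched_without_s\n";
print "attributes with /s: [$attrs]\n";
print "content with /s:    [$content]\n";
```

This assumes the whole cell is held in `$line`; if the file is read line by line, the text has to be joined or slurped first, or the closing tag will never be in the string being matched.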
Re^2: parsing html by qingxia (Novice) on Mar 21, 2013 at 23:03 UTC
thx to McA. It works well.

