It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.
HTML::Parser has a fairly simple callback interface that reports parsing events as the parser recognizes them. Since the HTML and TXL files contain the sentences in the same order, and provided the pattern in your example (each sentence wrapped in paragraph tags) holds for your actual data, your text callback need only append to a buffer and check that buffer against the pending lines that remain. I am assuming that a closing paragraph tag cannot occur in the middle of a sentence, so you can clear the buffer when a paragraph ends.
To speed parsing, you can use the ->report_tags filter so that start and end events are reported only for paragraph tags, and use the "skipped_text" callback parameter to recover the skipped tags for output.
To speed matching, use qr// on the lines as you read them from the TXL file into the pending sentences array (with \Q...\E if the lines are literal text rather than patterns). When a sentence matches, simply splice the array to remove the skipped sentences: splice @pending, 0, $i if $text_buffer =~ $pending[$i];. Use $i + 1 as the length if the matched sentence itself should be removed as well. If you need other data along with the regexes, store each entry as an array reference pairing the regex with that data, much like the decorate step of a Schwartzian Transform.
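A minimal sketch of the matching step, using only core Perl so it can be tried without any parser. The sample lines, the consume_match helper, and the simulated text chunks are all made up for illustration; in a real script the chunks would come from your HTML::Parser text callback and the lines from the TXL file. It also assumes the text in the HTML matches the TXL lines character for character, including whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of the TXL file.
my @lines = (
    'A sentence missing from the HTML.',
    'The quick brown fox.',
    'Jumps over the lazy dog.',
);

# Pair each precompiled regex with its original line so the matched
# sentence can be reported later. \Q...\E quotes regex metacharacters,
# which treats every line as literal text.
my @pending = map { [ qr/\Q$_\E/, $_ ] } @lines;

# Check the accumulated text against the pending sentences, in order.
# On a match, remove the matched entry and everything before it, then
# return the original line. Returns nothing if no sentence matches yet.
sub consume_match {
    my ($text_buffer) = @_;
    for my $i (0 .. $#pending) {
        next unless $text_buffer =~ $pending[$i][0];
        my $line = $pending[$i][1];
        splice @pending, 0, $i + 1;    # $i + 1 also drops the match itself
        return $line;
    }
    return;
}

# Simulated text events for one paragraph, as a text callback might
# receive them when inline markup splits the sentence into pieces.
my $text_buffer = '';
my @found;
for my $chunk ('The quick ', 'brown fox.') {
    $text_buffer .= $chunk;
    my $line = consume_match($text_buffer);
    next unless defined $line;
    push @found, $line;
    $text_buffer = '';                 # start fresh after a match
}

print "matched: $_\n" for @found;
print "still pending: ", scalar(@pending), "\n";
```

In the parser itself, the body of that loop would live in the handler registered as text_h (with "dtext" in its argspec so entities arrive decoded), and the buffer would also be cleared in the end handler for paragraph tags.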
If performance is a concern, you will be pleased to hear that HTML::Parser is an XS module and runs the main parsing loop in XS code. I once thought it was slow, until I realized that my script was chewing through a few dozen HTML files in about 30 seconds and that loading one of those pages (from the local disk!) into my browser needed about 10 seconds. (These were large documents.)