
It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.

HTML::Parser has a fairly simple callback interface that reports parsing events as the parser recognizes them. Since the HTML and TXL files will contain the sentences in the same order, and if the pattern in your example (sentences wrapped in paragraph tags) holds for your actual data, your text callback need only maintain a buffer and check that buffer against the remaining pending lines. I am assuming that a closing paragraph tag cannot occur in the middle of a sentence, so you can clear the buffer when a paragraph ends.

To speed parsing, you can use the ->report_tags filter to select only paragraph tags and use the "skipped_text" callback parameter to recover the skipped tags for output. To speed matching, use qr// on the lines as you read them from the TXL file into the pending sentences array. When the sentence at index $i matches, simply splice the array to remove the skipped sentences before it: splice @pending, 0, $i if $text_buffer =~ $pending[$i]; Consider a Schwartzian Transform if you need other data along with the regexes.
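Here is a rough sketch of just the matching step that the text callback would perform, using only core Perl. The sample lines, the helper name consume_match, and the use of \Q...\E to match the sentences literally are my own assumptions, not something from your data. I am also assuming you want to drop the matched sentence along with the skipped ones so it cannot match twice; use $i instead of $i + 1 if you would rather keep it.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from the TXL file.
my @lines = (
    "First sentence.",
    "Second sentence (missing from the HTML).",
    "Third sentence.",
);

# Compile each line once; \Q...\E makes punctuation match literally
# (an assumption: drop it if your lines are meant to be patterns).
my @pending = map { qr/\Q$_\E/ } @lines;

# Check the buffer against the pending sentences, in order.
# Returns the number of entries removed, or 0 if nothing matched.
sub consume_match {
    my ($text_buffer, $pending) = @_;
    for my $i (0 .. $#$pending) {
        next unless $text_buffer =~ $pending->[$i];
        # Remove the skipped sentences and the matched one.
        splice @$pending, 0, $i + 1;
        return $i + 1;
    }
    return 0;
}

# In the real script these strings would be the buffer built up
# by the HTML::Parser text callback.
for my $buffer ("First sentence.", "Some markup text. Third sentence.") {
    my $removed = consume_match($buffer, \@pending);
    print "removed $removed, ", scalar(@pending), " still pending\n";
}
```

Passing the array by reference lets the helper shrink the pending list in place, so each later check only scans the sentences that have not been seen yet.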

If performance is a concern, you will be pleased to hear that HTML::Parser is an XS module and runs the main parsing loop in XS code. I once thought it was slow, until I realized that my script was chewing through a few dozen HTML files in about 30 seconds and that loading one of those pages (from the local disk!) into my browser needed about 10 seconds. (These were large documents.)

In reply to Re: Match text from txt to html by jcb
in thread Match text from txt to html by corfuitl
