Re: Match text from txt to html

It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.

HTML::Parser has a fairly simple callback interface that reports parsing events as the parser recognizes them. If the HTML and TXT files contain the sentences in the same order, and if the pattern in your example (each sentence wrapped in paragraph tags) holds for your actual data, your text callback need only maintain a buffer and check that buffer against the remaining pending lines. I am assuming that a closing paragraph tag cannot occur in the middle of a sentence, so you can clear the buffer when a paragraph ends.

To speed parsing, you can use the ->report_tags filter to select only paragraph tags and use the "skipped_text" callback parameter to recover the skipped tags for output.

To speed matching, use qr// on the lines as you read them from the TXT file into the pending sentences array, quoting each line with \Q...\E so that punctuation in a sentence is matched literally. When a sentence matches, simply splice the array to remove the skipped sentences: splice @pending, 0, $i if $text_buffer =~ $pending[$i];. If you need other data along with the regexes, consider storing [regex, data] pairs in the array, in the style of a Schwartzian Transform.
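The matching step described above could look roughly like the following. This is only a sketch using core Perl: the sample sentences and the match_pending helper are made up for illustration, and it assumes you also want to drop the matched sentence itself (hence $i + 1 rather than $i in the splice).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from the TXT file.
my @lines = (
    'The quick brown fox.',
    'It cost $5 (approx.).',
    'A third sentence.',
);

# Compile each line once; \Q...\E quotes regex metacharacters so the
# sentence is matched literally.
my @pending = map { qr/\Q$_\E/ } @lines;

# Hypothetical helper: find the first pending sentence that occurs in the
# buffer, drop it together with every sentence before it, and return its
# index. Returns -1 when nothing matches.
sub match_pending {
    my ($text_buffer) = @_;
    for my $i (0 .. $#pending) {
        if ($text_buffer =~ $pending[$i]) {
            # Assumption: the matched sentence is removed as well.
            splice @pending, 0, $i + 1;
            return $i;
        }
    }
    return -1;
}

my $hit = match_pending('Intro. It cost $5 (approx.). More text');
print "matched index $hit, ", scalar(@pending), " sentence(s) still pending\n";
```

Because the sentences appear in the same order in both files, anything before the match can never match later, which is why the whole prefix can be discarded in one splice.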

If performance is a concern, you will be pleased to hear that HTML::Parser is an XS module and runs the main parsing loop in XS code. I once thought it was slow, until I realized that my script was chewing through a few dozen HTML files in about 30 seconds and that loading one of those pages (from the local disk!) into my browser needed about 10 seconds. (These were large documents.)

Re^2: Match text from txt to html by Anonymous Monk on Sep 05, 2019 at 01:00 UTC
"It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input."
No. HTML::Parser is low level; it doesn't give you a tree. An HTML document is a tree (Document Object Model). You can use XML::Twig, HTML::TreeBuilder::XPath, XML::LibXML ... or, as marto shows, Mojo::DOM.
Re^3: Match text from txt to html by jcb (Parson) on Sep 05, 2019 at 01:40 UTC
Just as a text file is both a set of lines and a stream of bytes, an HTML document is both a tree and a stream of elements. `HTML::Parser` extracts the latter, which is equivalent to walking the DOM tree in document order. The advantage of using `HTML::Parser` for an application like this is the same as the advantage of processing a text file line-by-line without reading the whole file into memory. While it is unlikely that an HTML document would not fit into memory on a client, our questioner could be building something that runs on a server, with one instance of the program per concurrent client connection; the aggregate memory use can quickly become very large if many clients are active. In this case, building the entire tree in memory is unnecessary because the transformation to be applied is very simple: find and mark occurrences of certain text in a finite sliding window. If this is running on a server, building the DOM tree in memory is both wasteful and foolish, creating an opportunity for easy DoS attacks. Put simply, if you do not actually need the DOM tree, do not waste time and memory building it!
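To make the "stream of elements" point concrete, here is a rough sketch that feeds a document to `HTML::Parser` in chunks and keeps only the current paragraph's text in memory. The sample sentences, the chunks, and the on_text handler name are invented for illustration. It assumes the api_version 3 interface and that dropping the matched sentence along with the skipped ones is what you want. It only collects the matches, so marking them in the output is left out.

```perl
use strict;
use warnings;
use HTML::Parser;    # from CPAN; not a core module

# Hypothetical sample data: TXT lines and an HTML document arriving in pieces.
my @sentences = ('First sentence.', 'Second sentence.');
my @chunks    = (
    '<html><body><p>Intro. First sen',
    'tence.</p><div><p>Second ',
    '<b>sentence</b>.</p></div></body></html>',
);

# Keep the original text next to each compiled regex.
my @pending     = map { [ qr/\Q$_\E/, $_ ] } @sentences;
my $text_buffer = '';    # the finite window: text of the current paragraph
my @found;

# Text handler: extend the window and test it against the pending sentences.
sub on_text {
    my ($dtext) = @_;
    $text_buffer .= $dtext;
    for my $i (0 .. $#pending) {
        next unless $text_buffer =~ $pending[$i][0];
        push @found, $pending[$i][1];
        splice @pending, 0, $i + 1;    # drop skipped and matched sentences
        last;
    }
}

my $parser = HTML::Parser->new(
    api_version => 3,
    # Every reported <p> or </p> resets the window.
    start_h => [ sub { $text_buffer = '' }, 'tagname' ],
    end_h   => [ sub { $text_buffer = '' }, 'tagname' ],
    text_h  => [ \&on_text, 'dtext' ],
);
$parser->report_tags('p');    # start/end events for other tags are not reported

$parser->parse($_) for @chunks;    # no tree is ever built
$parser->eof;

print "found: $_\n" for @found;
```

Memory use here is bounded by the longest paragraph plus the pending list, not by the size of the document.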
Re^4: Match text from txt to html by Anonymous Monk on Sep 06, 2019 at 04:04 UTC
Ever used XML::Twig or XML::LibXML? Ever heard of them? They both give you all the DOM goodness in streaming mode; PerlMonks is full of examples.
Re^5: Match text from txt to html by jcb (Parson) on Sep 06, 2019 at 04:09 UTC
Re^6: Match text from txt to html by Your Mother (Archbishop) on Sep 06, 2019 at 05:03 UTC
Re^6: Match text from txt to html by Anonymous Monk on Sep 06, 2019 at 08:31 UTC

