Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re: Match text from txt to html

by jcb (Parson)
on Sep 04, 2019 at 22:21 UTC ( [id://11105637]=note: print w/replies, xml ) Need Help??


in reply to Match text from txt to html

It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.

HTML::Parser has a fairly simple callback interface that provides parsing events as the parser recognizes them. Since the HTML and TXL files will contain the sentences in the same order, and if the pattern in your example of the sentences being wrapped in paragraph tags holds for your actual data, your text callback need only maintain a buffer and check that buffer against the remaining pending lines. To speed parsing, you can use the ->report_tags filter to select only paragraph tags and use the "skipped_text" callback parameter to recover the skipped tags for output. I am assuming that a closing paragraph tag cannot occur in the middle of a sentence, so you can clear the buffer when a paragraph ends. To speed matching, use qr// on the lines as you read them from the TXL file into the pending sentences array. When a sentence matches, simply splice the array to remove the skipped sentences: splice @pending, 0, $i if $text_buffer =~ $pending[$i];. Consider a Schwartzian Transform if you need other data along with the regexes.

If performance is a concern, you will be pleased to hear that HTML::Parser is an XS module and runs the main parsing loop in XS code. I once thought it was slow, until I realized that my script was chewing through a few dozen HTML files in about 30 seconds and that loading one of those pages (from the local disk!) into my browser needed about 10 seconds. (These were large documents.)

Replies are listed 'Best First'.
Re^2: Match text from txt to html
by Anonymous Monk on Sep 05, 2019 at 01:00 UTC

      Just as a text file is both a set of lines and a stream of bytes, an HTML document is both a tree and a stream of elements. HTML::Parser extracts the latter, which is equivalent to walking the DOM tree in some order. The advantage of using HTML::Parser for an application like this is the same as the advantage of processing a text file line-by-line without reading the whole file into memory.

      While it is unlikely that an HTML document would not fit into memory on a client, our questioner could be building something that runs on a server, with an instance of the program for each concurrent client connection which can quickly become very large in aggregate if many clients are active. In this case, building the entire tree in memory is unnecessary because the transformation to be applied is very simple: find and mark ocurrances of certain text in a finite sliding window. If this is running on a server, building the DOM tree in memory is both wasteful and foolish, creating an opportunity for easy DoS attacks.

      Put simply, if you do not actually need the DOM tree, do not waste time and memory building it!

        Ever used XML::Twig or XML::LibXML? Ever heard of them? They both give you all the DOM goodness in steaming mode, perlmonks is full of examples

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11105637]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2024-04-25 22:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found