http://qs321.pair.com?node_id=11105589

corfuitl has asked for the wisdom of the Perl Monks concerning the following question:

Hi perlmonks

I have a problem to solve and I would appreciate your help.

I have a TXT file that contains one sentence per line, and a corresponding HTML file. I would like to mark in the HTML which line of the TXT file each sentence matches. Please note that a single line of the HTML may contain more than one sentence.

For example, my TXT file looks like:

This is sentence 1
This is sentence 2
This is sentence 1
This is sentence 3

And the HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<body>
<p>This is sentence 1</p>
<p>This is <font color="#000000"><font face="Century Gothic, serif">sentence 2</font></font></p>
<p>This is sentence 1. This is sentence <span style="font-weight: normal">3</span> </p>
</body>
</html>

As you can see, the html file contains some extra html code, which makes things complex.

What I want to do is to add some html code and wrap the sentence:

<p><sentence id="1">This is sentence 1</sentence></p>
<p><sentence id="2">This is <font color="#000000"><font face="Century Gothic, serif">sentence 2</font></font></p></sentence>
<p><sentence id="3">This is sentence 1. </sentence><sentence id="4">This is sentence <span style="font-weight: normal">3</span></sentence></p>

The order of the TXT corresponds to the order of the HTML. If no match is found, it should move on to the next segment. Any ideas?

Thanks in advance for your help

Replies are listed 'Best First'.
Re: Match text from txt to html
by marto (Cardinal) on Sep 04, 2019 at 14:17 UTC

    Here is something to get you started. Using Mojo::DOM, parse the HTML, match a single sentence then alter the DOM to do what you want:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;

    # slurp in from file, or get using Mojo::UserAgent...
    my $html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <body>
    <p>This is sentence 1</p>
    <p>This is <font color="#000000"><font face="Century Gothic, serif">sentence 2</font></font></p>
    <p>This is sentence 1. This is sentence <span style="font-weight: normal">3</span> </p>
    </body>
    </html>';

    # new Mojo::DOM
    my $dom = Mojo::DOM->new( $html );

    # for each p tag found
    for my $e ( $dom->find('p')->each ){
        # use the all_text method to get all of the visible text, including from
        # descending tags. In this short example to get you started match a specific
        # string.
        if ( $e->all_text eq 'This is sentence 2' ){
            # once we have a match, wrap the node around this:
            $e->wrap_content('<sentence id="2"></sentence>');
        }
    }

    # print to screen, do whatever you want with the results.
    print $dom->content;

    Produces:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <body>
    <p>This is sentence 1</p>
    <p><sentence id="2">This is <font color="#000000"><font face="Century Gothic, serif">sentence 2</font></font></sentence></p>
    <p>This is sentence 1. This is sentence <span style="font-weight: normal">3</span> </p>
    </body>
    </html>

    Now all you have to do is make this generic, but that's fairly trivial, and I'll leave that as an exercise for you.

    See also Re: Batch remove URLs or Super Search for more examples.

    Update: I assume the ordering of the second p/sentence tag here was a mistake, since it closes the p tag before the sentence tag:

    <p><sentence id="1">This is sentence 1</sentence></p>
    <p><sentence id="2">This is <font color="#000000"><font face="Century Gothic, serif">sentence 2</font></font></p></sentence>
    <p><sentence id="3">This is sentence 1. </sentence><sentence id="4">This is sentence <span style="font-weight: normal">3</span></sentence></p>

      Thanks! Will try and will let you know. Is there any way to search in all nodes as I don't know the nodes of the html files? They are created automatically.

        # for each tag found
        for my $e ( $dom->find('*')->each ){
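The "make this generic" step could look roughly like the sketch below. It is not marto's code: it assumes every sentence fills a whole `<p>` element, and the inline `@sentences` and `$html` are hypothetical stand-ins for the contents of the TXT and HTML files. Paragraphs that hold several sentences, like the third one in the question, would need extra splitting logic that is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical inputs: in practice read @sentences from the TXT file
# (one per line, chomped) and slurp $html from the HTML file.
my @sentences = ( 'This is sentence 1', 'This is sentence 2' );
my $html = '<body><p>This is sentence 1</p>'
         . '<p>This is <font color="#000000">sentence 2</font></p></body>';

my $dom  = Mojo::DOM->new($html);
my $next = 0;    # index of the next TXT line we expect to see

for my $e ( $dom->find('p')->each ) {
    last if $next > $#sentences;

    # Collapse whitespace so line breaks inside the HTML do not block a match.
    my $text = $e->all_text;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;

    # No match: leave this paragraph alone and try the next one.
    next unless $text eq $sentences[$next];

    # Match: the id is the 1-based line number in the TXT file.
    $next++;
    $e->wrap_content(qq{<sentence id="$next"></sentence>});
}

print "$dom\n";
```

Because the TXT order follows the HTML order, a single index into `@sentences` is enough; nothing is ever searched backwards.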
Re: Match text from txt to html
by talexb (Chancellor) on Sep 04, 2019 at 13:41 UTC

    Lots of ideas! But first, tell us what you've tried in Perl.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Hi,

      Thank you for your reply.

      To be honest, I have no idea... I know Perl but I don't know where to start.

      What I did was read the TXT file into an array, then read the HTML line by line and match the sentences against the text with the tags stripped.

        Great! And is the code working correctly? (Quietly loads the confetti cannon.)

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re: Match text from txt to html
by jcb (Parson) on Sep 04, 2019 at 22:21 UTC

    It looks like you want a parser. HTML::Parser will do almost exactly what you want, although you may need to handle the extra HTML tags in the input.

    HTML::Parser has a fairly simple callback interface that provides parsing events as the parser recognizes them. Since the HTML and TXT files will contain the sentences in the same order, and if the pattern in your example of the sentences being wrapped in paragraph tags holds for your actual data, your text callback need only maintain a buffer and check that buffer against the remaining pending lines. To speed parsing, you can use the ->report_tags filter to select only paragraph tags and use the "skipped_text" callback parameter to recover the skipped tags for output. I am assuming that a closing paragraph tag cannot occur in the middle of a sentence, so you can clear the buffer when a paragraph ends. To speed matching, use qr// on the lines as you read them from the TXT file into the pending sentences array. When a sentence matches, simply splice the array to remove the skipped sentences: splice @pending, 0, $i if $text_buffer =~ $pending[$i];. Consider a Schwartzian Transform if you need other data along with the regexes.
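The pending-sentences idea can be sketched in core Perl, independently of the parser. The helper names `compile_pending` and `match_buffer` are made up for this illustration, and the sentences are placeholder data. Unlike the one-liner above, this variant also removes the matched sentence itself (it splices `$i + 1` entries), so each TXT line is consumed at most once; it keeps the line number next to each regex so the id survives the splicing.

```perl
use strict;
use warnings;

# Precompile each TXT line into a literal-match regex, paired with its
# 1-based line number (hypothetical helper, not part of any module).
sub compile_pending {
    my @lines = @_;
    return [ map { +{ id => $_ + 1, re => qr/\Q$lines[$_]\E/ } } 0 .. $#lines ];
}

# Return the line number of the first pending sentence found in $buffer,
# or undef if none matches. Skipped sentences and the matched one are
# dropped from the pending list.
sub match_buffer {
    my ( $pending, $buffer ) = @_;
    for my $i ( 0 .. $#$pending ) {
        next unless $buffer =~ $pending->[$i]{re};
        my $id = $pending->[$i]{id};
        splice @$pending, 0, $i + 1;
        return $id;
    }
    return;
}

my $pending = compile_pending(
    'This is sentence 1', 'This is sentence 2', 'This is sentence 3',
);
my $id = match_buffer( $pending, 'This is sentence 2' );
print "matched line $id, ", scalar(@$pending), " still pending\n";
# prints "matched line 2, 1 still pending"
```

Calling `match_buffer` repeatedly on the same buffer picks up further sentences in that paragraph, one per call, until it returns undef.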

    If performance is a concern, you will be pleased to hear that HTML::Parser is an XS module and runs the main parsing loop in XS code. I once thought it was slow, until I realized that my script was chewing through a few dozen HTML files in about 30 seconds and that loading one of those pages (from the local disk!) into my browser needed about 10 seconds. (These were large documents.)

        Just as a text file is both a set of lines and a stream of bytes, an HTML document is both a tree and a stream of elements. HTML::Parser extracts the latter, which is equivalent to walking the DOM tree in some order. The advantage of using HTML::Parser for an application like this is the same as the advantage of processing a text file line-by-line without reading the whole file into memory.

        While it is unlikely that an HTML document would not fit into memory on a client, our questioner could be building something that runs on a server, with one instance of the program per concurrent client connection. Memory use can then become very large in aggregate when many clients are active. In this case, building the entire tree in memory is unnecessary because the transformation to be applied is very simple: find and mark occurrences of certain text in a finite sliding window. If this is running on a server, building the DOM tree in memory is both wasteful and foolish, creating an opportunity for easy DoS attacks.

        Put simply, if you do not actually need the DOM tree, do not waste time and memory building it!
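To make the streaming approach concrete, here is a minimal sketch using HTML::Parser's version 3 handler API. It is only one possible arrangement, not jcb's code: it assumes each sentence fills a whole paragraph, it buffers just the current paragraph, and it does not use report_tags or skipped_text. The helper names `emit` and `flush_paragraph` and the inline `@pending` and `$html` data are invented for the example.

```perl
use strict;
use warnings;
use HTML::Parser;

my @pending = ( 'This is sentence 1', 'This is sentence 2' );  # hypothetical TXT lines
my $html = '<p>This is sentence 1</p>'
         . '<p>This is <font color="#000000">sentence 2</font></p>';

# $raw holds the original markup of the current paragraph, $text its
# decoded text; $out collects the rewritten document.
my ( $in_p, $raw, $text, $out, $id ) = ( 0, '', '', '', 0 );

# Route a chunk of original markup to the paragraph buffer or the output.
sub emit {
    my ($chunk) = @_;
    if ($in_p) { $raw .= $chunk } else { $out .= $chunk }
}

# At the end of a paragraph: wrap its content if its text is the next
# pending sentence, then write it out and reset the buffers.
sub flush_paragraph {
    ( my $norm = $text ) =~ s/\s+/ /g;
    $norm =~ s/^ | $//g;
    if ( @pending && $norm eq $pending[0] ) {
        shift @pending;
        $id++;
        $raw = qq{<sentence id="$id">$raw</sentence>};
    }
    $out .= $raw;
    ( $in_p, $raw, $text ) = ( 0, '', '' );
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ( $tag, $orig ) = @_;
        if ( $tag eq 'p' ) {
            flush_paragraph() if $in_p;    # previous <p> was never closed
            $out .= $orig;
            $in_p = 1;
        }
        else { emit($orig) }
    }, 'tagname, text' ],
    end_h => [ sub {
        my ( $tag, $orig ) = @_;
        if ( $tag eq 'p' && $in_p ) { flush_paragraph(); $out .= $orig }
        else                        { emit($orig) }
    }, 'tagname, text' ],
    text_h => [ sub {
        my ( $dtext, $orig ) = @_;
        $text .= $dtext if $in_p;
        emit($orig);
    }, 'dtext, text' ],
    default_h => [ sub { emit( $_[0] ) }, 'text' ],   # comments, doctype, etc.
);

$parser->parse($html);
$parser->eof;
flush_paragraph() if $in_p;    # document ended inside a paragraph

print "$out\n";
```

For real files, `$out` could be replaced by printing to a filehandle, so that memory use stays bounded by the size of the largest paragraph rather than the whole document.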