Parsing a large html with perl

zesys has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl monks, first time asking a question. My apologies in advance if my post does not meet the house rules.

So, I am trying to extract information from a large html file.

Example blocks in the html:

---
<tr>
<td class="confluenceTd">AL2H </td>
<td class="confluenceTd">NANOMETRICSTITANSMA000357 </td>
<td class="confluenceTd">2017-03-09T00:00:00.000Z </td>
<td class="confluenceTd">2017-05-25T12:00:00.000Z </td>
<td class="confluenceTd"><p><span class="image-wrap" style=""><img src="/download/attachments/49449087/true.png?versio$
<td class="confluenceTd">48.38981 </td>
<td class="confluenceTd">-123.48739 </td>
<td class="confluenceTd">-29.0 </td>
<td class="confluenceTd">&nbsp;</td>
</tr>

---

<tr>
<td class="confluenceTd">BACAX </td>
<td class="confluenceTd">RDIADCP600WH9339 </td>
<td class="confluenceTd">2011-07-15T18:42:25.000Z </td>
<td class="confluenceTd">2012-05-30T01:12:03.000Z </td>
<td class="confluenceTd"><p><span class="image-wrap" style=""><img src="/download/attachments/49449087/true.png?versio$
<td class="confluenceTd">48.316762 </td>
<td class="confluenceTd">-126.050163 </td>
<td class="confluenceTd">985.0 </td>
<td class="confluenceTd">221.0 </td>
</tr>
---

What my code does at the moment: it copies the first four cells of the first HTML block above and prints them with their labels.

locationCode: AL2H
deviceCode: NANOMETRICSTITANSMA000357
dateFrom: 2017-03-09T00:00:00.000Z
dateTo: 2017-05-25T12:00:00.000Z

What I would like to achieve:

1. Do the same thing as above by looping through similar blocks.

2. Extract only blocks that have the substring "RDI" in their second line (e.g., RDIADCP600WH9339 in the second block shown above).

I can try 2 if I can get help with 1.

Thank you.

My semi-working code is below. As you can see, I am storing the html page in a variable, $scrappy.

 
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Term::ANSIColor qw(:constants);

my $scrappy =
`curl -s 'https://wiki.oceannetworks.ca/display/O2A/Available+Deployments' 2>&1`;

my $lineX;

my $count = 0;

foreach $lineX ( split /\n/, $scrappy ) {

    if ( $lineX =~ /^\s*$/ ) {    # Skip blank or whitespace-only lines
        next;
    }
    my @F = split( " ", $lineX );
    my $mylen = length $lineX;

    if ( $mylen >= 2 ) {    # numeric comparison; "ge" compares as strings

        if (    ( $F[0] eq '<td' )
            and ( $F[-1] eq '</td>' )
            and ( $F[-1] ne '</p></td>' ) )
        {

            my @f = split />/, $F[1];

            $count++;

            if ( $count == 1 ) {
                print "locationCode: $f[1]\n";
            }
            elsif ( $count == 2 ) {
                print "deviceCode: $f[1]\n";

            }
            elsif ( $count == 3 ) {
                print "dateFrom: $f[1]\n";
            }
            elsif ( $count == 4 ) {
                print "dateTo: $f[1]\n";
            }

        }
    }
}
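A note on why the code above prints only the first block: `$count` is incremented for every matching cell on the whole page and is never reset, so once it passes 4 none of the branches match again. What follows is a minimal line-based sketch that starts a fresh row at each `<tr>`. It assumes one simple text cell per line, as in the sample blocks; the helper name `extract_rows` and the trimmed inline sample are hypothetical, and a real HTML parser (as the replies recommend) will be more robust than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the first four text cells of every <tr> block.
# Assumes one simple cell per line, as in the sample blocks of the question.
sub extract_rows {
    my ($html) = @_;
    my @labels = qw(locationCode deviceCode dateFrom dateTo);
    my ( @rows, @cells );
    for my $line ( split /\n/, $html ) {
        if ( $line =~ /<tr\b/ ) {
            @cells = ();    # a new block starts: forget the previous cells
        }
        elsif ( $line =~ m{^\s*<td[^>]*>\s*([^<]*?)\s*</td>\s*$} ) {
            push @cells, $1;    # text-only cell; the image cell does not match
        }
        elsif ( $line =~ m{</tr>} and @cells >= 4 ) {
            my %row;
            @row{@labels} = @cells[ 0 .. 3 ];    # label the first four cells
            push @rows, \%row;
        }
    }
    return @rows;
}

# Trimmed sample based on the two blocks in the question.
my $sample = <<'HTML';
<tr>
<td class="confluenceTd">AL2H </td>
<td class="confluenceTd">NANOMETRICSTITANSMA000357 </td>
<td class="confluenceTd">2017-03-09T00:00:00.000Z </td>
<td class="confluenceTd">2017-05-25T12:00:00.000Z </td>
<td class="confluenceTd">48.38981 </td>
</tr>
<tr>
<td class="confluenceTd">BACAX </td>
<td class="confluenceTd">RDIADCP600WH9339 </td>
<td class="confluenceTd">2011-07-15T18:42:25.000Z </td>
<td class="confluenceTd">2012-05-30T01:12:03.000Z </td>
<td class="confluenceTd">48.316762 </td>
</tr>
HTML

for my $row ( extract_rows($sample) ) {
    next unless $row->{deviceCode} =~ /RDI/;    # goal 2: keep only RDI devices
    print "$_: $row->{$_}\n" for qw(locationCode deviceCode dateFrom dateTo);
    print "\n";
}
```

Dropping the `next unless` line gives goal 1 (every block); keeping it gives goal 2.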


Replies are listed 'Best First'.
Re: Parsing a large html with perl by haukex (Archbishop) on Jun 02, 2020 at 21:23 UTC
Welcome to the Monastery, zesys!

The top of that page says:

The following is dynamic list of all of the deployments that have data. It is being pulled from the deployments web service using the URL https://data.oceannetworks.ca/api/deployments?method=get&token=[YOUR_TOKEN_HERE]

Why don't you just use that API?

Anyway, if you need to parse HTML, then don't use regular expressions. Here's an example with Mojo::DOM:

use warnings;
use strict;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new( max_redirects=>3 );
my $dom = $ua->get( 'https://wiki.oceannetworks.ca/display/O2A/Available+Deployments' )->result->dom;

$dom->find('.confluenceTable tr')->each(sub {
    my $tr = shift;
    my ($locationCode, $deviceCode, $dateFrom, $dateTo) =
        map { $tr->find(".confluenceTd:nth-of-type($_)")
            ->map('all_text')->join } 1..4;
    print "locationCode=$locationCode, deviceCode=$deviceCode, ",
        "dateFrom=$dateFrom, dateTo=$dateTo\n";
});
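Building on the Mojo::DOM example above, here is a minimal sketch of how goal 2 from the question (keeping only rows whose deviceCode contains "RDI") might be handled. It parses an inline sample rather than the live page so it can be run offline. The helper name `rdi_rows` and the trimmed sample table are hypothetical, and the sample assumes the live table carries the `confluenceTable` class that the selector above relies on.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed, hypothetical sample modelled on the blocks in the question.
my $html = <<'HTML';
<table class="confluenceTable">
<tr><th class="confluenceTh">locationCode</th><th class="confluenceTh">deviceCode</th></tr>
<tr>
<td class="confluenceTd">AL2H </td>
<td class="confluenceTd">NANOMETRICSTITANSMA000357 </td>
<td class="confluenceTd">2017-03-09T00:00:00.000Z </td>
<td class="confluenceTd">2017-05-25T12:00:00.000Z </td>
</tr>
<tr>
<td class="confluenceTd">BACAX </td>
<td class="confluenceTd">RDIADCP600WH9339 </td>
<td class="confluenceTd">2011-07-15T18:42:25.000Z </td>
<td class="confluenceTd">2012-05-30T01:12:03.000Z </td>
</tr>
</table>
HTML

# Hypothetical helper: return [locationCode, deviceCode, dateFrom, dateTo]
# for every table row whose second cell contains "RDI".
sub rdi_rows {
    my ($markup) = @_;
    my @rows;
    Mojo::DOM->new($markup)->find('.confluenceTable tr')->each(sub {
        my $tr    = shift;
        my @cells = $tr->find('td.confluenceTd')->map('all_text')->each;
        return unless @cells >= 4;            # header rows have no <td> cells
        s/^\s+|\s+$//g for @cells;            # trim surrounding whitespace
        return unless $cells[1] =~ /RDI/;     # keep only RDI devices
        push @rows, [ @cells[ 0 .. 3 ] ];
    });
    return @rows;
}

for my $row ( rdi_rows($html) ) {
    my ( $locationCode, $deviceCode, $dateFrom, $dateTo ) = @$row;
    print "locationCode=$locationCode, deviceCode=$deviceCode, ",
        "dateFrom=$dateFrom, dateTo=$dateTo\n";
}
```

To run it against the live page, the `$html` string could be swapped for the `$dom` fetched with Mojo::UserAgent in the example above.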
Re^2: Parsing a large html with perl by zesys (Novice) on Jun 04, 2020 at 04:42 UTC
Thanks so much @haukex. I added two lines of code to yours (I had two questions), and the problem is solved! Regarding the API, I use the service almost every day through client libraries written for Python. I just wanted to do things differently this time by using Perl, for which the organisation does not seem to have a client library. Thank you all for your prompt answers and suggestions!!
Re^3: Parsing a large html with perl by marto (Cardinal) on Jun 04, 2020 at 07:41 UTC
You don't need them to provide a client library in Perl; writing your own is reasonably straightforward. The advantage of using their API is that, generally speaking, it is less susceptible to change than a web page. Super Search for mojo api will find results to get you started.
Re^2: Parsing a large html with perl by perlfan (Vicar) on Jun 03, 2020 at 04:17 UTC
OP, please do use the URL at https://wiki.oceannetworks.ca/display/O2A/API+Reference that haukex pointed out. It's an HTTP::Tiny call away! (Hopefully an https URL is available.) It's JSON! You'll learn a lot and be glad you did.

Note: If you do it right, you could get a Perl client listed in there.

Also, see if it'll accept the query string via POST body; be sure to set the Content-Type header of the request to `application/x-www-form-urlencoded`. The reason is that sending your special token via a GET request is going to get it logged everywhere, and `https` does not protect it from that. Sometimes endpoints will accept a POST just the same. If it's just `http`, then sending it via POST, if accepted, will at least keep your URL from getting logged everywhere with that token in it.

If you insist on parsing the HTML and it really is just a large simple table, take a look at HTML::TableExtract.
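The suggestion above could look roughly like the following sketch, which uses only HTTP::Tiny and JSON::PP (both core modules; HTTPS additionally needs IO::Socket::SSL installed). Several things here are assumptions, not confirmed by the thread: that the endpoint accepts a form-encoded POST, that the response is a JSON array of objects, and that those objects use the field names `locationCode` and `deviceCode` as the wiki table does. The helper name `filter_deployments` and the `ONC_TOKEN` environment variable are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Hypothetical helper: decode a JSON array of deployments and keep the
# entries whose deviceCode contains $substr. Field names are assumed.
sub filter_deployments {
    my ( $json_text, $substr ) = @_;
    my $list = decode_json($json_text);
    return grep { index( $_->{deviceCode} // '', $substr ) >= 0 } @$list;
}

# The network part only runs when a token is supplied in the environment.
if ( my $token = $ENV{ONC_TOKEN} ) {
    my $http = HTTP::Tiny->new( timeout => 30 );
    my $url  = 'https://data.oceannetworks.ca/api/deployments';
    my %form = ( method => 'get', token => $token );

    # post_form sends the parameters in the request body with
    # Content-Type: application/x-www-form-urlencoded.
    my $res = $http->post_form( $url, \%form );

    # Assumption: if POST is rejected, fall back to GET. Note that this
    # puts the token into the URL, where it may end up in logs.
    $res = $http->get( "$url?" . $http->www_form_urlencode( \%form ) )
        unless $res->{success};

    die "Request failed: $res->{status} $res->{reason}\n"
        unless $res->{success};

    printf "%s %s\n", $_->{locationCode} // '?', $_->{deviceCode}
        for filter_deployments( $res->{content}, 'RDI' );
}
```

Keeping the decoding and filtering in a separate function means it can be tried on a saved response without touching the network.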
Re^3: Parsing a large html with perl by marto (Cardinal) on Jun 03, 2020 at 07:49 UTC
It usually makes more sense to reply to the OP if that is who you are addressing. Your advice assumes they have API access, which may not be the case. The Mojo solution provided can deal just as easily with a JSON response as with the HTML.
Re^3: Parsing a large html with perl by zesys (Novice) on Jun 04, 2020 at 05:16 UTC
Thanks @perlfan. I will try your first suggestion. I admit, as a non-developer, I often find it a daunting task making sense of a JSON response.
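For readers in the same position, one way to get a feel for an unfamiliar JSON response is to decode it, pretty-print it, and print an outline of its shape. The sketch below does that with the core JSON::PP module. The sample document and the helper name `outline` are hypothetical, and the field names are only assumed to resemble what the deployments service returns.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample; a real response would come from the web service.
my $json_text = '[{"locationCode":"BACAX","deviceCode":"RDIADCP600WH9339",'
    . '"position":{"lat":48.316762,"lon":-126.050163}}]';

my $data = decode_json($json_text);

# Pretty-print with sorted keys so the nesting is easy to read.
print JSON::PP->new->pretty->canonical->encode($data);

# Hypothetical helper: return an indented outline of a decoded structure.
# For arrays only the first element is described.
sub outline {
    my ( $node, $indent ) = @_;
    $indent //= '';
    if ( ref $node eq 'ARRAY' ) {
        my $out = $indent . 'array of ' . scalar(@$node) . "\n";
        $out .= outline( $node->[0], "$indent  " ) if @$node;
        return $out;
    }
    if ( ref $node eq 'HASH' ) {
        return join '', map {
            ref $node->{$_}
                ? "$indent$_:\n" . outline( $node->{$_}, "$indent  " )
                : "$indent$_: " . ( $node->{$_} // 'null' ) . "\n"
        } sort keys %$node;
    }
    return $indent . ( $node // 'null' ) . "\n";
}

print outline($data);

# Once the shape is known, individual values are plain Perl lookups.
print "first device: $data->[0]{deviceCode}\n";
```

A JSON array becomes a Perl array reference and a JSON object becomes a hash reference, so after decoding, the usual `->[...]` and `->{...}` lookups apply.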
Re^4: Parsing a large html with perl [JSON Tips] by kcott (Archbishop) on Jun 06, 2020 at 09:04 UTC
Re: Parsing a large html with perl by jo37 (Deacon) on Jun 02, 2020 at 20:15 UTC
Just don't use a regex for HTML parsing. See Why a regex really isn't good enough for HTML and XML, even for "simple" tasks.

Greetings,
-jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`