PerlMonks  

Re: Parsing a large html with perl

by haukex (Archbishop)
on Jun 02, 2020 at 21:23 UTC ( [id://11117612]=note )


in reply to Parsing a large html with perl

Welcome to the Monastery, zesys!

The top of that page says:

The following is dynamic list of all of the deployments that have data. It is being pulled from the deployments web service using the URL https://data.oceannetworks.ca/api/deployments?method=get&token=[YOUR_TOKEN_HERE]

Why don't you just use that API?

Anyway, if you need to parse HTML, then don't use regular expressions. Here's an example with Mojo::DOM:

use warnings;
use strict;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new( max_redirects=>3 );
my $dom = $ua->get( 'https://wiki.oceannetworks.ca/display/O2A/Available+Deployments' )->result->dom;
$dom->find('.confluenceTable tr')->each(sub {
    my $tr = shift;
    my ($locationCode, $deviceCode, $dateFrom, $dateTo) =
        map { $tr->find(".confluenceTd:nth-of-type($_)")
            ->map('all_text')->join } 1..4;
    print "locationCode=$locationCode, deviceCode=$deviceCode, ",
        "dateFrom=$dateFrom, dateTo=$dateTo\n";
});

Re^2: Parsing a large html with perl
by zesys (Novice) on Jun 04, 2020 at 04:42 UTC

    Thanks so much @haukex. I have added two lines of code to yours (had two questions), and problem solved!

    Regarding the API, I use the service via client libraries written for Python almost every day. I just wanted to do things differently this time by using Perl, for which the organisation does not seem to have a client library.

    Thank you all for your prompt answers and suggestions!!

      You don't need them to provide a client library in Perl; writing your own is reasonably straightforward. The advantage of using their API is that, generally speaking, an API is less susceptible to change than a web page. Super Search for mojo api will find results to get you started.

Re^2: Parsing a large html with perl
by perlfan (Vicar) on Jun 03, 2020 at 04:17 UTC
    OP, please do use the URL at https://wiki.oceannetworks.ca/display/O2A/API+Reference that haukex pointed out.
    • it's an HTTP::Tiny call away! (hopefully an https URL is available)
    • it's JSON!
    • you'll learn a lot and be glad you did

    Note:

    If you do it right, you could get a Perl client listed in there. Also, see if it'll accept the query string via POST body; if it does, be sure to set the Content-Type header of the request to application/x-www-form-urlencoded. The reason is that a token sent in the URL of a GET request is gonna get logged everywhere (server access logs and the like), and https doesn't stop that, since it only protects the URL in transit. Sometimes endpoints will accept a POST just the same as a GET. If it's plain http, then sending the token via POST (if it's accepted) will at least keep your URL from getting logged everywhere with that token in it.
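The POST suggestion above could be sketched roughly as follows, using only core modules (HTTP::Tiny and JSON::PP). This is a minimal sketch under assumptions, not a tested client: the base URL is taken from the wiki text quoted earlier in the thread, whether the endpoint accepts POST at all is unverified, the ONC_TOKEN environment variable is a hypothetical name chosen for illustration, and nothing is assumed about the shape of the JSON that comes back. Note that HTTP::Tiny needs IO::Socket::SSL and Net::SSLeay installed for https URLs.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Base URL from the wiki page quoted above. Whether it accepts POST
# is an assumption you would need to test yourself.
my $API_URL = 'https://data.oceannetworks.ca/api/deployments';

sub fetch_deployments {
    my ($token) = @_;
    my $http = HTTP::Tiny->new( timeout => 30 );

    # post_form() URL-encodes the hash into the request body and sets
    # Content-Type: application/x-www-form-urlencoded automatically,
    # so the token stays out of the URL.
    my $res = $http->post_form( $API_URL, { method => 'get', token => $token } );
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};

    return decode_json( $res->{content} );
}

# Only touch the network when a token is supplied through the
# (hypothetical) ONC_TOKEN environment variable.
if ( my $token = $ENV{ONC_TOKEN} ) {
    my $data = fetch_deployments($token);
    print "Decoded a ", ( ref($data) || 'scalar' ), " from the API\n";
}
```

If the endpoint turns out to reject POST, swapping post_form() for get() with the same parameters appended to the URL as a query string is the fallback, at the cost of the token appearing in the URL again.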

    If you insist on parsing the HTML and it really is just a large simple table, take a look at HTML::TableExtract.

      Usually makes more sense to reply to OP if that is who you are addressing. Your advice assumes they have API access, which may not be the case. The Mojo solution provided can deal just as easily with a JSON response as the HTML.

      Thanks @perlfan. I will try your first suggestion. I admit, as a non-developer, I often find it a daunting task making sense of a JSON response.

        G'day zesys,

        Welcome to the Monastery.

        "... I often find it a daunting task making sense of a JSON response."

        You don't say what aspects of this you find daunting. Here's a few tips.

        JSON is often presented as a single string many hundreds or thousands of characters long. I typically find this impossible to read at a glance; no doubt, you do too. The solution is to format that string into a more humanly readable structure. I use "JSON Formatter and Validator" for this; if you don't like that one, there are many others available, so just search for something that better suits you.

        Now that you have a readable structure, just think of each ':' as a '=>' and you have a Perl hashref. That's a slight oversimplification but, in nearly all cases, it will hold true.

        # JSON:
        {
            "string" : "value",
            "array"  : [ 1, 2, 3 ],
            "hash"   : { "key1" : "val1", "key2" : "val2" }
        }

        # Perl:
        {
            "string" => "value",
            "array"  => [ 1, 2, 3 ],
            "hash"   => { "key1" => "val1", "key2" => "val2" }
        }
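To make the correspondence concrete, here is a small sketch that decodes the JSON above with the core JSON::PP module and then reads values out of the resulting hashref. The variable names are just for illustration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# The same JSON as in the comparison above, as a single string.
my $json = '{ "string" : "value", "array" : [ 1, 2, 3 ], "hash" : { "key1" : "val1", "key2" : "val2" } }';

# decode_json() returns a reference to an ordinary Perl data structure.
my $data = decode_json($json);

print "string:     $data->{string}\n";        # top-level scalar
print "array[1]:   $data->{array}[1]\n";      # second array element
print "hash->key2: $data->{hash}{key2}\n";    # nested hash value
print "array size: ", scalar @{ $data->{array} }, "\n";
```

Once the data is decoded, nothing about it is JSON-specific any more; it can be walked with the usual hash and array operations.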

        The JSON syntax is actually very simple. It's described, clearly and succinctly, in "Introducing JSON".

        If you're not completely familiar with hashrefs, take a look at the Hashes section of "perlintro: Perl variable types". That section (indeed, the entirety of the perlintro page) is peppered with links to more detailed descriptions, additional information, and more advanced, related topics: don't be put off by the idea that this page is just an introduction for complete novices.

        There's also a few gotchas which may not be immediately obvious; in some cases, they're highly unintuitive. Here's a couple that have tripped me up in the past:

        • Valid JavaScript is not necessarily valid JSON. Strings in JSON must be delimited by double-quotes, so { "answer": 42 } is valid in both. These, however, are valid in JavaScript but not in JSON: { 'answer': 42 } and { answer: 42 }.
        • In Perl, the final element in a list may be optionally followed by a comma; in JSON, that final comma is not allowed. So, [ 1, 2, 3 ] is valid in both; however, [ 1, 2, 3, ] is valid Perl but invalid JSON.
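Both gotchas can be checked with a few lines of Perl. The sketch below feeds each of the examples from the list to JSON::PP in its default (strict) mode; is_valid_json is a helper name made up for this illustration.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Returns 1 if the text parses as JSON, 0 if decode_json() dies on it.
sub is_valid_json {
    my ($text) = @_;
    return eval { decode_json($text); 1 } ? 1 : 0;
}

my @examples = (
    '{ "answer": 42 }',    # double-quoted key
    "{ 'answer': 42 }",    # single-quoted key
    '{ answer: 42 }',      # bare key
    '[ 1, 2, 3 ]',         # no trailing comma
    '[ 1, 2, 3, ]',        # trailing comma
);

for my $text (@examples) {
    printf "%-18s %s\n", $text,
        is_valid_json($text) ? 'valid JSON' : 'not valid JSON';
}
```

JSON::PP does have options that relax these rules (allow_singlequote, allow_barekey and relaxed), but it is usually better to fix the input than to loosen the parser.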

        — Ken
