PerlMonks  

Re: Quick 'n dirty extraction of JSON from an HTML page

by davies (Prior)
on Mar 08, 2021 at 20:30 UTC


in reply to Quick 'n dirty extraction of JSON from an HTML page

I think it would help to split your problem space conceptually between the scraping and the parsing. As far as scraping is concerned, Selenium is a very good tool for automating multiple browsers and testing against them. If all you need is a single browser, look at, say, WWW::Mechanize::Chrome. But do you actually need a browser? If not, LWP is probably all you need. And Dave Cross is the publisher of that book, not the author.

On to the parsing: I have tried a cut-down version of your JSON. My code is:

use strict;
use warnings;
use JSON::PP;
use Data::Dumper;

# Single-quoted heredoc, so Perl does not interpolate "$(" in the JavaScript.
my $scrape = <<'EOF';
<script>
$(function () {
var opportunity = new US.Opportunity.CandidateOpportunityDetail({"Id":"10eb1d6c-359b-4f10-84d0-ca2525d88cce","Title":"Relationship Manager","Featured":false,"FullTime":true,"HoursPerWeek":null,"JobCategoryName":"Qualified Client Services","Locations":[{"Id":"dd1188b1-18d2-5e8d-9f93-aadbe1a3fd22","LocalizedName":"CA-Remote","LocalizedLocationId":null,"LocalizedDescription":"CA - Remote"}]
});
EOF

# Greedy capture of everything from the first "({" to the last "})".
$scrape =~ m/\((\{.*\})\)/gms or die "No JSON found";
my $json = $1;
my $ref = decode_json $json;
print Dumper $ref;

Does that give you what you need? If not, you may need to specify your problem more clearly.

Regards,

John Davies

Re^2: Quick 'n dirty extraction of JSON from an HTML page
by davebaker (Pilgrim) on Mar 08, 2021 at 22:20 UTC
    Yes, it certainly does give me what I need. Thanks, John!

    Some of the JavaScript seems to be using key/value specifications that aren't valid JSON because the keys aren't quoted strings, e.g.

    var renderer = new US.Opportunity.OpportunityRenderViewModel({ opportunity: opportunity, currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false });
    ... so I changed the regular expression to be
    m/\((\{".*?\})\)/gms
    (throwing in a leading quotation mark, in order to find only JSON that has a quoted initial key).

    I also played with the possibility that the HTML page would contain more than one block of JSON, and changed your code to be

    # Iterate over the captures themselves: inside a for loop over a
    # list-context match, $1 holds only the last capture.
    for my $json ( $scrape =~ m/\((\{".*?\})\)/gms ) {
        my $ref = decode_json $json;
        print Dumper $ref;
    }
    ...so as to find and print for me each of multiple JSON blocks (not shown here). Love it!
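As a rough sketch of the multiple-block idea above, the same regular expression can be walked one match at a time with a `while` loop, where `$1` does refer to the current match. The helper name `extract_json_blocks` and the sample HTML (`Foo`, `Bar`, and their contents) are made up for illustration and are not from the page being scraped.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper (not from the thread): returns one decoded
# structure per JSON-looking block found in $html.
sub extract_json_blocks {
    my ($html) = @_;
    my @decoded;

    # In scalar context, /g advances through the string one match at a
    # time, so $1 is the capture for the current iteration.
    while ( $html =~ m/\((\{".*?\})\)/gms ) {
        my $json = $1;
        my $ref  = eval { decode_json($json) };
        push @decoded, $ref if defined $ref;    # skip blocks that fail to parse
    }
    return @decoded;
}

# Invented sample input: two JSON blocks and one unquoted-key object.
my $html = <<'EOF';
<script>
var a = new Foo({"Id":"1","Title":"First"});
var r = new Bar({ opportunity: a, isViewingInternal: false });
var b = new Foo({"Id":"2","Tags":["x","y"]});
</script>
EOF

my @blocks = extract_json_blocks($html);
print scalar(@blocks), " block(s) found\n";
print "Id: $_->{Id}\n" for @blocks;
```

The leading quotation mark in the pattern skips the `Bar` object, as intended. One caveat of the non-greedy match is that it stops at the first `})` it sees, so a string value that happens to contain `})` would truncate the capture and the block would be skipped by the `eval`.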

      Consider using the original regexp, which doesn't require keys to be quoted, and parsing the JSON using Cpanel::JSON::XS with relaxed mode turned on.

      JavaScript objects can of course still include values which cannot be encoded into JSON, for example:

      var obj = { "some_key": Date.now(), "other_key": function () { console.log("Hello world"); } };

      So if your JavaScript objects contain things like this, you'll be out of luck. You might want to wrap your JSON decoding in try/catch or eval.
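A minimal sketch of that advice, with one substitution: it uses the core module JSON::PP with its `allow_barekey` option instead of Cpanel::JSON::XS, purely so the example has no non-core dependencies. The helper name `try_decode` is hypothetical, and the two input strings are trimmed-down versions of the snippets quoted in this thread.

```perl
use strict;
use warnings;
use JSON::PP;

# allow_barekey is a JSON::PP extension that accepts unquoted object keys.
# (Swapped in here for Cpanel::JSON::XS, which the reply above recommends.)
my $decoder = JSON::PP->new->relaxed->allow_barekey;

# Hypothetical helper: returns the decoded structure, or undef if the
# text cannot be parsed. decode() dies on bad input, so eval traps it.
sub try_decode {
    my ($text) = @_;
    my $ref = eval { $decoder->decode($text) };
    return $ref;
}

# Unquoted keys with plain JSON values: decodes.
my $ok = try_decode(
    '{ currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false }'
);

# A function call as a value is not JSON in any mode: fails.
my $bad = try_decode('{ "some_key": Date.now() }');

print "bare keys decoded: $ok->{currentJobBoardId}\n" if $ok;
print "function call rejected\n" unless defined $bad;
```

Note that the `opportunity: opportunity` pair from the renderer snippet would still fail here, because its value is a JavaScript variable reference rather than a JSON value.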
