Re: Quick 'n dirty extraction of JSON from an HTML page

I think it would help to split your problem space conceptually between the scraping and the parsing. As far as scraping is concerned, Selenium is a very good tool for automating multiple browsers and testing against them. If all you need is a single browser, look at, say, WWW::Mechanize::Chrome. But do you actually need a browser? If not, LWP is probably all you need. And Dave Cross is the publisher of that book, not the author.
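
For the no-browser case, a minimal sketch of the fetch step with LWP might look like the following. It assumes LWP::UserAgent is installed and that the JSON is present in the HTML as served, rather than built later by client-side JavaScript. The helper name `fetch_html`, the 10-second timeout and the URL in the usage comment are placeholders, not anything from the original question.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch a page and return its body as a character string.
sub fetch_html {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    return $res->decoded_content;    # body with charset/encoding decoded
}

# Usage (placeholder URL):
# my $scrape = fetch_html('https://example.com/jobs/123');
```

The returned string can then be handed to whatever does the parsing.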

On to the parsing: I have tried a cut-down version of your JSON. My code is:

use strict;
use warnings;
use JSON::PP;
use Data::Dumper;

# Single-quoted heredoc, so Perl does not interpolate "$(" (the GID variable).
my $scrape = <<'EOF';
<script>

    $(function () {

        var opportunity = new
        US.Opportunity.CandidateOpportunityDetail({"Id":"10eb1d6c-359b-4f10-84d0-ca2525d88cce","Title":"Relationship Manager","Featured":false,"FullTime":true,"HoursPerWeek":null,"JobCategoryName":"Qualified Client Services","Locations":[{"Id":"dd1188b1-18d2-5e8d-9f93-aadbe1a3fd22","LocalizedName":"CA-Remote","LocalizedLocationId":null,"LocalizedDescription":"CA - Remote"}]
    });
EOF
# Capture everything from the first "({" to the last "})"; /s lets "." match newlines.
my ($json) = $scrape =~ m/\((\{.*\})\)/s
    or die "No JSON found in the scrape\n";
my $ref = decode_json $json;
print Dumper $ref;

Does that give you what you need? If not, you may need to specify your problem more clearly.

Regards,

John Davies


Re^2: Quick 'n dirty extraction of JSON from an HTML page by davebaker (Pilgrim) on Mar 08, 2021 at 22:20 UTC
Yes, it certainly does give me what I need. Thanks, John! Some of the JavaScript uses key/value specifications that aren't valid JSON because the keys aren't quoted strings, e.g. `var renderer = new US.Opportunity.OpportunityRenderViewModel({ opportunity: opportunity, currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false });` ... so I changed the regular expression to `m/\((\{".*?\})\)/gms` (throwing in a leading quotation mark, in order to find only JSON that has a quoted initial key). I also played with the possibility that the HTML page would contain more than one block of JSON, and changed your code to be `for my $json ( $scrape =~ m/\((\{".*?\})\)/gms ) { my $ref = decode_json $json; print Dumper $ref; }` ...so as to find and print for me each of multiple JSON blocks (not shown here). Love it!
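
As a rough illustration of the multi-block idea, the sketch below walks a page fragment with a `while` loop and a non-greedy version of the pattern that requires a quoted first key. The fragment and the `US.Example.*` names are invented for the example. In scalar context `/g` resumes after the previous match, so `$1` holds a different block on each pass.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical page fragment with two constructor calls, for illustration only.
my $scrape = <<'EOF';
<script>
    var a = new US.Example.First({"Id":"one","Featured":false});
    var b = new US.Example.Second({"Id":"two","Tags":["x","y"]});
</script>
EOF

my @blocks;
# Non-greedy ".*?" stops at the first "})" after each "({", one block per match.
while ( $scrape =~ m/\((\{".*?\})\)/gs ) {
    my $json = $1;                      # copy the capture before calling other code
    push @blocks, decode_json($json);
}

printf "%d block(s); ids: %s\n", scalar @blocks,
    join ', ', map { $_->{Id} } @blocks;
```

One limitation to keep in mind: because the match ends at the first `})`, a JSON string value that itself contains `})` would cut the block short and make `decode_json` die.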
Re^3: Quick 'n dirty extraction of JSON from an HTML page by tobyink (Canon) on Mar 09, 2021 at 14:29 UTC
Consider using the original regexp, which doesn't require keys to be quoted, and parsing the JSON using Cpanel::JSON::XS with relaxed mode turned on. JavaScript objects can of course still include values which cannot be encoded as JSON, for example: `var obj = { "some_key": Date.now(), "other_key": function () { console.log("Hello world"); } };` So if your JavaScript objects contain things like this, you'll be out of luck. You might want to wrap your JSON decoding in `try`/`catch` or `eval`.
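
A minimal sketch of the "lenient decode plus `eval`" idea follows. Note the substitution: it uses the core JSON::PP module with its `allow_barekey` option instead of Cpanel::JSON::XS in relaxed mode, purely so that it runs without installing anything. The helper name `try_decode` and the second sample input are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP;

# allow_barekey is a JSON::PP extension that accepts unquoted object keys.
my $decoder = JSON::PP->new->allow_barekey;

# Hypothetical helper: returns the decoded structure, or undef if decoding dies.
sub try_decode {
    my ($text) = @_;
    my $data = eval { $decoder->decode($text) };
    return $data;    # undef on failure; the reason is left in $@
}

# Bare keys with ordinary JSON values: this decodes.
my $ok = try_decode(
    '{ currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false }'
);

# A value that is a JavaScript expression: decode dies and eval catches it.
my $bad = try_decode('{ "some_key": Date.now() }');

print defined $ok  ? "first block decoded\n"  : "first block skipped\n";
print defined $bad ? "second block decoded\n" : "second block skipped\n";
```

Blocks that fail to decode can simply be skipped in the extraction loop, which keeps one unparseable JavaScript object from aborting the whole run.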