davebaker has asked for the wisdom of the Perl Monks concerning the following question:
I would like to be able to extract the principal information from each of several web pages that are job openings (one per page) on a particular employer's career site. Each page is created by a combination of a JavaScript front end and certain JSON information that is embedded in the page. Once I can extract the JSON, I think I can use one of the many CPAN JSON modules to turn the JSON into Perl data structures I can use to reformat the data for each job. Basically, I'd scrape each job for repurposing, with the employer's permission.
The page that contains links to each of the job openings is here: https://recruiting.ultipro.com/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345/?q=&o=postedDateDesc&w=&wc=&we=&wpst=
Each page containing a particular job opening is produced by clicking on a job title on that page.
So an example of the JSON data that I'd like to munge is this excerpt from one such job page (not the page that lists all of the jobs):
<script>
$(function () {
var opportunity = new US.Opportunity.CandidateOpportunityDetail({"Id":"10eb1d6c-359b-4f10-84d0-ca2525d88cce","Title":"Relationship Manager","Featured":false,"FullTime":true,"HoursPerWeek":null,"JobCategoryName":"Qualified Client Services","Locations":[{"Id":"dd1188b1-18d2-5e8d-9f93-aadbe1a3fd22","LocalizedName":"CA - Remote","LocalizedLocationId":null,"LocalizedDescription":"CA - Remote","Address":{"Line1":null,"Line2":null,"City":"Walnut Creek","State":{"Name":"California","Code":"CA"},"PostalCode":null,"Country":{"Id":"ab896de2-c528-41b0-90a7-5eed39797103","Name":"United States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":false,"DisplayDescription":true,"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinates":{"Longitude":-120.9614611155792,"Latitude":37.584818420647},"Shapes":null,"SourceOfTruth":1,"IsAvailableForOpportunities":true},{"Id":"1945a6cf-0d3b-5b2b-a7bf-dd8dbb9a7b53","LocalizedName":"CA - Folsom","LocalizedLocationId":null,"LocalizedDescription":"CA - Folsom","Address":{"Line1":"35 Iron Point Circle","Line2":"Suite 300","City":"Folsom","State":{"Name":"California","Code":"CA"},"PostalCode":"95630","Country":{"Id":"ab896de2-c528-41b0-90a7-5eed39797103","Name":"United States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":false,"DisplayDescription":true,"DisplayAddress":true,"DisplayStreetAddress":false,"Coordinates":{"Longitude":-121.14320436989884,"Latitude":38.643310785875464},"Shapes":null,"SourceOfTruth":1,"IsAvailableForOpportunities":true},{"Id":"ab91588e-c732-56b4-9671-e5daab085388","LocalizedName":"CA - Los Angeles","LocalizedLocationId":null,"LocalizedDescription":"CA - Los Angeles Wilshire","Address":{"Line1":"12424 Wilshire Blvd.","Line2":"Suite 870","City":"Los Angeles","State":{"Name":"California","Code":"CA"},"PostalCode":"90025","Country":{"Id":"ab896de2-c528-41b0-90a7-5eed39797103","Name":"United States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":false,"DisplayDescription":true,"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinates":{"Longitude":-118.47060174630806,"Latitude":34.041507422395},"Shapes":null,"SourceOfTruth":1,"IsAvailableForOpportunities":true},{"Id":"dadf3d11-17f2-5753-b719-3291aeeccc69","LocalizedName":"CA - Fresno","LocalizedLocationId":null,"LocalizedDescription":"CA - Fresno","Address":{"Line1":"7519 North Ingram Avenue","Line2":"Suite 106","City":"Fresno","State":{"Name":"California","Code":"CA"},"PostalCode":"93711","Country":{"Id":"ab896de2-c528-41b0-90a7-5eed39797103","Name":"United States","Code":"USA"}},"DisplayName":true,"DisplayLocationId":false,"DisplayDescription":true,"DisplayAddress":false,"DisplayStreetAddress":false,"Coordinates":{"Longitude":-119.80186387305908,"Latitude":36.846081098189643},"Shapes":null,"SourceOfTruth":1,"IsAvailableForOpportunities":true}],"PostedDate":"2021-03-03T16:52:20.236Z","UpdatedDate":"2021-03-03T16:52:56.265Z","RequisitionNumber":"RELAT03025","Description":"\u003cp\u003e\u003cstrong\u003e\u003cem\u003eWho We Are\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNewport helps companies offer their associates a more secure financial future through retirement plans, insurance and consulting services. Newport offers comprehensive plan solutions and consulting expertise to plan sponsors and the advisors who serve them. As a provider and partner, Newport is independent, experienced and responsive.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJob Description\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eProvides pro-active service and communications to retirement plan clients. This includes providing client support, documentation and record keeping, preparation of plan statements, communication of plan information to client, and assists with the modification and enhancement of plan administration processes, within the limits of established policy.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEssential Functions \u003c/strong\u003e\u003cem\u003eReasonable accommodations may be made to enable individuals with disabilities to perform these essential functions\u003c/em\u003e.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eProvides support to clients through a number of channels including phone, letters and emails to quickly resolve the request\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eActs in a pro-active manner with assigned clients and advisors to ensure retention as well as inspire client dedication and engagement to develop positive relationships\u003c/li\u003e\n\u003cli\u003eResponsible for interpreting plan documents for client plan administration.\u0026nbsp;\u003c/li\u003e\n\u003cli\u003eProvides calculations and amounts to plan sponsors, communicates fund actions, consults with clients to answer inquiries, researches and resolves issues, provides legal updates, and responds to requests for specialized reports\u003c/li\u003e\n\u003cli\u003eAssists plan sponsor and intermediaries on the utilization of web-based applications and delivers web demonstrations for financial advisors and plan sponsors.\u003c/li\u003e\n\u003cli\u003eWorks with clients to correct and fund payroll items and manages distribution requests.\u003c/li\u003e\n\u003cli\u003eCoordinates plan compliance testing with the compliance team.\u003c/li\u003e\n\u003cli\u003eParticipates in sales finals presentations and promotes cross-sell opportunities as needed\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSupervisory Responsibilities (none)\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRequired Education, Experience and Certificates, Licenses, Registrations \u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBachelor\u0026rsquo;s degree in business related filed or combination of education and industry experience\u003c/li\u003e\n\u003cli\u003e3-5 years of total experience in Retirement Services, with emphasis in the daily 401(k) environment, 403b or IRA areas\u003c/li\u003e\n\u003cli\u003eStrong MS Office Skills with an emphasis in Excel\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003ePreferred (but not required) education or skills for this role are\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003ePreferred ASPPA or CEBS\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompetencies \u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eThrives in a fast-paced environment\u003c/li\u003e\n\u003cli\u003eEmbraces personal growth and wants to be challenged in deadline-driven and multi-component environment\u003c/li\u003e\n\u003cli\u003eExcellent communication skills both written and verbal\u003c/li\u003e\n\u003cli\u003eBuilds collaborative relationships\u003c/li\u003e\n\u003c/ul\u003e\n\u003cul\u003e\n\u003cli\u003eEffective time management and organization skills\u003c/li\u003e\n\u003cli\u003eDemonstrates initiative\u003c/li\u003e\n\u003cli\u003eForward thinking\u003c/li\u003e\n\u003cli\u003eFosters teamwork\u003c/li\u003e\n\u003cli\u003eResults drive/oriented\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTRAVEL:\u0026nbsp; 10\u003c/strong\u003e%.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOTHER DUTIES\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePlease note this job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Duties, responsibilities and activities may change at any time with or without notice.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e\u003cstrong\u003eEQUAL OPPORTUNITY EMPLOYER\u003c/strong\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003eNewport offers for employment are conditioned upon satisfactory completion of our employment screening process (including, but not limited to, a review of past employment and education records, background investigation, and/or credit check and fingerprints.)\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNewport unequivocally rejects racism and discrimination of any kind and fosters an environment of belonging to provide access and opportunity for all.\u0026nbsp; As an Equal Opportunity Employer we do not discriminate on the basis of race, religion, color, sex, sexual orientation, gender identify, gender expression, national origin, age, non-disqualifying physical or mental disability, veteran status, or any other basis covered by applicable law. \u0026nbsp;All employment is decided on the basis of qualifications, merit, and business need. \u003c/p\u003e","EqualOpportunityEmployerDescription":null,"PayTransparencyPolicyStatement":null,"MatchScore":1.0,"HasApplied":false,"ApplicationJobBoardName":null,"ApplicationJobBoardId":null,"DateApplied":null,"Salaried":true,"CompensationAmount":null,"PublishingStatus":1,"Links":[],"BehaviorCriteria":[],"MotivationCriteria":[],"EducationCriteria":[],"LicenseAndCertificationCriteria":[],"SkillCriteria":[],"WorkExperienceCriteria":[],"JobBoardMemberships":[{"JobBoardId":"6489e35d-ba29-b1c3-92d3-acb1a86c1453","PublishedInternal":true,"PublishedExternal":false,"ExternalPostedDate":null,"InternalPostedDate":"2021-03-05T23:08:36.109Z"},{"JobBoardId":"6162c253-9d81-da08-c252-d43d2fcb8345","PublishedInternal":true,"PublishedExternal":true,"ExternalPostedDate":"2021-03-05T23:08:36.109Z","InternalPostedDate":"2021-03-05T23:08:36.109Z"}],"AssessmentUri":null,"AssessmentStatus":null,"OpportunityIsClosed":false,"TravelRequired":null,"TravelDescription":null,"SupervisorName":null,"Assessments":[],"ApplicationId":null,"CompensationAnnualMinimum":null,"CompensationAnnualMaximum":null,"CompensationHourlyMinimum":null,"CompensationHourlyMaximum":null,"CompensationCurrency":null});
var applicantSourceId = null;
if (applicantSourceId) { US.utils.sessionStorage.setItem("applicantSourceId", applicantSourceId); }
var renderer = new US.Opportunity.OpportunityRenderViewModel({ opportunity: opportunity, currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345", isViewingInternal: false });
US.CurrentOpportunityDetailViewModel = new US.Opportunity.OpportunityDetailViewModel({
currentJobBoardId: "6162c253-9d81-da08-c252-d43d2fcb8345",
opportunity: opportunity,
renderer: renderer,
candidatePresenceState: null,
opportunityApplyRedirectUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345/Account/Register?redirectUrl=%2FNEW1020%2FJobBoard%2F6162c253-9d81-da08-c252-d43d2fcb8345%2FOpportunityApply%3FopportunityId%3D10eb1d6c-359b-4f10-84d0-ca2525d88cce\u0026cancelUrl=%2FNEW1020%2FJobBoard%2F6162c253-9d81-da08-c252-d43d2fcb8345%2FOpportunityDetail%3FopportunityId%3D10eb1d6c-359b-4f10-84d0-ca2525d88cce",
opportunityApplyOnBehalfRedirectUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345/Recruiter/Candidates",
opportunitiesUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345",
tenantAlias: "NEW1020",
featureConfigurationGroups: [{"Id":"001605e9-e513-bcd7-6a05-b020c4e16539","Name":"Recruitment.OpportunityManagement.PublishingAndJobBoards","Features":[{"Name":"FeaturedOpportunities","Enabled":true,"HelpTooltipMessageKey":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"Approvals","Enabled":false,"HelpTooltipMessageKey":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"Parallel","Enabled":false,"HelpTooltipMessageKey":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"IncludeHiringManagersInOnboardingOwnerField","Enabled":true,"HelpTooltipMessageKey":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"FTE","Enabled":false,"HelpTooltipMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PublishingAndJobBoards.FTEHelpTooltip","TurnOffWarningMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PublishingAndJobBoards.FTEDisableWarningMessage","ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"Evergreen","Enabled":true,"HelpTooltipMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PublishingAndJobBoards.EvergreenHelpTooltip","TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"IncludeHiringManagersInRecruiterField","Enabled":false,"HelpTooltipMessageKey":null,"TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null}]},{"Id":"772f9900-a307-4d31-b15e-9e9052f1c897","Name":"Recruitment.OpportunityManagement.PageFeatures","Features":[{"Name":"PersonalizedJobSearch","Enabled":false,"HelpTooltipMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PageFeatures.PersonalizedJobSearchTooltip","TurnOffWarningMessageKey":null,"ConsentMessageKey":null,"ConsentTitleKey":null,"ToggleableFeature":null},{"Name":"JobSearchAgent","Enabled":true,"HelpTooltipMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PageFeatures.JobSearchAgentTooltip","TurnOffWarningMessageKey":null,"ConsentMessageKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PageFeatures.JobSearchAgentConsentMessage","ConsentTitleKey":"RecruitmentAdministrator.FieldConfigurationManager.FeatureConfiguration.Recruitment.OpportunityManagement.PageFeatures.JobSearchAgentConsentTitle","ToggleableFeature":null}]}],
linkedInRedirectUrl: "https://recruiting.ultipro.com/NEW1020/Opportunity/ApplyWithLinkedIn?jobBoardId=6162c253-9d81-da08-c252-d43d2fcb8345\u0026opportunityId=10eb1d6c-359b-4f10-84d0-ca2525d88cce",
currentUserRequiresReconsent: false,
userIsRecOrHM: false,
loggedInPersonName: "",
assessmentsUrl: "/NEW1020/JobBoard/6162c253-9d81-da08-c252-d43d2fcb8345/ApplicationAssessments"
});
});
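Worth noting about the excerpt: the Description value is HTML whose angle brackets and ampersands are written as \u003c, \u003e and \u0026, and any JSON decoder should undo those escapes as part of normal decoding. Here is a minimal sketch of the "turn the JSON into Perl data structures" step using the core JSON::PP module. The sample string is a shortened, made-up stand-in with the same shape as the excerpt, not the real payload, and the summary format is just an illustration:

```perl
use strict;
use warnings;
use JSON::PP;

# Shortened, made-up sample in the same shape as the excerpt above.
# The quoted heredoc keeps the backslashes literal for the JSON parser.
my $json = <<'JSON';
{"Title":"Relationship Manager","FullTime":true,
 "Locations":[{"LocalizedName":"CA - Remote"},{"LocalizedName":"CA - Folsom"}],
 "Description":"\u003cp\u003eWho We Are\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e"}
JSON

my $job = JSON::PP->new->decode($json);

# \u003c, \u003e and \u0026 come back as <, > and &, so this prints plain HTML.
print $job->{Description}, "\n";

# JSON true/false decode to objects that behave as booleans in Perl.
my $hours = $job->{FullTime} ? 'full time' : 'part time';

# Collect the display name of every location into one comma-separated string.
my $where = join ', ', map { $_->{LocalizedName} } @{ $job->{Locations} };

print "$job->{Title} ($hours): $where\n";
# prints: Relationship Manager (full time): CA - Remote, CA - Folsom
```

The decoded Description is still HTML (with entities such as &nbsp;), so it would need a further pass if plain text is the goal.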
I wonder if other Monks are using such a technique to scrape embedded JSON data from a web page.
I see many JSON modules on CPAN, but I'm not finding any that will take ugly HTML and filter it for the embedded JSON.
I see Randal Schwartz' marvelous regex-that-would-be-king that would seem to meet my need at https://perlmonks.org/?node_id=995856, and perlancar's use of it to make the JSON::Decode::Regex module, but I haven't been able to make 'em work. I can provide details here if you'd like, but I'll skip them because I recognize how brittle the regex approach must be. (But if I'm wrong and it's worth pursuing, please let me know.)
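One way to avoid a regex that has to match the entire JSON value might be to use plain string search only to find where the object starts, and let a real parser decide where it ends. The following is a rough sketch, not a tested solution for the live site. It assumes the page source is already in a string and that the data is always passed to `US.Opportunity.CandidateOpportunityDetail(` as in the excerpt above; `extract_opportunity` is a hypothetical helper name, and the sample HTML is made up. It relies on `decode_prefix` from the core JSON::PP module, which parses one JSON value from the start of a string and ignores whatever follows:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: return the decoded opportunity hash, or undef
# if the constructor call does not appear in the page.
sub extract_opportunity {
    my ($html) = @_;

    # Assumption: the JSON always follows this exact constructor call.
    my $marker = 'US.Opportunity.CandidateOpportunityDetail(';
    my $start  = index $html, $marker;
    return if $start < 0;

    # Everything after the opening parenthesis: the JSON, then more JavaScript.
    my $tail = substr $html, $start + length $marker;

    # decode_prefix parses a single JSON value and returns it together with
    # the number of characters consumed; the trailing JavaScript is ignored.
    # It dies if the text at that position is not valid JSON.
    my ($data, $consumed) = JSON::PP->new->decode_prefix($tail);
    return $data;
}

# Tiny made-up stand-in for a real job page, same shape as the excerpt.
my $html = <<'HTML';
<script> $(function () { var opportunity = new US.Opportunity.CandidateOpportunityDetail({"Id":"abc","Title":"Relationship Manager","FullTime":true,"Locations":[{"LocalizedName":"CA - Remote"}]}); var applicantSourceId = null; }); </script>
HTML

my $job = extract_opportunity($html);
print "$job->{Title}\n";
print "$job->{Locations}[0]{LocalizedName}\n";
```

The only brittle part left is the marker string: if the site renames the constructor, the helper returns undef instead of silently matching the wrong text.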
Moving on to what must be the "right" way to do it: it appears that I'd need to learn Selenium, so that it can run all the JavaScript and hand back a rendered HTML page that Mojo::DOM can parse, if I want to stick with Perl.
I also see many API software-as-a-service vendors -- a whole industry, practically -- where the vendors essentially have figured all this out, and are happy to extract data from web pages, turn it into JSON, and make it accessible, for a fee, via an API. That's another way to go, but I'd love to be able to do it myself, especially since I see some nice JSON data already hanging on the tree in my target HTML pages.
I also see articles for doing this sort of thing using Python and Node, but so far I haven't found a similarly comprehensive article using Perl. Two examples: https://levelupprogramming.net/how-to-scrap-data-from-javascript-based-website-using-python-selenium-and-headless-web-driver-531c7fe0c01f and https://dev.to/princepeterhansen/how-to-scrape-html-from-a-website-built-with-javascript-mjn
What do you think? Should I go with Selenium and Mojo::DOM? I also see Dave Cross' book on using Selenium and Perl, so I'd probably tap that as a resource.
Replies are listed 'Best First'.
Re: Quick 'n dirty extraction of JSON from an HTML page
by davies (Monsignor) on Mar 08, 2021 at 20:30 UTC
by davebaker (Pilgrim) on Mar 08, 2021 at 22:20 UTC
by tobyink (Canon) on Mar 09, 2021 at 14:29 UTC
Re: Quick 'n dirty extraction of JSON from an HTML page
by bliako (Abbot) on Mar 09, 2021 at 19:25 UTC