In reply to "perl Mojo DOM CSS syntax issues"
Scraping data from HTML and/or text is best avoided, but I have found times where it was the only way, or even where the cost of avoiding it was higher than the cost of maintaining it. For instance, I know of one company that provides their API for $40,000/year; the alternative was for users of that system to export reports of their own data and send them to us to scrape and import into our system, and it was actually much more economical to pay me for several dozen hours to build a really clever importer that identifies the data with heuristics (and to repair it a few times when the format changed) than to pay that annual API fee.
(It's been working now for about 6 years since the last time I needed to edit it, actually. Some of that project is now on CPAN as Data::TableReader and Data::TableReader::Decoder::HTML, though that relies on HTML table elements and it looks like you need to match DIVs.)
So. Supposing that you have a legitimate case of really needing to scrape the data, here's my advice:
The idea is to isolate the ugly, high-maintenance piece of the script into its own module and give it a unit test, so that later you can quickly see what changed and get it working again without the top-level script getting in your way. It's also much easier to explain the problem to a new developer when they can see and run the unit test.
Your module will look something like this:
package ScrapeMyData;
use Moo;
use Mojo::DOM;
use v5.36;

has input => ( is => 'ro', required => 1 );

=head2 parse

This method returns the extracted data from L</input>.  It first looks for
blah blah to identify the start of the data, then looks for blah blah blah
blah to identify the individual lines.  Unfortunately I couldn't identify a
pattern in the DIVs so I'm matching the class names and this is very likely
to break.

TODO: scan the file for the most common class="..." and then guess that that
class is the one used on data rows.

=cut

sub parse($self) {
   my $dom = Mojo::DOM->new( $self->input );
   ...  # return an arrayref of data
}

1;
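If it helps, here is a minimal sketch of what the body of parse (inside the module above) might look like, assuming the data rows are the <div class="toI8Rb OSrXXb usbThf"> elements shown in the test fixture below; the CSS selector and the regex that splits author from title are guesses for illustration, not something I've run against your real pages:

sub parse($self) {
   my $dom = Mojo::DOM->new( $self->input );
   my @rows;
   # Compound class selector: the div must carry all three class names.
   for my $div ($dom->find('div.toI8Rb.OSrXXb.usbThf')->each) {
      my $text = $div->all_text;
      # Assumption: each row reads like "Name, CRED - Role - Title ..."
      my ($author, $title) = $text =~ /^\s*([^,]+), .*? - .*? - (.+)/;
      push @rows, { author => $author, title => $title };
   }
   return \@rows;
}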
And the unit test:

use Test2::V0;
use v5.36;
use ScrapeMyData;

my $scraper= ScrapeMyData->new(input => <<~'END');
   <html>
   <head>...</head>
   <body>
   ...
   <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">
   Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are
   nanovesicles (30-200 nm) found in extracellular space of various cell
   types, and in biofluids; having diverse functions including intracellular ...
   </div></div>
   ...
   </body>
   </html>
   END

is( $scraper->parse,
   [ ...
     { author => 'Sam Namett',
       title  => 'Interventional Orthopedics ...',
     }
     ...
   ],
   'parse' );
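As for the TODO in the POD, the class-guessing heuristic could be as simple as tallying class attributes and picking the most frequent one. Here is a rough sketch, meant to live inside ScrapeMyData (which already loads Mojo::DOM and gets signatures from v5.36); the div[class] selector is an assumption about where the data lives:

# Tally class attributes across DIVs and assume the most frequent one
# marks a data row.
sub guess_row_class($self) {
   my $dom = Mojo::DOM->new( $self->input );
   my %count;
   $count{ $_->attr('class') }++ for $dom->find('div[class]')->each;
   my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
   return $best;   # e.g. "toI8Rb OSrXXb usbThf"
}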
Then back to the original script:

foreach (@files7) {
   print "Newparse2 == Parsing file: $_ \n";
   # path() here comes from Path::Tiny
   my $scraper= ScrapeMyData->new(input => path($_)->slurp_utf8);
   my $data= $scraper->parse;
   ...  # then do whatever the script does with the parsed data
}
Hope that helps.