In reply to "perl Mojo DOM CSS syntax issues"
Scraping data from HTML and/or text is best avoided, but I have found times where it was the only way, or even where the cost of avoiding it was higher than the cost of maintaining it. For instance, I know of one company that provides their API for $40,000/year; the alternative was for users of that system to export reports of their own data and send them to us to scrape and import into our system, and it was actually much more economical to pay me for several dozen hours to build a really clever importer that identifies the data with heuristics (and to repair it a few times when the format changed) than to pay that annual API fee.
(It's been working now for about 6 years since the last time I needed to edit it, actually. Some of that project is now on CPAN as Data::TableReader and Data::TableReader::Decoder::HTML, though that relies on HTML table elements and it looks like you need to match DIVs.)
So. Supposing that you have a legitimate case of really needing to scrape the data, here's my advice:
The idea is to isolate the ugly, high-maintenance piece of the script into its own module and give it a unit test, so that later you can quickly see what changed and get it working again without the top-level script getting in your way. It's also much easier to explain the problem to a new developer when they can see and run the unit test.
Your module will look something like this:
package ScrapeMyData;
use Moo;
use Mojo::DOM;
use v5.36;

has input => ( is => 'ro', required => 1 );

=head2 parse

This method returns the extracted data from L</input>.  It first looks for
blah blah to identify the start of the data, then looks for blah blah blah
blah to identify the individual lines.  Unfortunately I couldn't identify a
pattern in the DIVs so I'm matching the class names and this is very likely
to break.

TODO: scan the file for the most common class="..." and then guess that that
class is the one used on data rows.

=cut

sub parse($self) {
   my $dom = Mojo::DOM->new( $self->input );
   ...  # return an arrayref of data
}

1;
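If it helps, here is a minimal sketch of what the body of parse (inside the module above) might look like, assuming the data rows are the <div class="toI8Rb OSrXXb usbThf"> elements shown in the test fixture below; the CSS selector and the regex that splits author from title are guesses for illustration, not something I've run against your real pages:

sub parse($self) {
   my $dom = Mojo::DOM->new( $self->input );
   my @rows;
   # Compound class selector: the div must carry all three class names.
   for my $div ($dom->find('div.toI8Rb.OSrXXb.usbThf')->each) {
      my $text = $div->all_text;
      # Assumption: each row reads like "Name, CRED - Role - Title ..."
      my ($author, $title) = $text =~ /^\s*([^,]+), .*? - .*? - (.+)/;
      push @rows, { author => $author, title => $title };
   }
   return \@rows;
}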
And the unit test:

use Test2::V0;
use v5.36;
use ScrapeMyData;

my $scraper= ScrapeMyData->new(input => <<~'END');
   <html>
   <head>...</head>
   <body>
   ...
   <div class="JMWMJ"><div class="toI8Rb OSrXXb usbThf">
   Sam Namett, MD - Physician - Interventional Orthopedics ...Exosomes are
   nanovesicles (30-200 nm) found in extracellular space of various cell
   types, and in biofluids; having diverse functions including intracellular ...
   </div></div>
   ...
   </body>
   </html>
   END

is( $scraper->parse,
   [ ...
     { author => 'Sam Namett',
       title  => 'Interventional Orthopedics ...',
     }
     ...
   ],
   'parse' );
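As for the TODO in the POD, the class-guessing heuristic could be as simple as tallying class attributes and picking the most frequent one. Here is a rough sketch, meant to live inside ScrapeMyData (which already loads Mojo::DOM and gets signatures from v5.36); the div[class] selector is an assumption about where the data lives:

# Tally class attributes across DIVs and assume the most frequent one
# marks a data row.
sub guess_row_class($self) {
   my $dom = Mojo::DOM->new( $self->input );
   my %count;
   $count{ $_->attr('class') }++ for $dom->find('div[class]')->each;
   my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
   return $best;   # e.g. "toI8Rb OSrXXb usbThf"
}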
Then back to the original script:

foreach (@files7) {
   print "Newparse2 == Parsing file: $_ \n";
   # path() here comes from Path::Tiny
   my $scraper= ScrapeMyData->new(input => path($_)->slurp_utf8);
   my $data= $scraper->parse;
   ...  # then do whatever the script does with the parsed data
}
Hope that helps.