BrentD has asked for the wisdom of the Perl Monks concerning the following question:

I deal with a website that publishes documents that I need to retrieve and archive. The site is operated by a third party, so I have no control over the formatting of the page. The page used to simply have a list of links to the individual documents. I used WWW::Mechanize's find_all_links with a regex to pull the documents I need every evening.

Unfortunately, the web designer decided to get "clever" and moved the documents to a dropdown list that calls a bit of JavaScript that links to the file.

The source of the dropdown is like this:
<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
  <option selected="selected">Choose One...</option>
  <option value="docs/foreclosure/2018/June/Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018.pdf">Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/Lot 8 Blk 4 Revised Dyer Addition 6-5-2018.pdf">Lot 8 Blk 4 Revised Dyer Addition Rockdale June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/0.21 acre tract, Daniel Monroe Survey June 5, 2018.pdf">0.21 acre tract, Daniel Monroe Survey June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/Lot 8 Blk 3 Westwood Additiion Rockdale June 5, 2018.pdf">Lot 8 Blk 3 Westwood Additiion Rockdale June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/25 acre tract June 5, 2018.pdf">25 acre tract June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/Lot 1 Blk 121 Rockdale 6-5-2018.pdf">Lot 1 Blk 121 Rockdale June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/Lot 2 &amp; West half Lot 4, Bluebird Heights, Sec 1, Rockdale.pdf">Lot 2 &amp; West half Lot 4, Bluebird Heights, Sec 1, Rockdale June 5, 2018</option>
  <option value="docs/foreclosure/2018/June/6,300 square ft tract Daniel Monroe Survey 6-5-2018.pdf">6,300 square ft tract Daniel Monroe Survey June 5, 2018</option>
  <option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>
The Javascript it links to looks like this:
<script type="text/javascript">
function MM_jumpMenu(targ,selObj,restore){ //v3.0
  eval(targ+".location='"+selObj.options[selObj.selectedIndex].value+"'");
  if (restore) selObj.selectedIndex=0;
}
</script>
So, for example, I can open the first document in the list by going to http://www.website.com/docs/foreclosure/2018/June/Lots 17 & 18 Dyer Addition Rockdale June 5, 2018.pdf in my browser.

What I need is a way to pull all the option values from the <select> so I can prepend the website address and download the files. What is the best way to do this? My Google-fu is failing me. All I seem to be able to find is info on building list boxes.

Replies are listed 'Best First'.
Re: Scrape Select Options from Web Page
by haukex (Archbishop) on Jun 05, 2018 at 07:23 UTC

    There are several options for parsing HTML, for something like this it seems easiest to use Mojo::DOM:

    use Mojo::DOM;

    my $dom = Mojo::DOM->new($html);
    my $values = $dom->find('select#jumpMenu2 > option')
        ->map(attr => 'value')->compact;  # returns a Mojo::Collection
    for my $v (@$values) {
        print "<<$v>>\n";
    }

    Update: But it's also possible with WWW::Mechanize directly, see my other reply.

      Got this one to work. Thank you.
Re: Scrape Select Options from Web Page
by Corion (Patriarch) on Jun 05, 2018 at 08:22 UTC

    You could use one of the WWW::Mechanize backends that supports Javascript to trigger the download of the PDFs.

    Also, you could look at what ->current_form returns (an HTML::Form) and then look through the potential values of a field with the ->try_others method.

    Personally, I would just extract the complete <select> HTML using (for example) HTML::TreeBuilder::XPath and query that tree.
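
A minimal sketch of the HTML::TreeBuilder::XPath approach suggested above, assuming the page HTML is already in a string. The sample markup is a shortened copy of the dropdown from the question, and the helper name select_values is hypothetical, not something from the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened copy of the dropdown from the question (two documents only).
my $xpath_sample_html = <<'HTML';
<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
<option selected="selected">Choose One...</option>
<option value="docs/foreclosure/2018/June/25 acre tract June 5, 2018.pdf">25 acre tract June 5, 2018</option>
<option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>
HTML

# Hypothetical helper: return the value attribute of every <option>
# that sits directly inside the <select> with the given id.
sub select_values {
    my ($html, $select_id) = @_;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($html);
    my $xpath = sprintf '//select[@id="%s"]/option/@value', $select_id;
    my @values = $tree->findvalues($xpath);
    $tree->delete;    # release the parse tree
    return @values;
}

print "$_\n" for select_values($xpath_sample_html, 'jumpMenu2');
```

The "Choose One..." entry has no value attribute, so the XPath expression skips it without any extra filtering.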

Re: Scrape Select Options from Web Page
by NetWallah (Canon) on Jun 05, 2018 at 00:01 UTC
    This is essentially an XML data extraction exercise.

    Use a decent XML module that supports XPath, or something like XML::Twig, and this should take under 10 lines of code.

                    Memory fault   --   brain fried

Re: Scrape Select Options from Web Page
by haukex (Archbishop) on Jun 05, 2018 at 08:28 UTC
    I used WWW::Mechanize

    This works too:

    my @files = grep {!/choose one/i}
        $mech->form_number(1)->find_input('jumpMenu2')->possible_values;
      This isn't working. I think, possibly because the page designer doesn't have an explicit <form> tag in the page.
        I think, possibly because the page designer doesn't have an explicit <form> tag in the page.

        Yes, it does appear that WWW::Mechanize requires the <form> tags to recognize inputs. By your reply to my other solution, I assume you've already figured out that you can feed the HTML to Mojo::DOM with e.g. my $dom = Mojo::DOM->new($mech->content);.
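
Putting the pieces of this thread together, here is a rough sketch of the remaining step from the question: turning the extracted option values into full URLs. It assumes the base address http://www.website.com/ from the example in the question and uses the URI module to resolve each relative value. The helper names option_values and absolute_url and the shortened sample markup are illustrative only.

```perl
use strict;
use warnings;
use Mojo::DOM;
use URI;

# Assumed base address, taken from the example URL in the question.
my $base_url = 'http://www.website.com/';

# Shortened copy of the dropdown from the question (two documents only).
my $sample_html = <<'HTML';
<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
<option selected="selected">Choose One...</option>
<option value="docs/foreclosure/2018/June/Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018.pdf">Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018</option>
<option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>
HTML

# Hypothetical helper: collect the value attribute of every <option>
# directly inside the <select> with the given id. compact drops the
# "Choose One..." entry, which has no value attribute.
sub option_values {
    my ($html, $select_id) = @_;
    my $values = Mojo::DOM->new($html)
        ->find("select#$select_id > option")
        ->map(attr => 'value')
        ->compact;
    return @{ $values->to_array };
}

# Hypothetical helper: resolve a relative value against the base address.
# URI percent-encodes characters such as spaces that are not allowed in a URL.
sub absolute_url {
    my ($value, $base) = @_;
    return URI->new_abs($value, $base)->as_string;
}

for my $value (option_values($sample_html, 'jumpMenu2')) {
    print absolute_url($value, $base_url), "\n";
}
```

In the nightly job, the HTML would come from $mech->content rather than a literal string, and each resulting URL could be saved with something like $mech->get($url, ':content_file' => $filename), where $filename is whatever local name suits the archive.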