Scrape Select Options from Web Page

BrentD has asked for the wisdom of the Perl Monks concerning the following question:

I deal with a web site that publishes documents that I need to retrieve and archive. The web page is operated by a third party, so I have no control over the formatting of the page. The page used to simply have a list of links to the individual documents. I used WWW::Mechanize's find_all_links with a regex to pull the documents I need every evening.

Unfortunately, the web designer decided to get "clever" and moved the documents to a dropdown list that calls a bit of JavaScript that links to the file.

The source of the dropdown is like this:

<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
    <option selected="selected">Choose One...</option>
    <option value="docs/foreclosure/2018/June/Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018.pdf">Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/Lot 8 Blk 4 Revised Dyer Addition 6-5-2018.pdf">Lot 8 Blk 4 Revised Dyer Addition Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/0.21 acre tract, Daniel Monroe Survey June 5, 2018.pdf">0.21 acre tract, Daniel Monroe Survey June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/Lot 8 Blk 3 Westwood Additiion Rockdale June 5, 2018.pdf">Lot 8 Blk 3 Westwood Additiion Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/25 acre tract June 5, 2018.pdf">25 acre tract June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/Lot 1 Blk 121 Rockdale 6-5-2018.pdf">Lot 1 Blk 121 Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/Lot 2 &amp; West half Lot 4, Bluebird Heights, Sec 1, Rockdale.pdf">Lot 2 &amp; West half Lot 4, Bluebird Heights, Sec 1, Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/June/6,300 square ft tract Daniel Monroe Survey 6-5-2018.pdf">6,300 square ft tract Daniel Monroe Survey June 5, 2018</option>
    <option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>

The JavaScript it calls looks like this:

<script type="text/javascript">
function MM_jumpMenu(targ,selObj,restore){ //v3.0
  eval(targ+".location='"+selObj.options[selObj.selectedIndex].value+"'");
  if (restore) selObj.selectedIndex=0;
}
</script>

So, for example, I can open the first document in the list by going to http://www.website.com/docs/foreclosure/2018/June/Lots 17 & 18 Dyer Addition Rockdale June 5, 2018.pdf in my browser.

What I need is a way to pull all the option values from the dropdown so I can prepend the website address and download the files. What is the best way to do this? My Google-fu is failing me. All I seem to be able to find is info on building list boxes.


Replies are listed 'Best First'.
Re: Scrape Select Options from Web Page by haukex (Archbishop) on Jun 05, 2018 at 07:23 UTC
There are several options for parsing HTML. For something like this, Mojo::DOM seems easiest:

use Mojo::DOM;

my $dom = Mojo::DOM->new($html);
my $values = $dom->find('select#jumpMenu2 > option')
    ->map(attr => 'value')->compact;   # returns a Mojo::Collection
for my $v (@$values) {
    print "<<$v>>\n";
}

Update: But it's also possible with WWW::Mechanize directly, see my other reply.
Re^2: Scrape Select Options from Web Page by BrentD (Sexton) on Jun 06, 2018 at 21:30 UTC
Got this one to work. Thank you.
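
For the remaining step in the question (prepending the site address), here is a minimal sketch that combines the Mojo::DOM extraction above with Mojo::URL, which ships in the same Mojolicious distribution. The base address is the placeholder from the question, the HTML is a trimmed copy of the dropdown, and the variable names are illustrative assumptions rather than anything from the original code.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;

# Assumption: the page is served from this placeholder address.
my $base = Mojo::URL->new('http://www.website.com/');

# Trimmed copy of the dropdown shown in the question.
my $html = <<'HTML';
<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
    <option selected="selected">Choose One...</option>
    <option value="docs/foreclosure/2018/June/25 acre tract June 5, 2018.pdf">25 acre tract June 5, 2018</option>
    <option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>
HTML

# Collect the value attribute of every option; compact drops the
# "Choose One..." entry, which has no value attribute.
my $values = Mojo::DOM->new($html)
    ->find('select#jumpMenu2 > option')
    ->map(attr => 'value')
    ->compact;

# Resolve each relative path against the base; stringifying the URL
# percent-encodes the spaces in the file names.
my @urls = map { Mojo::URL->new($_)->to_abs($base)->to_string } @$values;

print "$_\n" for @urls;
```

Each resulting URL could then be fetched with the existing WWW::Mechanize object, for example with `$mech->get($url, ':content_file' => $filename)`, where `$filename` is whatever local name you choose.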
Re: Scrape Select Options from Web Page by Corion (Patriarch) on Jun 05, 2018 at 08:22 UTC
You could use one of the WWW::Mechanize backends that supports JavaScript to trigger the download of the PDFs. Also, you could look at what the `->current_form` returns (an HTML::Form) and then look through the potential values of a field with the `->try_others` method. Personally, I would just extract the complete `<select>` HTML using (for example) HTML::TreeBuilder::XPath and query that tree.
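
A rough sketch of the HTML::TreeBuilder::XPath route, assuming the module is installed and that the dropdown keeps the `jumpMenu2` id from the question. The HTML here is a trimmed sample and the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the dropdown shown in the question.
my $html = <<'HTML';
<select name="jumpMenu2" id="jumpMenu2" onchange="MM_jumpMenu('parent',this,0)">
    <option selected="selected">Choose One...</option>
    <option value="docs/foreclosure/2018/June/Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018.pdf">Lots 17 &amp; 18 Dyer Addition Rockdale June 5, 2018</option>
    <option value="docs/foreclosure/2018/July/105 N Johnson.pdf">105 N Johnson St. T'dale July 3, 2018</option>
</select>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Select only the value attributes, so the placeholder option
# (which has no value attribute) is skipped automatically.
my @values = $tree->findvalues('//select[@id="jumpMenu2"]/option/@value');

print "<<$_>>\n" for @values;

# Free the parse tree explicitly, as HTML::TreeBuilder recommends.
$tree->delete;
```

The parser decodes entities in attribute values, so `&amp;` in the source should come back as a plain `&` in the extracted paths.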
Re: Scrape Select Options from Web Page by NetWallah (Canon) on Jun 05, 2018 at 00:01 UTC
This is essentially an XML data extraction exercise. Use a decent XML module that has 'xpath' , or something like XML::Twig - and this should take under 10 lines of code. Memory fault -- brain fried
Re: Scrape Select Options from Web Page by haukex (Archbishop) on Jun 05, 2018 at 08:28 UTC
"I used WWW::Mechanize"

This works too:

my @files = grep { !/choose one/i }
    $mech->form_number(1)->find_input('jumpMenu2')->possible_values;
Re^2: Scrape Select Options from Web Page by BrentD (Sexton) on Jun 06, 2018 at 20:49 UTC
This isn't working. I think, possibly because the page designer doesn't have an explicit <form> tag in the page.
Re^3: Scrape Select Options from Web Page by haukex (Archbishop) on Jun 06, 2018 at 21:35 UTC
"I think, possibly because the page designer doesn't have an explicit <form> tag in the page."

Yes, it does appear that WWW::Mechanize requires the `<form>` tags to recognize inputs. Judging by your reply to my other solution, I assume you've already figured out that you can feed the HTML to Mojo::DOM with e.g. `my $dom = Mojo::DOM->new($mech->content);`.