Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks! I have been searching the Internet for a long time, trying to find code that I can use to bulk-download web pages from a list of URLs. Unfortunately wget does not work, because many of the elements are generated by JavaScript (or something similar) and do not get downloaded. Essentially I need something that resembles the browser's 'Save page as' functionality, but automated, without me doing it by hand.
Something like this, which for some reason does not work for me (it does not save the pages the way the demo shows):
https://github.com/abiyani/automate-save-page-as

Example webpages that I am trying to download (>1000 of them):
https://opm.phar.umich.edu/proteins/7839 https://opm.phar.umich.edu/proteins/4676

where I change the number at the end.
Do any of you have code that you have used for a similar task, or at least some pointers on where to begin? Any help/advice would be very welcome!
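
For illustration, something along these lines is roughly what I have in mind (a rough, untested sketch; I am assuming WWW::Mechanize::Chrome driving a locally installed headless Chrome/Chromium, and the five-second wait and the output file names are just placeholders):

    use strict;
    use warnings;
    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;

    Log::Log4perl->easy_init($ERROR);   # WWW::Mechanize::Chrome logs via Log::Log4perl

    my $mech = WWW::Mechanize::Chrome->new( headless => 1 );

    for my $id ( 7839, 4676 ) {         # extend to the full list of >1000 IDs
        $mech->get("https://opm.phar.umich.edu/proteins/$id");
        $mech->sleep(5);                # crude fixed wait for the JavaScript to finish rendering

        open my $fh, '>:encoding(UTF-8)', "protein_$id.html"
            or die "Cannot write protein_$id.html: $!";
        print {$fh} $mech->content;     # the rendered DOM, not the raw server response
        close $fh;
    }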

Re: Code for 'Save html page' that contains dynamic content?
by marto (Cardinal) on Jun 28, 2023 at 08:07 UTC

    While not a direct answer to your request for bulk downloads, are you aware of BioPerl? From the Introduction: "If you’re a molecular biologist it’s likely that you’re interested in gene and protein sequences, and you study them in some way on a regular basis. Perhaps you’d like to try your hand at automating some of these tasks, or you’re just curious about learning more about the programming side of bioinformatics. In this HOWTO you’ll see discussions of some of the common uses of Bioperl, like sequence analysis with BLAST and retrieving sequences from public databases. You’ll also see how to write Bioperl scripts that chain these tasks together, that’s how you’ll be able to do really powerful things with Bioperl." Perhaps worth investigating; while I'm not a bioinformatician, I know of people who have used BioPerl to work with proteins and the like. I mention this because, if the data is available this way, it may be a better option than having to write code to parse various sites/pages to get the data you want.
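
    For example, fetching a protein record straight from UniProt/Swiss-Prot takes only a few lines (an untested sketch; the accession 'P05067' is just a placeholder, substitute whatever you are actually after):

        use strict;
        use warnings;
        use Bio::DB::SwissProt;
        use Bio::SeqIO;

        # Fetch one protein record from UniProt/Swiss-Prot by accession
        my $db  = Bio::DB::SwissProt->new;
        my $seq = $db->get_Seq_by_acc('P05067');    # placeholder accession

        # Write it out as FASTA
        my $out = Bio::SeqIO->new( -file => '>P05067.fasta', -format => 'fasta' );
        $out->write_seq($seq);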

Re: Code for 'Save html page' that contains dynamic content?
by tobyink (Canon) on Jun 28, 2023 at 07:11 UTC

      Yes, the API is at https://opm.phar.umich.edu/download. If the OP is serious about this, they should start there and build their crawler on top of it. That said, BioPerl was already mentioned, and AFAIR R also has packages for downloading some types of bio-data (I'm not sure which).
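
      To sketch the sort of loop I mean (untested, and note that the endpoint below is purely hypothetical; check the download page for the real routes and response format):

          use strict;
          use warnings;
          use HTTP::Tiny;
          use JSON::PP qw(decode_json);

          my $http = HTTP::Tiny->new( timeout => 30 );

          for my $id ( 7839, 4676 ) {   # the OP's example protein IDs
              # NOTE: hypothetical endpoint, look up the real one on the download page
              my $url = "https://opm.phar.umich.edu/api/proteins/$id";
              my $res = $http->get($url);
              unless ( $res->{success} ) {
                  warn "Failed to fetch $url: $res->{status} $res->{reason}\n";
                  next;
              }
              my $data = decode_json( $res->{content} );
              # ... pull out whatever fields are of interest here ...
              print "Fetched record $id\n";
          }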

      Thanks! I had not seen that. But this was just one example; I have other sites as well, like:
      http://pdbtm.enzim.hu/?_=/pdbtm/1a0t

      where I want to check the coloured letters.
      In any case, if you have any suggestions (or code) that could be used for such tasks, that would be great :)
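
      To make it concrete: once a rendered page has been saved, I imagine pulling out the coloured letters with something like this untested sketch (the selector is only a guess at how that page marks them up, and '1a0t.html' is a file I would have saved beforehand):

          use strict;
          use warnings;
          use Mojo::DOM;
          use Mojo::File qw(path);

          # Parse a previously saved, fully rendered page
          my $html = path('1a0t.html')->slurp;
          my $dom  = Mojo::DOM->new($html);

          # The selector is a guess; inspect the real markup to find the right one
          $dom->find('span[style*="color"]')->each( sub {
              my $span = shift;
              print $span->text, "\t", ( $span->attr('style') // '' ), "\n";
          } );
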
Re: Code for 'Save html page' that contains dynamic content?
by cavac (Prior) on Jun 28, 2023 at 12:37 UTC

    Click on the "Contact Us" link and ask them nicely whether they can provide a download link for the whole dataset. There's a chance nobody has asked them yet...
