multiple URL web scraping

Lisa1993 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I posted here a couple of weeks ago about a problem that I was having writing a simple web-scraping programme that scraped comments from the website Reddit, and allowed me to analyse them using computational linguistics methods.

Following the very helpful advice of the forum members here, I revised my code and switched from using LWP::Simple to Mojo::UserAgent.

The following code does almost exactly what I need it to do (i.e. it downloads the comments and stores them in a way that is readable for my corpus software).

use Mojo::UserAgent;

my $url = 'https://www.reddit.com/r/unitedkingdom/comments/58m2hs/i_daniel_blake_is_released_today/.json';

my $ua   = Mojo::UserAgent->new;
my $data = $ua->get( $url )->res->json;

foreach my $comment ( @{$data} ) {
    foreach my $child ( @{ $comment->{'data'}->{'children'} } ) {
        # output path needs changing
        open(OUT, ">>C:/Users/user/perl_tests/redresults221.txt");
        # only children that carry a comment body are written out
        my $body = $child->{'data'}->{'body'};
        print OUT "$body\n" if $body;
        close(OUT);
    }
}

However, I need to add two more elements to the code:

1) I need it to download multiple URLs, either a) from a list of URLs that I define within the code (e.g. through some kind of my @URLS = command), or b) from a separate .txt file containing the URLs that I am interested in, with the existing code then run on each of them (if that makes sense!?!)

2) I need to put a delay or sleep element in the code so that my IP address does not get flagged by the site. In my original programme I used the command sleep(int(rand(30))); should this still work with the Mojo::UserAgent library?

Thanks in advance for your help, it is really appreciated. What you guys do for us beginners is pretty extraordinary!

EDIT: In my original post I should have made it clear that the code that I posted was very kindly written by marto. I never meant to imply that I had written it myself, but I can understand that my wording was very careless.

I now have working code for my problem, based on three pieces of code that were generously written for me by marto, Athanasius and stevieb:



use Mojo::UserAgent;

my @urls = qw(
    https://www.example1.com.json
    https://www.example2.com.json
    https://www.example3.com.json
);

# one user agent can be reused for every request
my $ua = Mojo::UserAgent->new;

for my $URL (@urls) {
    my $data = $ua->get( $URL )->res->json;

    # random pause of 0 to 59 seconds after each request
    sleep(int(rand(60)));

    foreach my $comment ( @{$data} ) {
        foreach my $child ( @{ $comment->{'data'}->{'children'} } ) {
            # output path needs changing
            open(OUT, ">>C:/Users/user/perl_tests/redresults805.txt");
            # only children that carry a comment body are written out
            my $body = $child->{'data'}->{'body'};
            print OUT "$body\n" if $body;
            close(OUT);
        }
    }
}
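
The nested loops above walk a decoded JSON structure: an array of listings, each holding data -> children, where each child may carry a data -> body string. The following is a minimal sketch of that traversal on hand-made sample data, so that it can be tried without touching the network. The helper name comment_bodies and the sample contents are hypothetical, and the shape is assumed from the code above rather than taken from any official description of the site's JSON.

```perl
use strict;
use warnings;

# Collect every non-empty 'body' string found under data -> children -> data.
# comment_bodies is a hypothetical helper name used only for this sketch.
sub comment_bodies {
    my ($listings) = @_;
    my @bodies;
    for my $listing ( @{ $listings || [] } ) {
        for my $child ( @{ $listing->{data}{children} || [] } ) {
            my $body = $child->{data}{body};
            push @bodies, $body if defined $body && length $body;
        }
    }
    return @bodies;
}

# Hand-made sample shaped like the structure the loops expect (not real data).
my $sample = [
    { data => { children => [ { data => { title => 'a post, no body' } } ] } },
    { data => { children => [
        { data => { body => 'first comment' } },
        { data => { body => 'second comment' } },
    ] } },
];

print "$_\n" for comment_bodies($sample);
```

Passing undef (for instance when a response could not be decoded as JSON) simply yields an empty list instead of dying under strict.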


Replies are listed 'Best First'.
Re: multiple URL web scraping by haukex (Archbishop) on Nov 02, 2016 at 14:21 UTC
Hi Lisa1993, It sounds like you're fairly new to Perl, so I'd suggest you have a look at perlintro, which goes over many of the basics. In this case a loop, something like `foreach my $url (@urls) { ... }`, might be appropriate for you. It also covers reading from files: have a look at the Files and I/O section, which shows you how to use open, while and close. A tip, since that section doesn't mention it: have a look at chomp too. As for your second question, yes, sleep, int, and rand are Perl built-in functions, so you can always use them anywhere. Hope this helps, -- Hauke D
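
As an illustration of the open, while, close and chomp combination mentioned above, here is a minimal sketch that reads one URL per line from a text file. The helper name read_urls and the file name urls.txt are assumptions made for this example, not something from the thread.

```perl
use strict;
use warnings;

# Read one URL per line from a text file, skipping blank lines.
# read_urls is a hypothetical helper name used only for this sketch.
sub read_urls {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my @urls;
    while (my $line = <$in>) {
        chomp $line;                  # remove the trailing newline
        $line =~ s/^\s+|\s+$//g;      # trim stray whitespace (including \r)
        next unless length $line;     # skip empty lines
        push @urls, $line;
    }
    close($in);
    return @urls;
}

# Usage (assumes a file called urls.txt exists next to the script):
# foreach my $url (read_urls('urls.txt')) { ... }
```

The returned list can then be used in place of a hard-coded @urls array in the foreach loop.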
Re^2: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 08:09 UTC
Thank you very much for the helpful advice and links!
Re: multiple URL web scraping by Corion (Patriarch) on Nov 02, 2016 at 14:14 UTC
Re^3: Question regarding web scraping seems to implement such a loop. Does that code work for you, or where are you experiencing problems? Please note that this is not a code writing service. Maybe you want to investigate how to read lines from a text file in Perl.
Re^2: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 08:45 UTC
Thank you very much! The loop wasn't working for me, so I had assumed that it was a coding difference between libraries. However, following your advice, I went back and double checked, and I had made the smallest mistake. Thanks again for your help, and sorry if this post was inappropriate: you guys have helped me so much and I wouldn't want to annoy any of you!
Re^3: multiple URL web scraping by haukex (Archbishop) on Nov 03, 2016 at 09:08 UTC
Hi Lisa1993, "Thanks again for your help, and sorry if this post was inappropriate" - We get a lot of posts from people with thinly veiled requests of "do my (home)work for me for free". To set yourself apart from those, I'd recommend you follow the advice in these links: How do I post a question effectively?, Short, Self Contained, Correct Example, and Basic debugging checklist, and, what I think is most important, show your own efforts! If you've solved your own problem, you could post an update showing how you solved it, so that other wisdom seekers may learn from your question and its resolution. I think most monks are very happy to help those who want to learn! Regards, -- Hauke D
Re^4: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 09:23 UTC
Re: multiple URL web scraping by marto (Cardinal) on Nov 02, 2016 at 14:25 UTC
As I suggested earlier, create a loop for the URLs, sleep for a few seconds between each. Which part is problematic?
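
A loop with a pause between requests could be sketched roughly as follows. Note that sleep(int(rand(30))) can evaluate to sleep(0), i.e. no pause at all, so this sketch adds a minimum delay. The helper names polite_delay and fetch_all and the 5 to 30 second range are assumptions for illustration; the actual download is passed in as a code reference, so the loop itself makes no network calls.

```perl
use strict;
use warnings;

# Return a random whole number of seconds between $min and $max, inclusive.
# polite_delay is a hypothetical helper name used only for this sketch.
sub polite_delay {
    my ($min, $max) = @_;
    return $min + int(rand($max - $min + 1));
}

# Call $fetch for every URL, pausing between requests but not after the last.
sub fetch_all {
    my ($urls, $fetch, $min, $max) = @_;
    for my $i (0 .. $#{$urls}) {
        $fetch->($urls->[$i]);
        sleep(polite_delay($min, $max)) if $i < $#{$urls};
    }
    return;
}

# Usage (the callback would wrap the Mojo::UserAgent code from the question):
# fetch_all(\@urls, sub { my ($url) = @_; ... }, 5, 30);
```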
Re^2: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 08:49 UTC
Apologies, I had made a stupid mistake in my version of the code, and it was throwing the whole loop out. Sorry for wasting your time, and thank you for all your help.
Re^3: multiple URL web scraping by marto (Cardinal) on Nov 03, 2016 at 09:31 UTC
I don't see a while loop; the code you posted doesn't have any additional URLs to scrape. It's always better to post the code you are actually running when asking for help. One thing I would suggest with your code: use three-argument open (see open / Three-arg open()).
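
For reference, a minimal sketch of the three-argument form with a lexical filehandle and error checking, opening the file once rather than once per comment. The helper name append_lines and the file name in the usage comment are assumptions for this example.

```perl
use strict;
use warnings;

# Append each non-empty string in @$lines to the file at $path.
# append_lines is a hypothetical helper name used only for this sketch.
sub append_lines {
    my ($path, $lines) = @_;
    # Three-arg open: the mode ('>>') is separate from the file name,
    # and failures are reported instead of being silently ignored.
    open(my $out, '>>', $path) or die "Cannot open $path for appending: $!";
    for my $line (@{$lines}) {
        next unless defined $line && length $line;
        print {$out} "$line\n";
    }
    close($out) or die "Cannot close $path: $!";
    return;
}

# Usage (the file name is only an example):
# append_lines('results.txt', \@comment_bodies);
```

Keeping the mode separate means a file name that happens to begin with a character such as > cannot change how the file is opened.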
Re^4: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 09:39 UTC
Re: multiple URL web scraping by Anonymous Monk on Nov 02, 2016 at 14:41 UTC
"Following the very helpful advice of the forum members here, I revised my code and switched from using LWP::Simple to Mojo::UserAgent." - but in reality, someone posted working code, and you're now saying you did this?
Re^2: multiple URL web scraping by Lisa1993 (Acolyte) on Nov 03, 2016 at 08:08 UTC
Oh, sorry! I never meant to imply that I had written the code, I was simply trying to explain that I had revised my original plan (in which I used LWP::Simple) to using Mojo::UserAgent, and that was why I was posting a very similar question. However, I recognise now that my wording was very careless. Thanks for calling me out on it: I have now amended my post and will be more thoughtful in the future. Again, apologies for any offense.