Re: Question regarding web scraping

The following is a malformed regular expression:

while ($CONTENT =~ <div class=\"usertext-body may-blank-within md-container \"><div class=\"md\">(.+?)<\/div><\/div><\/form><ul class=\"flat-list buttons\"> //gs )

At the very least, it is missing the opening s/ of the operator.
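If the goal is to loop over every match and capture the comment body, a match operator (m//) is probably closer to what you want than a substitution. Here is a minimal sketch under that assumption. The sample $CONTENT, the $start and $end variables and the @comments array are made up for illustration, and I dropped the trailing space from the end of your pattern. Using m{...} as the delimiter avoids having to escape every slash, and \Q...\E quotes the literal HTML.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the downloaded page.
my $CONTENT = join '',
    '<form><div class="usertext-body may-blank-within md-container ">',
    '<div class="md"><p>First comment</p></div></div></form>',
    '<ul class="flat-list buttons"></ul>',
    '<form><div class="usertext-body may-blank-within md-container ">',
    '<div class="md"><p>Second comment</p></div></div></form>',
    '<ul class="flat-list buttons"></ul>';

# Literal markup that surrounds each comment body.
my $start = '<div class="usertext-body may-blank-within md-container "><div class="md">';
my $end   = '</div></div></form><ul class="flat-list buttons">';

my @comments;

# /g in scalar context steps through the matches one at a time;
# /s lets . match newlines inside a comment body.
while ( $CONTENT =~ m{\Q$start\E(.+?)\Q$end\E}gs ) {
    push @comments, $1;
}

print "$_\n" for @comments;
```

This still breaks as soon as the markup changes even slightly, which is the main reason to prefer a real HTML parser.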

Personally, I suggest extracting the content with HTML::TreeBuilder and XPath or CSS selectors (via HTML::TreeBuilder::XPath and HTML::Selector::XPath).
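A rough sketch of the XPath route, assuming the comment bodies live in a div with class "md" directly inside a div whose class contains "usertext-body" (taken from the regex above). The sample HTML and the variable names are invented for illustration, and the XPath expression will likely need adjusting against the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup modelled on the class names in the regex.
my $html = <<'HTML';
<html><body>
<form><div class="usertext-body may-blank-within md-container "><div class="md"><p>First comment</p></div></div></form>
<form><div class="usertext-body may-blank-within md-container "><div class="md"><p>Second comment</p></div></div></form>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select each div.md that is a direct child of a "usertext-body" div
# and keep only its text content.
my @comments = map { $_->as_text }
    $tree->findnodes('//div[contains(@class, "usertext-body")]/div[@class="md"]');

print "$_\n" for @comments;

# HTML::TreeBuilder trees should be freed explicitly when done.
$tree->delete;
```

Unlike the regex, this does not care about whitespace, attribute order or what follows the closing tags.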

Also note that Reddit has an API available, so you may not need to scrape at all and can instead get the comments directly in a machine-readable format.

There are also many Reddit modules on CPAN; Reddit::Client, for one, appears to use the Reddit API.

Re^2: Question regarding web scraping by Gangabass (Vicar) on Oct 23, 2016 at 05:42 UTC
I don't recommend HTML::TreeBuilder::XPath because it is really slow on large files (> 1 MB). HTML::TreeBuilder::LibXML is much faster and has almost the same syntax.
Re^2: Question regarding web scraping by Lisa1993 (Acolyte) on Oct 22, 2016 at 15:30 UTC
Thank you very much! I will look into these alternatives. Thanks again for your suggestions.