Nyro46 has asked for the wisdom of the Perl Monks concerning the following question:

Okay, so before anything: I'm not really that experienced with coding, so bear with me.

I'm trying to use Perl to export a bunch of images from a Wikia site at once, so I can upload them to another wiki. I've been following this guide: https://www.mediawiki.org/wiki/Exporting_all_the_files_of_a_wiki

I'm currently stuck on step 3, where I'm guessing it's supposed to be fetching all the direct file download links. The problem is, every time it just keeps listing "b/bc/Wiki.png" (which is a file on the wiki — it's the logo seen when you use the monobook skin — but it has nothing to do with the files I'm trying to download), and every once in a while it throws in a "404 not found" error. Also, the first couple of times I let it run all the way through, it stopped at 417, but there are just over 700 files in total.

I'll post the code I have saved as a .pl file at the moment in case I have something written down wrong (I will cut out a chunk of the file list though):

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

my @myFileName=('');
$myFileName[0]="Kniro-Lippies V.6 Concept.JPG";
$myFileName[1]="Kniro concept thing.png";
$myFileName[2]="Kniro og.png";
...
$myFileName[700]="Theta's redesign.jpg";
$myFileName[701]="THETA..jpg";
$myFileName[702]="Lippies Book 8 Page 10.jpg";

my $agentName="User:Nyro_the_Leopard (http://lippies.shoutwiki.com/wiki/User:Nyro_the_Leopard) grabbing some data using ExtractImages.pl";
my $browser = LWP::UserAgent->new();
$browser->timeout(500);

my $string='crappyfartsgohome/images/';
my $endString='"';
my $position=0;
my $endPosition=0;
#my $prefix='http://vignette.wikia.nocookie.net/crappyfartsgohome/images/;
my $prefix='';
my $delimiter="\n";
my $reject1='OKAY_I_SERIOUSLY_CANNOT.png);';
my $reject2='Yum yum.jpg';
my $newArrayIndex=0;

for (my $count=0; $count<=417; $count++){
    my $url="http://crappyfartsgohome.wikia.com/wiki/File:".$myFileName[$count];
    my $request = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);
    if ($response->is_error()) {printf "%s\n", $response->status_line;}
    my $contents = $response->content();
    $position=index($contents,$string,0)+length($string);
    $endPosition=index($contents,$endString,$position);
    my $fileName=substr($contents,$position,$endPosition-$position);
    if ($position!=-1 && $fileName ne $reject1 && $fileName ne $reject2){
        #print $prefix.$fileName.$delimiter;
        print '$myFileName['.$newArrayIndex.']="'.$fileName.'";'.$delimiter;
        $newArrayIndex++;
    }
}
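One thing worth checking in the loop above: `index` returns -1 when the search string isn't found, but the code adds `length($string)` to the result *before* testing against -1, so the `$position!=-1` guard can never fire and the script will happily `substr` garbage out of pages with no match. A minimal demo of the problem, with made-up page content:

```perl
use strict;
use warnings;

# index() returns -1 when the substring isn't found, but adding
# length($string) to that result before the != -1 test means the
# "not found" case can never be detected.
my $string   = 'crappyfartsgohome/images/';
my $contents = 'a page with no match in it at all';

my $found    = index($contents, $string, 0);  # -1 here: not found
my $position = $found + length($string);      # 24, which passes != -1

print "index: $found, position: $position\n";

# Safer: test the raw index() result before adding the offset.
print "no match - skip this page\n" if $found == -1;
```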

For the "my $agentName" thing, I'm guessing I just put my username and profile from the wiki I'm going to be uploading the files to? I don't know, but I don't think it really matters, since it was in the previous code I used to get the file name list that's there now.

I'm not sure if "my $prefix" should be the one that's there right now, since that's the first part of the direct download links to files on that wiki.

I tried taking the pound symbol away from in front of the "my $prefix" part, and then the terminal gave me errors about "my $reject1" and "my $reject2". Right now I just put in the file names of two stupid images on the wiki, because I honestly have no idea what's supposed to be there. The mediawiki example had "LiberterianWiki.gif" and "icons/fileicon-pdf.png". I tried putting those in (though I changed LiberterianWiki to CrappyFartsGoHomeWiki), but it still gave me the same errors. I'm not even sure if the pound symbol in front of "my $prefix" is part of the original problem or not. If it's not supposed to be there, well then good, but then I need to figure out what is supposed to go in the reject things.

If anyone is able to help me with this, it's very much appreciated. I want to be able to actually succeed at something coding-related too. (Also yeah don't ask about the titles of my wikis ...)

Replies are listed 'Best First'.
Re: I keep getting "b/bc/Wiki.png" instead of the actual thing I want
by tangent (Parson) on Apr 06, 2016 at 22:29 UTC
    What you get back from that URL is a full HTML page, so I think you will need to bring in an HTML parser. I'm guessing you are looking to get at the actual image file; this is how you could get the URL for that using HTML::TreeBuilder::XPath:
    use HTML::TreeBuilder::XPath;
    # ...
    for my $count ( 0 .. 417 ) {
        my $url = "http://crappyfartsgohome.wikia.com/wiki/File:".$myFileName[$count];
        my $request = HTTP::Request->new(GET => $url);
        my $response = $browser->request($request);
        if ($response->is_error()) {print $response->status_line, "\n";}
        my $contents = $response->content();
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($contents);
        $tree->eof;
        my @links = $tree->findnodes('//div[@class="fullImageLink"]/a');
        my $image_link = $links[0];
        my $image_url  = $image_link->attr('href');
        print "$myFileName[$count]\n$image_url\n\n";
        next if $image_url =~ m/$reject1/;
        next if $image_url =~ m/$reject2/;
        # do stuff with image URL
    }
    OUTPUT:
    Kniro-Lippies V.6 Concept.JPG
    http://vignette2.wikia.nocookie.net/crappyfartsgohome/images/b/b3/Kniro-Lippies_V.6_Concept.JPG/revision/latest?cb=20140714143948

    Kniro concept thing.png
    http://vignette3.wikia.nocookie.net/crappyfartsgohome/images/4/43/Kniro_concept_thing.png/revision/latest?cb=20140714143947

    Kniro og.png
    http://vignette3.wikia.nocookie.net/crappyfartsgohome/images/3/3e/Kniro_og.png/revision/latest?cb=20140714143946
    Note that I changed $reject1 to 'OKAY_I_SERIOUSLY_CANNOT.png'
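    For the "do stuff with image URL" part, one approach (a sketch — I haven't run this against the live site, and `local_name` is a helper name I made up): derive a local filename from the vignette URL, then mirror the file. The sample URL below is copied from the output above.

```perl
use strict;
use warnings;

# Sketch: turn a vignette image URL into a sensible local filename.
# These URLs end in "/revision/latest?cb=...", so strip the query
# string and the trailing /revision/latest before taking the basename.
sub local_name {
    my ($url) = @_;
    my ($path) = split /\?/, $url;      # drop the ?cb=... query string
    $path =~ s{/revision/latest\z}{};   # drop the revision suffix
    my ($name) = $path =~ m{([^/]+)\z}; # basename of what's left
    return $name;
}

# Sample URL taken from the output above:
my $image_url = 'http://vignette2.wikia.nocookie.net/crappyfartsgohome'
              . '/images/b/b3/Kniro-Lippies_V.6_Concept.JPG'
              . '/revision/latest?cb=20140714143948';

print local_name($image_url), "\n";   # Kniro-Lippies_V.6_Concept.JPG

# Inside the loop you could then do something like:
#   use LWP::UserAgent;
#   my $ua = LWP::UserAgent->new;
#   $ua->mirror($image_url, local_name($image_url));
```

    LWP::UserAgent's mirror() also skips files you've already downloaded, which is handy if the script dies partway through 700 files.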

      Okay ... I installed the HTML treebuilder thing through cpan or whatever.

      Am I supposed to take everything from mine and replace the "# ..." part? I still get some errors (especially around the rejects)

      Are you able to paste your entire script, including the other part from mine, so I can check to make sure I have it typed right? Thanks :)

        You could paste yours and we can tell if you typed it right ;)