nurulnad has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to automate a download from a website. The link is http://www.pdb.org/pdb/explore/explore.do?structureId=2CU3

Here "2CU3" can be replaced by any structure ID; in my code the variable for this is $input, and all the structureIds are listed in a text file named 'data.txt'.

After that, I want to get a link from that webpage. The link url is http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=FASTA&compression=NO&structureId=2CU3.

My problem is that when this link is downloaded, the file is just junk if I open it in TextEdit. If I open it in TextWrangler, the content is fine. Any idea what is causing this and how to fix it?

My code is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use Data::Dumper;

open my $list, '<', 'data.txt' or die "Cannot open data.txt: $!";

while ( my $input = <$list> ) {
    chomp $input;

    # Download the PDB explore page for this structure ID
    my $url  = "http://www.rcsb.org/pdb/explore.do?structureId=$input";
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    # Write the extracted data to an output file (.html)
    my $file = "$input.html";
    print "$file\n";
    open my $out, '>', $file or die "Cannot open $file: $!";
    print $out Dumper($mech);
    close $out;

    # Download the link (FASTA sequence)
    my $linkname = "fileFormat=FASTA&compression=NO&structureId=$input";
    my @links    = $mech->find_all_links( url_regex => qr/\Q$linkname\E/ );
    for my $link (@links) {
        my $url      = $link->url_abs;
        my $filename = $url;
        $filename =~ s[^.+/][];
        print "Fetching $url";
        $mech->get( $url, ':content_file' => $filename );
        print " ", -s $filename, " bytes\n";
    }
}
close $list;
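One way to narrow down where the "junk" comes from is to inspect the downloaded bytes directly, rather than trusting any editor's rendering. This is a minimal, hypothetical check (the helper name and the default file name `2CU3.fasta` are assumptions, not part of the code above); a real FASTA file should contain only printable ASCII plus ordinary line endings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report whether a string looks like plain text.
# Tab, LF, CR, and printable ASCII are allowed; any other byte suggests
# the file is binary (for example, gzip-compressed rather than plain FASTA).
sub looks_like_text {
    my ($data) = @_;
    return $data !~ /[^\x09\x0A\x0D\x20-\x7E]/;
}

my $file = shift // '2CU3.fasta';    # assumed file name
open my $fh, '<', $file or die "Cannot open $file: $!";
local $/;                            # slurp the whole file
my $data = <$fh>;
close $fh;

print looks_like_text($data)
    ? "$file looks like plain text\n"
    : "$file contains binary bytes (maybe a compressed download?)\n";
```

If this reports binary bytes, the problem is in the download, not the editor; if it reports plain text, the two editors are simply rendering the same bytes differently.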
--------------------------------------------------------------------

Thanks for the replies. This isn't just a TextEdit question: I can't manipulate the data I get because it's junk. By "junk" I mean that instead of text I see characters like (*^%&&^(*&^(*. It perplexes me that TextWrangler displays the same data correctly.

I'll try all your suggestions today. Thanks again!

Re: using WWW::Mechanize to download a link, opened fine in TextWrangler but as junk in TextEdit
by graff (Chancellor) on Aug 18, 2010 at 09:01 UTC
    When I tried your script (making up my own "data.txt" file with just the one "input" value you mentioned -- "2CU3"), I got two output files:

    One, called "2CU3.html", contained the output of Data::Dumper, as expected; this was a fairly big file, and probably not the one that you really intend to use with any sort of text editor.

    The other, whose file name was this long url string:

    downloadFile.do?fileFormat=FASTA&compression=NO&structureId=2CU3
    was pretty small (188 bytes), containing just the following four lines of plain text:
    >2CU3:B|PDBID|CHAIN|SEQUENCE
    MVWLNGEPRPLEGKTLKEVLEEMGVELKGVAVLLNEEAFLGLEVPDRPLRDGDVVEVVALMQGG
    >2CU3:A|PDBID|CHAIN|SEQUENCE
    MVWLNGEPRPLEGKTLKEVLEEMGVELKGVAVLLNEEAFLGLEVPDRPLRDGDVVEVVALMQGG
    This latter file was perfectly legible with TextEdit on my mac (osx 10.6.4) -- that is, when I just cat the file content out in a Terminal window (which is where the above lines were copy/pasted from), it looks the same as when I open it in TextEdit.

    I've never used TextWrangler (don't have it installed), so I don't know if it would look any different there. But I don't see a need to try, since TextEdit seems to be showing me the exact content of the FASTA file, without any trouble.

    So, what do you mean, exactly, when you say "it's just junk in textedit"?

    BTW, I would suggest that you change your method of coming up with an output file name for the FASTA files. Having question marks and ampersands in file names can be a real drag if you ever end up doing command-line shell operations on them. It might be sufficient just to add one line of code:

    $filename =~ s[^.+/][];  # you have this one already
    $filename =~ tr/?&/_/;   # just add this one (turns all ? and & into _)
    If any of the urls ever contain a space, asterisk, semi-colon, exclamation mark, vertical-bar (|), parens, brackets, or single or double quotes, you'll want to add those to ? and & in the tr/// statement, as well.
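Putting that advice together, here is a small sketch of what a more thorough sanitizer might look like. The helper name is hypothetical; the character list in the tr/// is just the set mentioned above, and you could extend it further:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a download URL to a shell-safe file name.
sub url_to_filename {
    my ($url) = @_;
    (my $name = $url) =~ s[^.+/][];    # keep only the last path segment
    # Replace shell metacharacters (?, &, ;, |, !, *, quotes, parens,
    # brackets, spaces) with underscores.
    $name =~ tr{?&;|!*'"()[] }{_};
    return $name;
}

my $url = 'http://www.rcsb.org/pdb/download/downloadFile.do'
        . '?fileFormat=FASTA&compression=NO&structureId=2CU3';
print url_to_filename($url), "\n";
# prints: downloadFile.do_fileFormat=FASTA_compression=NO_structureId=2CU3
```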

    One last point -- I don't know for sure, but maybe if you added a ".txt" extension to the output FASTA file name, your TextEdit might behave better? (When I used TextEdit, it opened the file just fine as-is, but I could imagine the possibility of "user preferences" having some unexpected side-effect...)

      Thank you for the advice. I'll keep that in mind. You might have just saved days of work for me in the future by just telling me that.

I appreciate you trying out the code. I wonder why it works for you and not for me. However, I've tried another method, using the WWW::PDB module, and that works like a charm.

Re: using WWW::Mechanize to download a link, opened fine in TextWrangler but as junk in TextEdit
by cdarke (Prior) on Aug 18, 2010 at 08:03 UTC
    This is really a question for TextEdit, rather than Perl. Presumably the file is in a format that TextWrangler can understand but TextEdit cannot, presumably FASTA. I don't know anything about FASTA format, but I note that there are a number of modules on CPAN in the Bio namespace which might help.
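FASTA itself is a plain-text format, so any editor should cope with it; but if the goal is to work with the sequences programmatically, the Bio namespace modules mentioned above can parse the file directly. A sketch assuming BioPerl is installed (the file name `2CU3.fasta` is hypothetical, standing in for any FASTA file downloaded by the script above):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;    # from the BioPerl distribution on CPAN

# Hypothetical file name; any downloaded FASTA file would do.
my $in = Bio::SeqIO->new( -file => '2CU3.fasta', -format => 'fasta' );

# Print each sequence's ID and length.
while ( my $seq = $in->next_seq ) {
    printf "%s: %d residues\n", $seq->display_id, $seq->length;
}
```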
Re: using WWW::Mechanize to download a link, opened fine in TextWrangler but as junk in TextEdit
by Khen1950fx (Canon) on Aug 18, 2010 at 10:58 UTC
    I think that you were looking for something like WWW::PDB. This seems easier:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::PDB qw(:all);

    WWW::PDB->cache('/user/Desktop/mech');
    my $fh = get_structure('2cu3');
    print while <$fh>;
      Thank you so much. Thank you so much!!