cboPerl has asked for the wisdom of the Perl Monks concerning the following question:

I have two fastafiles
I want to compare both and look for best hits:
I am using the following code:

#!/usr/bin/perl -w use strict; open (IN, ARGV[0]); my %assem = (); while(my $line = <IN>){ chomp $line; my @parseblast = split(/\s+/, $line); $parseblast[0] =~ s/\s+//g; $parseblast[1] =~ s/\s+//g; print $parseblast{0}."\n"; $assem{$parseblast[0]} = $parseblast[1]; } close IN; open(IN, $ARGV[1]); my %clc = (); while(my $line = <IN>){ chomp $line; my @assemblast = split(/\s+/, $line); $assemblast[0] =~ s/\s+//g; $assemblast[1] =~ s/\s+//g; print $assemblast[0]."\n"; $clc{$assemblast[0]} = $assemblast[1]; } close IN; my $c=0 while(my $k = each(%assemblast)){ if($clc{$assemblast{$k}} eq $k){ $ c++; } } print $c;

when running:

perl compareAssemblies.pl Assembly1.fa Assembl2.fa

the program generates the following:

syntax error at compareAssem.pl line 4, near "ARGV[" Global symbol "%parseblast" requires explicit package name at compareA +ssem.pl line 11. Global symbol "%assemblast" requires explicit package name at compareA +ssem.pl line 29. syntax error at compareAssem.pl line 29, near "){" Global symbol "%assemblast" requires explicit package name at compareA +ssem.pl line 30. Global symbol "$k" requires explicit package name at compareAssem.pl l +ine 30. Global symbol "$k" requires explicit package name at compareAssem.pl l +ine 30. Global symbol "$c" requires explicit package name at compareAssem.pl l +ine 31. Global symbol "$c" requires explicit package name at compareAssem.pl l +ine 34. Execution of compareAssem.pl aborted due to compilation errors.

Please could someone help me to resolve these errors?

Replies are listed 'Best First'.
Re: Compare fasta files
by davido (Cardinal) on Nov 20, 2016 at 03:12 UTC

    I started trying to provide comments inline in your code to explain the error messages you were seeing, the bugs that weren't showing up in the error messages, and to introduce a few best practices that don't manifest as bugs but need to be considered. But the code became illegible. So here's a few areas to focus on:

    • Read perlintro, strict, perlsub, and my to gain an understanding of why you are using strict near the top of your program, and what that means to you. It's rarely a good idea to just use a line of code without knowing what it is for. Once you understand what it is for, you will quickly see the reason for many of the error messages you're seeing.
    • Read open and perlopentut to understand "three-arg open, and why you should be using it. Also to understand the importance of, and how to accomplish checking the return value of system calls such as open and close. Your code actually commits the big no-no of accepting user input (via @ARGV) and then passing it through to the shell (via open IN, $ARGV[0]). If you trust the user, which may be just you anyway, it's probably not such a big deal, but if this script were ever to be invoked by a web-app, look out.
    • In perlintro and perldata you can gain an understanding of the difference between Perl's different data types, or containers. A hash starts with a %, and is indexed as $hash{key}. And an array starts with a @, and is indexed as $array[index]. Declaring my @array and then later calling $array{index} is actually looking in %array for an element. They are distinct and different containers.
    • Statements are delimited with a trailing semicolon. The semicolon may be omitted at the last statement in a block, and at the last statement in a program, but often are not omitted because it's just a good habit to hit that ; when your statement terminates.
    • Although Perl will often let you put whitespace between the sigil and the beginning of an identifier, it's sometimes ambiguous, and always difficult for the reader of the code later on. $foo, please, not $ foo. You'll thank yourself later (or curse yourself the first time $ foo causes a difficult to find bug)
    • Variables always start with a sigil. When referring to an element in an aggregate container, that sigil is $. When referring to a scalar variable that sigil is also $. When referring to an aggregate container as a whole, the sigil is @ or %, depending on the type of container. And when taking a slice of an aggregate container, the sigil is @. Disavow any knowledge of * and & for now. You'll use & later when you need it, and on rare occasion *, but for now you don't need them. ;) @ARGV also starts with a sigil, and when indexing its elements, start with $.

    Dave

Re: Compare fasta files
by johngg (Canon) on Nov 20, 2016 at 00:11 UTC

    At first glance:-

    • $parseblast1 should be $parseblast[1]

    • $parseblast{0} should be $parseblast[0]

    • $ARGV1 should be $ARGV[1]

    • $assemblast1 should be $assemblast[1]

    • [doc://each] returns key <b>and</b> value, I suspect you might need [doc://keys]

    • each(%assemblast) should be each(%assem)

    Further points:-

    • You should indent your code to make it easier to determine the logic employed.

    • Use the three-argument from of open, lexical file handles and check for success, e.g. open my $fh, '<', '/path/to/file' or die "open failed: $!\n";

    They may be other problems but clear up the ones above and the debugging task will get smaller.

    Update: OP added code tags, fixed some of the subscripting problems and indented code while I was posting so much of my advice is moot.

    Cheers,

    JohnGG

      Update: OP added code tags and indented code while I was posting so my advice to indent code is moot.

      Not quite. One of the Janitors fixed the formatting, adding code tags and paragraph or br tags to make what your browser renders appear more like what the OP seemed to intend. In so doing, it was also necessary to remove </ ... >, which was inexplicably wrapping every line of code. In the "edit" mode one can see how text was entered, making it possible, though often labor intensive to decipher and restore original intent.


      Dave

Re: Compare fasta files
by GotToBTru (Prior) on Nov 20, 2016 at 03:35 UTC

    It appears that comparing the files consists in comparing the first two elements in each line. This doesn't make sense based on the information I have seen about the format of FASTA files. Perhaps you can provide a short sample of two of the files you will be working with?

    Also, you can't use simple scalar comparisons with hashes. Modules like Data::Compare can help. See also compare hash.

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

      I am posting the header of one of the .fasta file:
      >167440 TCONS_00167441 scaffold_2269+ 284-1043 AGGGCTCAAGCTTTATTTCACGTAGCTGACTTTACCGTCAGCTCAATTGGAATAGTTTTT CGCTATGTTCGCAGGCAAGTGAGACGATCCATCAATGCCCTTATCTGCTTCGAAAGAACC GGTGTCATCCAAACATGGTGAAGAGGTGGCAACTGGATCAATAATAGCTGAAACTTCTAC TGTACAGGGTTCGGCTTGCCCAACTGTCCAAGCTTGAGATCTATTTTAGAATATGCTTAA CACAACACATGCAATTCGAACGTTGTTTTCTCGGAAAGATTTGAAAGTAACTCCGTTGGG TTCAATGCCCGCTAGTCCCATGCATCCTTTCTGTTGGTCAACAACCAACCACAAGTCAAT CGAATGAATTCTTCAAGACTCCGGACTCTCTTTCTGTCCGGAGGGAATCATTGTTTCTCA ATCAATCATGCCTCAACTGGATAAATTCACTTATTTCACACAATTTTTCTGGTCATGCCT TTTCCTCTTTACTTTTTATATTCCCATATGCAATGATGGAGATGGAGTACTTGGGATCAG

        Your code does not account for the header line, nor would it work on the sequence data since it contains no white space. The replies to How to get non-redundant DNA sequences from a FASTA file? might provide some good insight into how to work with your files. There are packages Bio::Perl and Bio::SeqIO that you might find useful.

        In general, I strongly suggest you use Super Search and search for FASTA. I'd suggest restricting the search to root nodes (there are radio buttons to exclude replies). See what your colleagues have been asking, because questions about FASTA files come pretty regularly here.

        But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

Re: Compare fasta files
by Cristoforo (Curate) on Nov 19, 2016 at 23:33 UTC
    I can't read this. Please enclose your code in <c> ... </c> tags.

    Some of your errors seem to be using your array @assemblast as a hash.