newbio has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

I am trying to do a pattern matching exercise. What I am trying to do:

Pick words from a file ($wordfile - each line contains a main term and its synonyms separated by a space) and see if those words (or their synonyms) appear in sentences of another file ($textfile - each line contains textID and associated sentences).

Below is my code.

1. I am getting the following error. When I comment "use warning" in the header this error goes off. What is this error and how to remove it?

main::ReadDataInHash() called too early to check prototype at D:\wordm +atch.pl line 19.
2. Also, I wanted to find out how I can improve upon my program so that it runs faster. Is there a way to avoid looping over each key of hash %{$List1Ref}, in, foreach my $p (sort keys (%{$List1Ref})). Any other faster way?

$wordfile SSN3 CDK8 GIG2 NUT7 RYE5 SRB10 UME5 . . $textfile 17170106|Perturbation of the activity of replication origin by meiosis + specific transcription.|We have determined the activity of all ARSs +on the Saccharomyces cerevisiae chromosome VI as chromosomal replicat +ion origins in pre-meiotic S-phase by neutral/neutral 2D gel-electrop +horesis. The comparison of origin activity of each origin in mitotic +and pre-meiotic S-phase showed that one of the most efficient origins + in mitotic S-phase, ARS605, was completely inhibited in pre-meiotic +S-phase. ARS605 is located within the ORF of Msh4 gene that is transc +ribed specifically during an early stage of meiosis. Systematic analy +ses of relationships between Msh4 transcription and ARS605 origin act +ivity revealed that transcription of Msh4 inhibited the ARS605 origin + activity by removing ORC from ARS605. Deletion of UME6 {{UME6}}, a t +ranscription factor responsible for repressing Msh4 during mitotic S- +phase, resulted in inactivation of ARS605 in mitosis. Our finding is +the first demonstration that the transcriptional regulation on the re +plication origin activity is related to changes in cell physiology. T +hese results may provide insights into changes in replication origin +activity in embryonic cell cycle during early developmental stages. . .
Thank you very much.

Raj

My code:

#!/usr/bin/perl use warnings; use strict; if ($#ARGV != 4) { print "usage: run batch file 'run' not this one\n"; exit; } my $wordfile = $ARGV[0]; my $textfile=$ARGV[3]; my $OutPutFile=$ARGV[4]; open (IF1,"$wordfile")|| die "cannot open the file"; open (PF, "$textfile")|| die "cannot open the file"; open (OF,">$OutPutFile")|| die "cannot open the file"; my $List1Ref=ReadDataInHash (*IF1); while (my $line=<PF>) { chomp($line); my @arrAbs=split (/\|/,$line); my $ID=$arrAbs[0]; my $Title=$arrAbs[1]; my $Abs=$arrAbs[2]; @arrAbs=split (/\./,$Abs); print OF"$ID|"; for (my $SentenceNumber=0;$SentenceNumber<=$#arrAbs ;$SentenceNumb +er++) { my $i=$SentenceNumber+1; print OF "<".$i.">"; my $Sentence=$arrAbs[$SentenceNumber]; my @arrAbsSen=split (' ',$Sentence); foreach my $word(@arrAbsSen) { #to match terms in the list, stored in %{$List1Ref}. if (exists(${%{$List1Ref}}{uc($word)})) { print OF "$word "; } else { foreach my $p (sort keys (%{$List1Ref})) { if (exists(${%{${%{$List1Ref}}{$p}}}{uc($word)})) +{ print OF "mainterm:$p:matchedterm:$word "; last; } } } } @arrAbsSen=(); } print OF "\n"; @arrAbs=(); } sub ReadDataInHash() { my $x = shift; my %list1=(); while (my $line =<$x>) { chomp $line; my @arr=split /\s/,$line; for (my $i=0;$i<=$#arr ;$i++) { if ($i==0) { $list1{$arr[$i]}={}; } else{ ${%{$list1{$arr[0]}}}{$arr[$i]} = 1; } } } return {%list1}; }
  • Comment on Error: "called too early to check prototype" and is word search using nested hash optimal?
  • Select or Download Code

Replies are listed 'Best First'.
Re: Error: "called too early to check prototype" and is word search using nested hash optimal?
by ikegami (Patriarch) on Jan 29, 2007 at 19:15 UTC

    You specified a prototype on a function (ReadDataInHash) that is called before it is declared.

    Since the prototype is wrong and unnecessary, change
    sub ReadDataInHash() { ... }
    to
    sub ReadDataInHash { ... }

Re: Error: "called too early to check prototype" and is word search using nested hash optimal?
by blazar (Canon) on Jan 30, 2007 at 15:22 UTC

    A few general remarks about your code:

    #!/usr/bin/perl use warnings; use strict;

    Good!

    if ($#ARGV != 4) { print "usage: run batch file 'run' not this one\n"; exit; } my $wordfile = $ARGV[0]; my $textfile=$ARGV[3]; my $OutPutFile=$ARGV[4];

    It doesn't seem to me that you use @ARGV[1,2] anywhere, and rather than relying on $#ARGV it would be clearer to rely on @ARGV's value in scalar context. Incidentally, I'd keep variable naming consistent and not mix camelcase with different conventions. Also, there's a specialized function for early exit from a program, thus all in all:

    die "Usage: run batch file 'run' not this one\n" unless @ARGV==3; my ($wordfile, $textfile, $outfile)=@ARGV;
    open (IF1,"$wordfile")|| die "cannot open the file"; open (PF, "$textfile")|| die "cannot open the file"; open (OF,">$OutPutFile")|| die "cannot open the file";

    The general rule of a thumb is to use low precedence logical operators for flow control. Also, we generally recommend using lexical filehandles and the three args form of open nowadays, althiough this is sometimes debated. Last, it would be nice to indicate in the error message which file failed to open and why. Oh, and there's no need to quote plain scalars. Thus:

    open my $wfh, '<', $wordfile or die "Can't open `$wordfile': $!\n"; open my $tfh, '<', $textfile or die "Can't open `$textfile': $!\n"; open my $ofh, '>', $outfile or die "Can't open `$outfile': $!\n";
    my $List1Ref=ReadDataInHash (*IF1);

    You only need the to read from the wordfile in that sub, thus I'd pass the filename instead. Also, I see that later on you only dereference $List1Ref, to use it like a regular hash, so why not use a hash from the beginning (and call it sensibly, e.g. %words) instead?

    my %words=ReadDataInHash($wordfile);

    I'll soon look into ReadDataInHash():

    sub ReadDataInHash()

    As others wrote, you're indicating an empty prototype, but that

    • differs from your actual usage (you pass an argument),
    • you use the sub before it's declared, which causes the error you're incurring into.

    Anyway prototypes are generally discouraged nowadays, except in certain situations. So just don't!

    { my $x = shift; my %list1=(); while (my $line =<$x>) { chomp $line; my @arr=split /\s/,$line; for (my $i=0;$i<=$#arr ;$i++) { if ($i==0) { $list1{$arr[$i]}={}; } else{ ${%{$list1{$arr[0]}}}{$arr[$i]} = 1; } } } return {%list1}; }

    IIUC, you read from a space separated file, and you want to create a HoH where the "primary key" is the first word of the line, and the "secondary one" is given the value of each of the remaining ones, and all the values are 1. As a comment to your code, it is at times rather clumsy because of excessive referencing and dereferencing, and you should know that in addition to C-style for loops you have Perl-style ones for iterating over a list. In this case I'd use the latter, for maximum clarity. Taking into account all that has been written above, I'd rewrite the sub like thus:

    sub ReadDataInHash { my $file=shift; open my $fh, '<', $file or die "Can't open `$file': $!\n"; my %words; while (<$fh>) { chomp; my ($first, @rest)=split; $words{$first}{$_}=1 for @rest; } %words; }

    Ok, here it seems you want to

    1. split on dots to get a list of "sentences",
    2. print a successive number,
    3. split each "sentence" on whitespace to get words,
    4. compare words with those got from your "words file" and print them if found as the main term or print an informative string if they are a synonim of some main term.

    I'm not entering into the details of your code, but

    1. removing unnecessary and not clarifying intermediate variables,
    2. simplifying some referencing-and-dereferencing acrobatics,
    3. removing some calls to exists (because the keys either do not exist or have a true value),
    4. using some lighter looping,

    I'd rewrite it like thus:

    Summing up, all in all I'd rewrite your script as follows:

    As a last note, since you only print to the output file ($ofh in my version of your script) you may have well wanted to select it. OTOH, since this is in your words "a pattern matching exercise", you may have wanted to save yourself the hassle of managing input and output files yourself too, and resort to the ARGV magic filehandle and print to STDOUT, just passing the words file as a first argument. This would simplify your exercise to:

    Here I didn't write a separate sub for reading from the words file into a hash, but I made a separate one for checking words, so that the main cycles can be simple enough to rely on default $_ without any risk of confusion. I also hid a loop into a map, which makes everything clearer IMHO. (Another minor change is that I removed the $title variable since you don't use it at all.) As a very last note, although many respectable Perl hackers would disagree, I think it would be clearer if instead of iterating over 0..$#sentences, one would maintain a counter him/herself:

      Thanks a lot Blazar for your detailed tips..:). That really helps me learn perl better. Thank you once again!