eMBR_chi has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Wise Perl Monks, I have a problem with pattern matching in an application I am trying to build. The script retrieves data from a database and pushes it into the appropriate arrays. One of those arrays contains abstracts from PubMed, and the other two are lists of keywords to search for in the abstracts. Where the script finds co-occurrence of those keywords in an abstract, it should write the information into some other database tables. My problem is that, apart from returning an error of 'use of uninitialised values in pattern matching', the pattern matching component of the script just does not work. Being inexperienced in Perl, I am unable to figure out what could be wrong. I would really appreciate your help. Thank you very much.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use lib'/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/DBD/mysql.pm';

#retrieve names of bacteria and contaminants and store in arrays
my $bactfilename = "Bacteria_names3.txt";
my @bact_names = get_data($bactfilename);
my $contaminantname = "Contaminant_names4use.txt";
my @cont_names = get_data($contaminantname);

#database details:
my $ds = "DBI:mysql:Pmeddata:localhost";
my $user = "root";
my $passwd = "******";

#connect to database, prepare and execute SQL
my $dbh = DBI->connect($ds,$user,$passwd) || die "Cannot connect to database!!";
my $sth = $dbh->prepare("SELECT pmid, abstract FROM PM_text WHERE title LIKE '%E.coli%'");
$sth->execute;

#arrays to hold pmid, and abstract
my @abst_listing;
my @abst_PMID;
while (my @abstracts = $sth->fetchrow_array()){
    push(@abst_PMID,$abstracts[0]);
    push(@abst_listing,$abstracts[1]);
}
$sth->finish;

my $sth2 = $dbh->prepare("INSERT INTO PM_bacteria (bactname, assoc_cont, pm_id) VALUES (?,?,?)");
my $sth3 = $dbh->prepare("INSERT INTO PM_cont (cont_name, pmed_id) VALUES (?,?)");

#WORKS WELL UP TO THIS POINT:THE FOLLOWING BIT DOES NOT WORK
#use nested 'for' loops for pattern matching
my $a;
my $b;
my $c;
for ($a=0; $a<= scalar(@abst_listing); $a++){
    for ($c=0; $c<= scalar(@cont_names); $c++){
        if($abst_listing[$a] =~ m/$cont_names[$c]/im){
            for($b=0; $b<= scalar(@bact_names); $b++){
                if($abst_listing[$a] =~ m/$bact_names[$b]/im){
                    #insert into database;
                    $sth2->execute($bact_names[$b],$cont_names[$c],$abst_PMID[$a]);
                    $sth3->execute($cont_names[$c],$abst_PMID[$a]);
                    print "matched at abst no $a , for $cont_names[$c] and $bact_names[$b]\n";
                }
            }
        }
    }
}
$sth2->finish;
$sth3->finish;
$dbh->disconnect;

sub get_data{
    my ($filename) = @_;
    unless (open(DATAFILE, $filename)){
        print "Could not open file $filename!!\n";
        exit;
    }
    my (@filedata) = <DATAFILE>;
    close(DATAFILE);
    return @filedata;
}

Re: Some problem pattern matching
by toolic (Bishop) on Apr 03, 2010 at 02:56 UTC
    The reason for the "uninitialized" warnings is that you have an off-by-one error in your for loops. They attempt to go beyond the size of the array. You can prove this to yourself by printing the value of $a, for example, inside your outermost loop. The quickest fix is to change <= to <. For example, change:
    for ($a=0; $a<= scalar(@abst_listing); $a++){

    to:

    for ($a=0; $a< scalar(@abst_listing); $a++){

    A more Perl-ish way to create this for loop is as follows:

    for my $a (0 .. $#abst_listing){

    Notice that I declared the loop variable in the for statement itself. You should do the same for your b and c loops as well.
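To see the off-by-one for yourself without the database, here is a minimal sketch. The three-element array is made-up sample data standing in for @abst_listing; the point is only that scalar(@array) is one greater than the last valid index $#array, so a "<=" condition visits one index too many.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for @abst_listing.
my @abst_listing = ('first abstract', 'second abstract', 'third abstract');

my $count      = scalar @abst_listing;   # number of elements: 3
my $last_index = $#abst_listing;         # highest valid index: 2

# Collect the indices each loop style visits.
my @visited_le;      # indices seen with the "<=" condition
for (my $i = 0; $i <= $count; $i++) {
    push @visited_le, $i;
}

my @visited_range;   # indices seen with 0 .. $#array
for my $i (0 .. $last_index) {
    push @visited_range, $i;
}

print "with <=    : @visited_le\n";      # includes index 3, which is past the end
print "with range : @visited_range\n";
print "element 3 is ", (defined $abst_listing[3] ? 'defined' : 'undef'), "\n";
```

The extra index yields undef, and matching undef against a pattern is what triggers the "uninitialized" warning.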

    If you had shown some actual data, it would be easier to figure out why you are not getting the matches you expect. Perhaps some other tips in the Basic debugging checklist will help you. If your arrays contain metacharacters, perhaps you need to use quotemeta for your regular expressions. You could try to chomp your arrays, then simplify your regexes by removing the //m modifiers.
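As a rough illustration of the chomp and quotemeta suggestions, assuming the keyword files hold one name per line: the keyword strings and the abstract below are invented for the example, and \Q...\E is the in-pattern form of quotemeta.

```perl
use strict;
use warnings;

# Hypothetical keyword list as it would come from <DATAFILE>: each entry
# still carries its trailing newline, and one contains regex metacharacters.
my @cont_names = ("E. coli O157:H7\n", "lead (Pb)\n");
my $abstract   = 'Samples contained lead (Pb) at low levels.';

# Before chomp, every pattern ends in "\n", so nothing matches this one-line text.
my @raw_hits = grep { $abstract =~ /$_/i } @cont_names;

chomp @cont_names;   # strip the trailing newline from every element

# \Q...\E applies quotemeta, so "(" and ")" are matched literally
# instead of being treated as a capture group.
my @hits = grep { $abstract =~ /\Q$_\E/i } @cont_names;

print "raw hits: ", scalar(@raw_hits), "\n";
print "hits after chomp and quoting: @hits\n";
```

Note that chomp alone would not be enough for "lead (Pb)": unquoted, the parentheses form a group and the pattern would look for "lead Pb".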

    As a side note, your get_data sub is very similar to File::Slurp::read_file.

    use lib'/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/DBD/mysql.pm';
    I don't think that does what you want. lib "is typically used to add extra directories to perl's search path", but you are trying to specify a module file. Perhaps this is what you meant:
    use lib '/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi';

    Update: fixed typo (thanks amedico).

      Also, an even better way to write the array loops is:
      for my $cont_name (@cont_names) {
          # do stuff with $cont_name
      }
      Rather than:
      for ($c=0; $c < scalar(@cont_names); $c++) {
          # do stuff with $cont_names[$c]
      }
      When you don't specifically need the index number, it's best to avoid it altogether. Not dealing with the array index number makes your code more readable, more maintainable, and less error-prone.
      Typo... I think you mean:

      for ($a=0; $a<scalar(@abst_listing); $a++){

      Thank you very much. The code works well now. I handled the off-by-one error as advised. I also hadn't chomped the arrays earlier. Thanks a lot.

Re: Some problem pattern matching
by amedico (Sexton) on Apr 03, 2010 at 06:14 UTC
    You can simplify your code considerably by using the DBI "fetchrow_hashref" function (which gets each row as a reference containing the column names as keys and the column values as values) instead of storing the column data to separate parallel arrays. Something like this:
    my $sth = $dbh->prepare(
        "SELECT pmid, abstract FROM PM_text " .
        "WHERE title LIKE '%E.coli%'"
    );
    $sth->execute;
    # fetchrow_hashref returns one row per call, so loop until it returns undef
    my @text_refs;
    while (my $text_ref = $sth->fetchrow_hashref()) {
        push @text_refs, $text_ref;
    }
    $sth->finish;
    ...
    for my $text_ref (@text_refs) {
        for my $cont_name (@cont_names) {
            if ($text_ref->{abstract} =~ m/$cont_name/im) {
                for my $bact_name (@bact_names) {
                    if ($text_ref->{abstract} =~ m/$bact_name/im) {
                        print "matched at text $text_ref->{pmid}, " .
                              "for $cont_name and $bact_name\n";
                        $sth2->execute(
                            $bact_name, $cont_name, $text_ref->{pmid}
                        );
                        $sth3->execute(
                            $cont_name, $text_ref->{pmid}
                        );
                    }
                }
            }
        }
    }
    Possibly you have a logic error here - the "$sth3->execute()" call depends only on the outer two loops, but it is occurring in the innermost loop. Thus if more than one bacteria name matches for a given abstract/contaminant pair, you will end up trying to insert duplicate records into that "PM_cont" table. Possibly the "$sth3->execute()" call should be moved out one level.
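One possible shape for moving that call out one level, sketched without a database: the arrays @bacteria_rows and @cont_rows are hypothetical stand-ins that record what $sth2 and $sth3 would insert, and the sample data is invented. It also assumes a contaminant row should still be written only when at least one bacterium co-occurs, which is a guess at the intended behaviour.

```perl
use strict;
use warnings;

# Hypothetical sample data; the arrays below record what would be inserted
# instead of calling $sth2->execute / $sth3->execute.
my @text_refs  = ({ pmid => 101, abstract => 'E. coli and Salmonella exposed to arsenic' });
my @cont_names = ('arsenic');
my @bact_names = ('E. coli', 'Salmonella');

my (@bacteria_rows, @cont_rows);

for my $text_ref (@text_refs) {
    for my $cont_name (@cont_names) {
        next unless $text_ref->{abstract} =~ /\Q$cont_name\E/i;

        my $bact_matched = 0;
        for my $bact_name (@bact_names) {
            next unless $text_ref->{abstract} =~ /\Q$bact_name\E/i;
            $bact_matched = 1;
            # one row per abstract/contaminant/bacterium triple (the $sth2 insert)
            push @bacteria_rows, [ $bact_name, $cont_name, $text_ref->{pmid} ];
        }

        # one row per abstract/contaminant pair (the $sth3 insert), and only
        # when at least one bacterium co-occurred (assumed requirement)
        push @cont_rows, [ $cont_name, $text_ref->{pmid} ] if $bact_matched;
    }
}

print scalar(@bacteria_rows), " bacteria rows, ", scalar(@cont_rows), " contaminant rows\n";
```

With two matching bacteria names, the inner loop records two bacteria rows while the contaminant row is recorded once.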

      Thanks amedico. I notice the improvement from using 'fetchrow_hashref'. Thanks. I'm currently 'having a go' at the 2nd suggestion, to see how it affects the entire scheme. Thanks a lot

Re: Some problem pattern matching
by jethro (Monsignor) on Apr 03, 2010 at 02:56 UTC

    Without the database it is difficult to run the script. So it might be better to help you debug than to just guess wildly. But nevertheless here is one wild guess: The line

    push(@abst_listing,$abstracts[1]);

    pushes into @abst_listing without checking the value in $abstracts[1]. Maybe @abstracts has only one value; that would push an undefined value into @abst_listing, eventually generating an error message like the one you have seen.
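A minimal sketch of guarding that push, assuming the undefined values come from NULL abstracts (DBI returns SQL NULL as undef). The @rows data is invented and stands in for what fetchrow_array would return; substituting an empty string is one option, skipping the row is another.

```perl
use strict;
use warnings;

# Hypothetical rows as fetchrow_array might return them; the second
# abstract is NULL in the database, which DBI hands back as undef.
my @rows = ([ 101, 'Abstract about E. coli' ], [ 102, undef ]);

my (@abst_PMID, @abst_listing);
for my $row (@rows) {
    my ($pmid, $abstract) = @$row;
    push @abst_PMID, $pmid;
    # Replace a NULL abstract with an empty string so later matches do not warn.
    push @abst_listing, (defined($abstract) ? $abstract : '');
}

print "abstract $abst_PMID[$_]: '", $abst_listing[$_], "'\n" for 0 .. $#abst_listing;
```

Keeping both pushes in the same loop iteration preserves the pairing between @abst_PMID and @abst_listing.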

    By the way, the error message you get normally tells the variable that is undefined as well as the line number. You should read those error messages with an alert mind and think about it, or if you ask questions somewhere, include this information.

    Now to debugging: There is a module called Data::Dumper which prints out data stored in variables of arbitrary complexity. Use that to print out values of variables, arrays, hashes and compare that with what you expect. For example try this:

    # at the top of your script:
    use Data::Dumper;
    ...
    my $a;
    my $b;
    my $c;
    # *** find out what the contents of your arrays is
    print Dumper(\@abst_listing,\@cont_names,\@bact_names);
    for ($a=0; $a<= scalar(@abst_listing); $a++){
        for ($c=0; $c<= scalar(@cont_names); $c++){

    The backslashes pass references to the arrays, so Dumper prints each array as a single structure rather than one entry per element, which makes the output easier to read. Run the script and maybe you will see something unexpected in one of your arrays. If you do, concentrate on the lines where this array was filled and try to imagine how such a result could have come about. Maybe put some more Dumper() calls there to check other variables. Or ask about it here; with such detailed information posted you should have a definite answer in minutes.
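For a feel of what to look for in the output, here is a small self-contained sketch with invented array contents. Setting $Data::Dumper::Useqq is optional; it makes trailing newlines show up as a visible "\n" escape.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show "\n" and other control characters as escapes

# Hypothetical contents: an un-chomped keyword and a missing abstract.
my @cont_names   = ("arsenic\n");
my @abst_listing = ('Abstract about arsenic', undef);

# Passing references keeps each array together as one $VARn structure.
my $dump = Dumper(\@cont_names, \@abst_listing);
print $dump;
```

In the dump, an un-chomped keyword appears with a trailing \n inside the quotes, and a missing abstract appears as a bare undef.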

      Thanks jethro. Indeed some of the variables being returned were undefs. Using Data::Dumper they became glaring. So I put in some code to turn the nulls into empty strings and that helped a lot. Thanks so much for this help.