Nicpetbio23! has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am new to Perl scripting. Is it possible to create a script that counts the number of times that each string in one array matches a regex containing that string in another array? Thanks!

Replies are listed 'Best First'.
Re: Counting matches
by AnomalousMonk (Archbishop) on May 28, 2017 at 23:42 UTC
Re: Counting matches
by kcott (Archbishop) on May 29, 2017 at 05:56 UTC

    G'day Nicpetbio23!,

    Welcome to the Monastery.

    As already pointed out, you'll need to be more specific to get the best answers; however, the following might get you started.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dump;

    my @src_strings = qw{A ABA ABBCCC AAABBCD BCDD CCD D};
    my @re_strings = qw{A B C};

    my %count;

    for my $src (@src_strings) {
        for my $re (@re_strings) {
            $count{$src}{$re}++ while $src =~ /$re/g;
        }
    }

    dd \%count;

    Output:

    {
      A => { A => 1 },
      AAABBCD => { A => 3, B => 2, C => 1 },
      ABA => { A => 2, B => 1 },
      ABBCCC => { A => 1, B => 2, C => 3 },
      BCDD => { B => 1, C => 1 },
      CCD => { C => 2 },
    }

    Without seeing your data, that solution could be terribly slow, or may not work (in some way or another).
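
One way the loop above might not work, assuming the search strings are meant literally rather than as patterns: characters such as `.` are regex metacharacters when interpolated. The following is a minimal sketch of the difference, using the built-in quotemeta function; the sample strings and the hash names %as_regex and %as_literal are made up for illustration and are not from the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data: 'a.c' contains a regex metacharacter.
my @src_strings = ('a.c', 'abc', 'a.ca.c');
my @literals    = ('a.c');

my (%as_regex, %as_literal);

for my $src (@src_strings) {
    for my $lit (@literals) {
        # Interpolated as-is: '.' matches any character.
        $as_regex{$src}{$lit}++ while $src =~ /$lit/g;

        # Escaped with quotemeta: '.' matches only a literal dot.
        my $quoted = quotemeta $lit;
        $as_literal{$src}{$lit}++ while $src =~ /$quoted/g;
    }
}

for my $src (@src_strings) {
    printf "%-8s regex=%d literal=%d\n", $src,
        $as_regex{$src}{'a.c'}   // 0,
        $as_literal{$src}{'a.c'} // 0;
}
```

Here 'abc' is counted once by the unescaped pattern but not at all by the escaped one. Whether escaping is wanted depends on whether the second array really holds patterns or plain strings.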

    See also: perlintro (and the links it provides to more detailed information); and Data::Dump (the core Data::Dumper module provides equivalent functionality).

    — Ken

Re: Counting matches
by shmem (Chancellor) on May 29, 2017 at 07:56 UTC

    Applying each regex in one array to each string in another array:

    my @s; # string array
    my @r; # regex array
    my %m; # matches hash

    %m = map {
        my $k = $_;
        $_, { map { my $s =()= $k=~/$_/g; $_,$s } @r };
    } @s;

    The =()= thingy? It forces list context on the right-hand side; when the left-hand side is a scalar, that scalar receives the number of elements in the list. See also Re: Hidden features of Perl (More Secret Operator References)
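
A small sketch of that behaviour in isolation may help; the variable names and the sample string are chosen here for illustration only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $text = 'AAABBCD';

# List assignment evaluated in scalar context yields the number of
# right-hand-side elements, here the number of /g matches.
my $count_a = () = $text =~ /A/g;

# Without the empty list, the match runs in scalar context and only
# reports whether it succeeded.
my $matched = $text =~ /A/;

# With no matches at all, the count is 0.
my $count_z = () = $text =~ /Z/g;

print "A: $count_a, matched: $matched, Z: $count_z\n";
# prints "A: 3, matched: 1, Z: 0"
```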

    For

    counts the number of times that each string in one array matches a regex containing that string in another array

    which I read as matching the regex against the string, just swap the arrays.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Counting matches
by KurtZ (Friar) on May 28, 2017 at 23:45 UTC
    Yes, it's possible. But your question is fuzzy.

    Could you show us what you tried, or at least some sample data?

Re: Counting matches
by AnomalousMonk (Archbishop) on May 29, 2017 at 13:17 UTC

    A couple of nice stabs at answers from kcott++ and shmem++, but I think Nicpetbio23! has wandered away!


    Give a man a fish:  <%-{-{-{-<

      Hi, I want to read the file Genomes_used_Hant.txt into an array and look for each element in the array in NRT2.txt and return a count. For example Gloin1 is at line 24 in the Genomes_used_Hant.txt:
      Gloin1
      Below is an example of what is in the NRT2.txt file.
      >Gloin1_46659
      MVKLFARPLPIDP....
      >Gloin1_30454
      MIKLFDKPSKELS....
      I would like to return the following in an output file for each element in the array. Since this is not an exact match I expect I need to use a regex.
      Gloin1: 2 occurrences in NRT2.txt

        Here is an SSCCE which matches your spec. Enjoy.

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Test::More;

        my @hant = ('Gloin1');
        my @counts = (2);
        plan tests => scalar @hant;

        my $nrt2 = <<EOT;
        >Gloin1_46659
        MVKLFARPLPIDP....
        >Gloin1_30454
        MIKLFDKPSKELS....
        EOT

        for my $pat (@hant) {
            my $matches = () = $nrt2 =~ /$pat/g;
            is ($matches, shift @counts, "Correct number of matches found for $pat");
        }

        Now that you've provided an indication of your data, a much better solution (than my earlier tentative suggestion) presents itself.

        Assuming you have a filehandle, e.g. $matches_fh, to your file of match data (Genomes_used_Hant.txt in your example); and another, e.g. $fasta_fh, to your fasta data (NRT2.txt in your example); you can capture the wanted counts like this:

        my $alt = join '|', reverse sort <$matches_fh>;
        my $re = qr{(?x: ^ > ( $alt ) )};
        my %count;
        /$re/ && ++$count{$1} while <$fasta_fh>;

        The code you've presented, in a couple of your posts in this thread, uses the 3-argument form of open with lexical filehandles: this is very good. You are not, however, checking for I/O errors: this is not good at all. The easiest method is to let Perl do this checking for you with the autodie pragma; the alternative is to do this yourself, as shown in the open documentation.
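
The two approaches might look like this. It is only a sketch: a temporary file stands in for the real input, and the names $path, $missing, $headers and $error are invented for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw{tempfile};

# Hypothetical stand-in for a real input file such as NRT2.txt.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} ">Gloin1_46659\n>Gloin1_30454\n";
close $tmp_fh or die "close: $path: $!";

my $headers = 0;
{
    # Inside this block, open and close throw an exception on failure,
    # so no explicit "or die" is needed.
    use autodie;
    open my $fh, '<', $path;
    /^>/ && ++$headers while <$fh>;
    close $fh;
}

# The manual alternative: test the return value and report $!.
my $missing = "$path.does-not-exist";
my $error   = '';
open(my $fh2, '<', $missing)
    or $error = "Can't open '$missing': $!";

print "headers: $headers\n";
print "$error\n";
```

In a real script the manual form would normally end in `or die ...`; the error is captured in a variable here only so that the example runs to completion.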

        In the test code below, I've used Inline::Files purely for convenience. The count information is in %count: you can format and output this however you want.

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Data::Dump;
        use Inline::Files;

        my $alt = join '|', reverse sort <MATCHES>;
        my $re = qr{(?x: ^ > ( $alt ) )};
        my %count;
        /$re/ && ++$count{$1} while <FASTA>;

        dd \%count;

        __MATCHES__
        Gloin1
        XYZ1
        XYZ
        XYZ12
        __FASTA__
        >Gloin1_1
        unwanted data
        >XYZ_1
        unwanted data
        >XYZ12_1
        unwanted data
        >XYZ1_2
        unwanted data
        >XYZ1_1
        unwanted data
        >XYZ12_3
        unwanted data
        >Gloin1_2
        unwanted data
        >XYZ12_2
        unwanted data

        Output:

        { Gloin1 => 2, XYZ => 1, XYZ1 => 2, XYZ12 => 3 }
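
The post does not say why the alternation is built with reverse sort. A likely reason is that Perl tries alternatives from left to right and takes the first that matches, so a name that is a prefix of another (XYZ and XYZ12) has to come after the longer one. The sketch below illustrates this; @names and $line are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @names = qw{XYZ XYZ1 XYZ12};
my $line  = '>XYZ12_1';

# Ascending sort puts the short prefix first, so it wins.
my $asc = join '|', sort @names;
my ($hit_asc) = $line =~ /^>($asc)/;

# Reverse sort tries the longer names before their prefixes.
my $desc = join '|', reverse sort @names;
my ($hit_desc) = $line =~ /^>($desc)/;

print "sort: $hit_asc\n";          # prints "sort: XYZ"
print "reverse sort: $hit_desc\n"; # prints "reverse sort: XYZ12"
```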

        — Ken

        Something along these lines then:

        use strict;
        use warnings;

        my $inputFile = q{/path/to/NRT2.txt};
        open my $inputFH, q{<}, $inputFile
            or die qq{open: < $inputFile: $!\n};

        my %occurs;
        while ( <$inputFH> )
        {
            $occurs{ $1 } ++ if m{^>([^_]+)}
        }

        close $inputFH
            or die qq{close: < $inputFile: $!\n};

        print qq{$_: $occurs{ $_ } occurrences in $inputFile\n}
            for sort keys %occurs;

        A more comprehensive example of your input file would be needed to be sure of the solution.

        Update: Too simplistic, ignore this as examples of both files are needed before making a stab at a solution.

        Cheers,

        JohnGG