Hi all, I have a question about pattern matching. I have several sets of samples, such as CGPP5048286, WGA_PD4005a, 1710STDY5035576, PD4005a, PD4005b and PD4005c. I have a working script that finds the identity between the samples. I want to pattern match the sample names to identify samples that are the same or identical, e.g. PD4005b and PD4005c should be treated as matching.

for (my $j = 0; $j < scalar(@sam2com); $j++){   # '<' rather than '<=' so the loop does not run past the last element

    my $s1 = $sam2com[ $j ] ;
    my $geno1 = $source_set->{$s1};
    
    my $top_percent = 0;
    my $top = '';
    for (my $k = 0; $k < scalar(@sam2com); $k++){   # '<' rather than '<=' so the loop does not run past the last element
        
        my $match = 0;
        my $s2 = $sam2com[ $k ];
        my $geno2 = $source_set->{$s2};

        my $set = Array::Each->new(\@$geno1, \@$geno2);

        while (my($g1, $g2, $index2) = $set->each() ){

            #print "$s1|$s2|$g1|$g2|$index2\n";
            next if $g1 eq "" || $g2 eq "";
            next if $g1 =~ /^NN/i || $g2 =~ /^NN/i;
            if($g1 eq $g2){

                $match++;
                #print "$s1|$s2|$g1|$g2|$match|$index2\n";
            }#end of if $g1 eq $g2
        }#end of while loop $set->each
        
        my $percentage = sprintf "%.2f", ($match * 100)/( scalar @$geno1 ) ; 
        
        print SUM $percentage, ",";
        next if ($percentage < 75);
        #print "$s1|$s2|$percentage\n";
        # Track the best-matching sample other than $s1 itself.
        if( ( $percentage >=$top_percent) and ($s2 ne $s1 ) ){
        
            $top_percent= $percentage;
            $top = $s2;

        } #end of if $percentage >= $top_percent and $s2 ne $s1
        
        push @{ $com_sam->{ $s1 }->{ $percentage } }, {
            sample  =>$s2,
            percent =>$percentage,
            match   =>$match


        };


        my $ge1 = join "", @$geno1;
        my $ge2 = join "", @$geno2;

        if( ( $ge1 eq $ge2 ) and ( $s1 ne $s2 ) ) {

            print LOG "$s1|$s2|$ge1  |  $ge2\n";
            
        }

    }#end of for $k sam2com
    
    print SUM "\n";

    # Sort by percentage in descending order. Get the samples matching the other
    # sample, the percentage of match and the top hit.
    foreach my $percent ( sort { $b <=> $a } keys %{ $com_sam->{ $s1 } } ){
    
        my $match_samples = $com_sam->{ $s1 }->{ $percent };
        
        foreach my $matSam( @ { $match_samples } ){
            
            # Check whether sample1 matches a different sample with a higher percentage.
            if( ( $s1 ne $matSam->{ sample } ) and ($matSam->{ percent } >= $top_percent) ) {

                 print LOG  "Sample $s1 matches with $matSam->{ sample } with $matSam->{ percent }\n";
                
                
                
                
            }    #else{

                
                my $l = sprintf "%s, %s, %0.2f, %s, %0.2f ", $s1, $matSam->{ sample }, $matSam->{ percent }, $top, $top_percent;
                print OUT $l,"\n";

            #}
        }#end of foreach $match_samples

    }#end of percentage foreach loop

}#end of for $j @sam2com.

I don't want to include the samples PD4005a, PD4005b, PD4005c or WGA_PD4005a|b|c in the LOG file, since they are identical samples. Is there any way of doing this? I tried

my ($n, $m, $o) = $s1 =~ /^(PD|WGA_PD)(\d+)(a|b|c)/;
my ($n1, $m1, $o1) = $matSam->{ sample } =~ /^(PD|WGA_PD)(\d+)(a|b|c)/;

and compared $m == $m1. I also tried /(\w+)(a|b|c)/ and compared $1 of one sample with $1 of the other. But surely I am making some stupid mistake, as those samples still show up in the LOG file.
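To make the intent concrete, here is a minimal sketch of the comparison I am after. It is not taken from my script: base_id and same_base are hypothetical helper names, and it assumes the names of interest are an optional WGA_ prefix, then PD and digits, then a single trailing a, b or c. One thing it tries to guard against is that a failed match in list context leaves $m and $m1 undefined, so a plain $m == $m1 would also be true for two names that do not follow the PD pattern at all.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): reduce a sample name
# to its base ID, e.g. "WGA_PD4005a" -> "PD4005".
# Returns undef for names that do not follow the assumed PD pattern.
sub base_id {
    my ($name) = @_;
    return $name =~ /^(?:WGA_)?(PD\d+)[abc]$/ ? $1 : undef;
}

# True only when BOTH names parse and share the same base ID, so two
# unparsable names (e.g. CGPP5048286 and 1710STDY5035576) are never
# treated as the same sample by accident.
sub same_base {
    my ($s1, $s2) = @_;
    my ($b1, $b2) = ( base_id($s1), base_id($s2) );
    return defined $b1 && defined $b2 && $b1 eq $b2;
}

# Small demonstration on the sample names from the question.
for my $pair ( [ 'PD4005b',     'PD4005c' ],
               [ 'WGA_PD4005a', 'PD4005b' ],
               [ 'PD4005a',     'CGPP5048286' ],
               [ 'CGPP5048286', '1710STDY5035576' ] ) {
    my ($s1, $s2) = @$pair;
    printf "%s vs %s: %s\n", $s1, $s2,
        same_base($s1, $s2) ? 'same sample, skip' : 'different, log';
}
```

With something like this, the LOG print inside the loop could be guarded with an extra condition such as "and not same_base($s1, $matSam->{ sample })", assuming that print is the one producing the unwanted lines.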

Any suggestions, please? Thanks.

In reply to pattern match with different sets. by Anonymous Monk
