trudy123g has asked for the wisdom of the Perl Monks concerning the following question:

I have a tab-delimited information file, e.g.:
THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL
I would really like to be able to pick out the letters from column 1 that correspond to specific combinations from the string in column 2. e.g. for ".LLL"
THAETURQU .LLL..RRR THAE JURYATAUTIOPW .LLL...LL.RRR JURY TURIWOWQNQQUTYRIOPWMF L..L..LLL.LL.R..L.LLL OWQN,PWMF
Is there an easy way to provide a list of combinations, such as a file:
.LLL LLLL .RRR RRRR
that would output the corresponding letters in from column 1 in subsequent columns headed, ".LLL", "LLLL", ".RRR", "RRRR"?

I suspect this is really simple, but it would be really helpful, thanks!

Replies are listed 'Best First'.
Re: Extracting string based on comparison to second string?
by kennethk (Abbot) on May 19, 2010 at 21:36 UTC
    What have you tried? What didn't work? See How do I post a question effectively?. In general, we can be more helpful when we can see where you've had difficulty and can gauge your experience.

    As I understand your spec, you can use regular expressions to locate where in your masks your codes fall and then use the special variables @- and @+ along with substr to extract the appropriate characters. Something like:

    #!/usr/bin/perl use strict; use warnings; my @searches = qw( .LLL LLLL .RRR RRRR ); my @file = <DATA>; chomp @file; for my $search (@searches) { for my $line (@file) { my ($string, $mask) = split /\s+/, $line, 2; my @matches; while ($mask =~ /\Q$search\E/g) { my $start = $-[0]; my $end = $+[0]; push @matches, substr $string, $start, $end - $start; } local $" = ","; print "$line\t\t@matches\n"; } } __DATA__ THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL

    Note I used \Q and \E to escape metacharacters since . has special meaning in regular expressions. See perlretut on info on how to build regular expressions.

Re: Extracting string based on comparison to second string?
by ikegami (Patriarch) on May 19, 2010 at 22:38 UTC
    Should "LL" result in
    THAETURQU .LLL..RRR HA,AE JURYATAUTIOPW .LLL...LL.RRR UR,RY,UT TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL WQ,QN,NQ,MN,NH,HF
    or
    THAETURQU .LLL..RRR HA JURYATAUTIOPW .LLL...LL.RRR UR,UT TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL WQ,NQ,MN,HF

    kennethk assumed the latter. The following handles the former:

    #!/usr/bin/perl use strict; use warnings; my @patterns = qw( .LLL LLLL .RRR RRRR LL ); chomp( my @file = <DATA> ); for my $pat (@patterns) { print("pattern = $pat\n"); our $string; our @matches; my $re = do { use re 'eval'; qr/ \Q$pat\E (?{ my $start = $-[0]; my $end = $+[0]; push @matches, substr($string, $start, $end - $start); }) (?!) /x }; for my $line (@file) { (local $string, my $mask) = split(/\s+/, $line, 2); local @matches; $mask =~ $re; local $" = ","; print "$line\t@matches\n"; } print("\n"); } __DATA__ THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL
    pattern = .LLL THAETURQU .LLL..RRR THAE JURYATAUTIOPW .LLL...LL.RRR JURY TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL OWQN,WMNH pattern = LLLL THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL WQNQ,MNHF pattern = .RRR THAETURQU .LLL..RRR URQU JURYATAUTIOPW .LLL...LL.RRR IOPW TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL pattern = RRRR THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL pattern = LL THAETURQU .LLL..RRR HA,AE JURYATAUTIOPW .LLL...LL.RRR UR,RY,UT TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL WQ,QN,NQ,UT,MN,NH,HF
Re: Extracting string based on comparison to second string?
by johngg (Canon) on May 19, 2010 at 23:01 UTC

    This solution is similar to kennethk's but uses a look-ahead in the regular expression and pos to find where the match occurred rather than special variables.

    use strict; use warnings; use 5.010; my @searches = qw{ .LLL LLLL .RRR RRRR }; chomp( my @lines = <DATA> ); foreach my $search ( @searches ) { my $rxSearch = qr{(?=\Q$search\E)}; my $length = length $search; foreach my $line ( @lines ) { my ( $string, $mask ) = split m{\s+}, $line, 2; say qq{Looking for '$search' in '$mask'}; while ( $mask =~ m{$rxSearch}g ) { my $pos = pos $mask; my $ind = q{ } x length $mask; substr $ind, $pos, $length, q{^} x $length; say qq{ Found at offset $pos}; say qq{ $string\n $mask}; say qq{ $ind - @{ [ substr $string, $pos, $length ] }} +; } } } __DATA__ THAETURQU .LLL..RRR JURYATAUTIOPW .LLL...LL.RRR TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL

    The output.

    Looking for '.LLL' in '.LLL..RRR' Found at offset 0 THAETURQU .LLL..RRR ^^^^ - THAE Looking for '.LLL' in '.LLL...LL.RRR' Found at offset 0 JURYATAUTIOPW .LLL...LL.RRR ^^^^ - JURY Looking for '.LLL' in 'L..L..LLLL.LL.R..L.LLLL' Found at offset 5 TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL ^^^^ - OWQN Found at offset 18 TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL ^^^^ - WMNH Looking for 'LLLL' in '.LLL..RRR' Looking for 'LLLL' in '.LLL...LL.RRR' Looking for 'LLLL' in 'L..L..LLLL.LL.R..L.LLLL' Found at offset 6 TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL ^^^^ - WQNQ Found at offset 19 TURIWOWQNQQUTYRIOPWMNHF L..L..LLLL.LL.R..L.LLLL ^^^^ - MNHF Looking for '.RRR' in '.LLL..RRR' Found at offset 5 THAETURQU .LLL..RRR ^^^^ - URQU Looking for '.RRR' in '.LLL...LL.RRR' Found at offset 9 JURYATAUTIOPW .LLL...LL.RRR ^^^^ - IOPW Looking for '.RRR' in 'L..L..LLLL.LL.R..L.LLLL' Looking for 'RRRR' in '.LLL..RRR' Looking for 'RRRR' in '.LLL...LL.RRR' Looking for 'RRRR' in 'L..L..LLLL.LL.R..L.LLLL'

    I hope this is useful.

    Cheers,

    JohnGG

    Update: Corrected typo.

      Thanks guys - really kind of you! The output from JohnGG's first example was best, I can deal with that. I've a lot still to learn!