Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Can somebody please tell me what the best Perl Regular expression would be to match:

14 ASN1 YGR124W ASN1 YPR145W

With the columns 2 and 3 (or similar) being repeated in this form up to 45 times. The shown case is x 2 repetition. The spaces are tabs. The number 14 could be any number upto 25000. The words can be any numerical letter combination including the characters '-' or '(' or ')' or a letter only combination.

Replies are listed 'Best First'.
Re: Perl Regex
by pbeckingham (Parson) on Jul 08, 2004 at 12:55 UTC

    How about:

    my @matches = $string =~ / ^ # beginning of line (\d{1,5}) # 1-5 digit number \s+ # gap (?: # group ([A-Z0-9()-]{7}) # 7-char thing \s+ # gap ([A-Z0-9()-]{7}) # another 7-char thing ){1,45} # 7-char thing # up to 45 groups $ # end of line /gx;
    Update: Added /g to regex. Oops.

      This looks good, except that I want to be able to do the following:

      Convert:

      14 ASN1 YGR124W ASN1 YPR145W

      to :

      14 1 ASN1 YGR124W

      14 2 ASN1 YPR145W

      or whatever variation is possible given the previously stated row possibilities.

        Okay, then perhaps this:

        my @matches = $string =~ / ^ # beginning of line (\d{1,5}) # 1-5 digit number \s+ # gap (?: # group ([A-Z0-9()-]{7}) # 7-char thing \s+ # gap ([A-Z0-9()-]{7}) # another 7-char thing ){1,45} # 7-char thing # up to 45 groups $ # end of line /gx; my $first = shift @matches; my $count = 1; while (@matches) { print $first, ' ', $count, ' ', shift @matches, ' ', shift @matches, "\n"; $count++; }

        my ($marker, @pairs) = split; my $c = 0; while (@pairs) { ++$c; my ($thing1, $thing2) = splice(@pairs, 0, 2); print "$marker $c $thing1 $thing2\n"; }

        We're not really tightening our belts, it just feels that way because we're getting fatter.
•Re: Perl Regex
by merlyn (Sage) on Jul 08, 2004 at 15:11 UTC
    Well, /.*/ will match it. But you haven't said what you want to do with the rest.

    Part of the problem with regular expressions is just being able to say what it is you want to do. You need to solve that before we can help with a more detailed match.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I want to do something along the lines of
      #! perl -w scipt use strict; use warnings; # open a file for reading in the current directory open(FILEHANDLE,"< input_file.txt"); open(OUTFILE,"+> outfile_file.txt"); my $record_count = 0; while (<FILEHANDLE>){ if ($_ =~ /^(\d{1,5})\s+ (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} (?:([A-Z0-9()-]{2,15})\s+([A-Z0-9()-]{7})){0,1} $/gx){ $record_count++; my $1st = $1; my $2st = $2; my $3st = $3; my $4st = $4; my $5st = $5; my $6st = $6; my $7st = $7; my $8st = $8; my $9st = $9; my $10st = $10; my $11st = $11; my $12st = $12; my $13st = $13; my $14st = $14; my $15st = $15; my $16st = $16; my $17st = $17; my $18st = $18; my $19st = $19; my $20st = $20; my $21st = $21; my $22st = $22; my $23st = $23; my $24st = $24; my $25st = $25; my $26st = $26; my $27st = $27; my $28st = $28; my $29st = $29; my $30st = $30; my $31st = $31; my $32st = $32; my $33st = $33; my $34st = $34; my $35st = $35; my $36st = $36; my $37st = $37; my $38st = $38; my $39st = $39; my $40st = $40; my $41st = $41; my $42st = $42; my $43st = $43; my $44st = $44; my $45st = $45; my $46st = $46; my $47st = $47; my $48st = $48; my $49st = $49; my $50st = $50; my $51st = $51; my @array_of_gene_records = ($2st, $3st, $4st, $5st, $6st, $7st, + $8st, $9st, $10st, $11st, $12st, $13st, $14st, $15st, $15, $16st, $1 +7st, $18st, $19st, $20st, $21st, $22st, $23st, $24st, $25st, $26st, $ +27st, $28st, $29st, $30st, $31st , $32st , $33st, $34st, $35st, $36st +, $37st, $38st, $39st, $40st, $41st, $42st, $43st, $44st, $45st, $46s +t, $47st, $48st, $49st, $50st, $51st); my $subrecord_id = 1; foreach (@array_of_gene_records){ print OUTFILE "$record_count\t$1st\t$subrecord_id\t$_\n"; $subrecord_id++; } } }

        Wow. That's very simliar to what I was trying to accompish just yesterday in the CB.

        See dcvr69's scratchpad for how I solved my particular problem... I think it would apply to you situation as well with a little tweaking.

        If someone finds a way to do it in a single match I'd love to know about it. I couldn't find a way to use (?:()())+ to return multiple captures to an array. /g wouldn't work because of a prefix capture earlier in the pattern.

        Basically, my (working but not happy with it) solution was to capture the prefix and the remainder in one match, and then do the variable element capture on the remainder with /g in array context.

        I just feel like I'm missing something about RE that would allow both steps in one match/capture, without resorting to sub calls inside the pattern or other hacks that would end up being worse that the two stage solution I have now.

      Please refer to my other post in addition to the first.
Re: Perl Regex
by gellyfish (Monsignor) on Jul 08, 2004 at 11:55 UTC

    What have you tried and found not to work?

    /J\

      Excel.

        Heh!

        The first problem I have in dealing with your question is to know what you want to do with. You want to isolate the first number in the line, that's easy enough with

        /^(\d+)/

        But then for the rest, who knows? Do you want the pairs stored in an array of arrays, do you just want to know how many there are (e.g. up to 45). Are there additional constraints on valid tokens?

        e.g. is it always asn followed by a digit in which case /asn\d+/ will match that. If you see asn1 once, will you see it for the rest of the line and nothing else? Is the second part always letter letter letter digit digit digit letter? (update: oops, I missed your explanation of what they contain at a first glance, but the specification is still a little on the vague side).

        At a guess I would say these are postcodes. What are you trying to do, maybe someone has already solved the problem and a module has been written that will do what you want.

        - another intruder with the mooring of the heat of the Perl