Back to basics

I'm trying to capture the start and end values. With a simple file such as:

Regex used:
/^$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/

>hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD
-4.6 12 .. 35 xxxxGTGTGTGGTCT GC TTCAGTGACTTCGAGGCGCG GC AGCTGCTCCGAGTCC
-5.5 11 .. 36 xxxxxGTGTGTGGTC TGC TTCAGTGACTTCGAGGCGCG GCA GCTGCTCCGAGTCCT

I am able to capture the start and end values:

Dumper:
$VAR1 = 'hsa_circ_0075116|chr5:175956288-175956388-|NM_014901|RNF44 FORWARD';
$VAR2 = [
          {
            'end' => '35',
            'start' => '12'
          },
          {
            'end' => '36',
            'start' => '11'
          }
        ];

But when I make a slight amendment in the regex to account for the lines which begin with a whitespace such as:

New regex:
/^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ## addition of (\s+)? to the beginning

*\s*-5 56 .. 70 CTATGCCCCTTATTG TATCTG GGG CAGATG ATCGTCAAGTGAAGA

The start values become undefined:

$VAR125 = 'hsa_circ_0067224|chr3:128345575-128345675-|NM_002950|RPN1 FORWARD';
$VAR126 = [
            {
              'end' => '6',
              'start' => undef
            }

Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

Whole script so far:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Regexp::Common qw /number/;

open my $hairpin_file, '<', "new_xt_spacer_results.hairpin" or die $!;

my %HoA_sequences;
my $curkey;

while (<$hairpin_file>){
    chomp;
    if (/^>(\w+\d+\|\w+:\d+-\d+[-|+]\|\w+\|\w+\s+\w+$)/){
        $curkey = $1;
    }elsif (my ($start, $end) = /^(\s+)?$RE{num}{real}\s+(\d+)\s+\.\.\s+(\d+)\s*/ ) {
        die "value seen before header: '$_'" unless defined $curkey;
        push @{ $HoA_sequences{$curkey}}, { start=>$start, end=>$end };
    } else {
        die "don't know how to parse: '$_'"
    }
}

print Dumper(%HoA_sequences);

Re^3: Making use of a hash of an array...
by 1nickt (Canon) on Jul 19, 2017 at 20:47 UTC

    Are the brackets used for optional capture at the beginning of my regex confusing what is captured by my $start and $end variables?

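    Yep (you could have tested this yourself). The optional (\s+)? group is now capture group 1, so the two numeric captures shift down to $2 and $3. Your list assignment only takes the first two captures, so $start receives the leading whitespace (or undef when the line has none) and $end receives what used to be the start value.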
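
    A minimal sketch of that shift, runnable on its own. The $real pattern is only a simplified stand-in for $RE{num}{real} so that no CPAN module is needed, and capture_pair is a hypothetical helper that repeats the list assignment from the script above:

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{num}{real} (an assumption; avoids the CPAN dependency).
my $real = qr/[-+]?\d+(?:\.\d+)?/;

# Hypothetical helper: list-assign the captures the same way the original script does.
sub capture_pair {
    my ($line) = @_;
    # Three capture groups here, but only the first two land in $start and $end.
    my ($start, $end) = $line =~ /^(\s+)?$real\s+(\d+)\s+\.\.\s+(\d+)\s*/;
    return ($start, $end);
}

for my $line ('-4.6 12 .. 35 xxxx', ' -5 56 .. 70 CTAT') {
    my ($start, $end) = capture_pair($line);
    printf "start=%s end=%s\n",
        defined $start ? "'$start'" : 'undef',
        defined $end   ? "'$end'"   : 'undef';
}
```

    For the line without leading whitespace, $start comes back undef and $end holds 12, which matches the pattern seen in the Dumper output.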

    Since you don't need to capture the leading space, don't use parentheses (or, when you must use parentheses but don't want to capture the match, use, um ... non-capturing parentheses, like: (?:foo|bar)). Since you don't know whether there will be any leading whitespace, use the zero-or-more quantifier. Maybe you want:

    /^ \s* $RE{num}{real} \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x
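
    A quick way to check both suggestions against lines with and without leading whitespace. This is a sketch under assumptions: $real is a simplified stand-in for $RE{num}{real}, and parse_range / parse_range_nc are hypothetical helper names:

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{num}{real} (an assumption; avoids the CPAN dependency).
my $real = qr/[-+]?\d+(?:\.\d+)?/;

# Hypothetical helper using \s* : returns (start, end), or an empty list on no match.
sub parse_range {
    my ($line) = @_;
    return $line =~ /^ \s* $real \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x;
}

# Same idea with a non-capturing optional group instead of \s*.
sub parse_range_nc {
    my ($line) = @_;
    return $line =~ /^ (?:\s+)? $real \s+ (\d+) \s+ \. \. \s+ (\d+) \s* /x;
}

for my $line ('-4.6 12 .. 35 xxxx', ' -5 56 .. 70 CTAT') {
    my ($start, $end) = parse_range($line);
    print "start=$start end=$end\n";
}
```

    Either form leaves only two capture groups, so the ($start, $end) list assignment lines up again.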


    The way forward always starts with a minimal test.
      Aha, thanks.