cowboy007 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,
I have a Tab delimited file like below:-
Chr21 NT_113958 STS 76092 76265
Chr21 NT_113958 number=1
My code is doing like this:-
while(<FH>) {
    chomp($_);
    $_ =~ s/^\s+//;
    $_ =~ s/\s+$//;
    if(($_ =~ /\A(\S+)\t(\S+)\t(\S+)\t(\S+)\t(\S+)/xmg) || ($_ =~ /number\=(\S+)/xmg)) {
        print "$1\t$2\t$3\t$4\t$5\t$6\n";
    }
}
My output is coming like this:-
Chr21 NT_113958 STS 76092 76265
1
But I want it to be like this:-
Chr21 NT_113958 STS 76092 76265 1
Can somebody give me a clue about what I should do?
Thanks

Replies are listed 'Best First'.
Re: Formatting clue
by toolic (Bishop) on Feb 18, 2009 at 03:02 UTC
    If you put use warnings; at the top of your code you would get some hints that you have a problem.

    Your 2 regexes work for two different lines. The first line matches the left regex, and the first 5 special variables are set ($1 - $5). But, since $6 is not set from the 1st input line, you get a warning "Use of uninitialized value...".

    The 2nd line matches the right regex, but it only sets $1. Maybe you're looking for something more like this:

    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        s/^\s+//;
        s/\s+$//;
        my @tokens = split /\t/;
        if ((scalar @tokens) == 5) {
            print $_;
        }
        elsif ((scalar @tokens) == 3) {
            if (/number=(.*)/) { print "\t$1\n" }
        }
    }

    # not sure how many tabs you have in your 2nd line

    __DATA__
    Chr21	NT_113958	STS	76092	76265
    Chr21	NT_113958	number=1
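    The capture behaviour described above can be checked directly. This is only a minimal sketch: it assumes the input lines are tab-separated with three empty fields in the second line, it leaves off the OP's /xmg modifiers, and captures_for is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: run the OP's two alternative regexes against one line
# and report what ends up in $1 .. $6 ('undef' where nothing was captured).
sub captures_for {
    my ($line) = @_;
    if (   $line =~ /\A(\S+)\t(\S+)\t(\S+)\t(\S+)\t(\S+)/
        || $line =~ /number=(\S+)/ )
    {
        # The capture variables reflect only the last successful match.
        return map { defined $_ ? $_ : 'undef' } ( $1, $2, $3, $4, $5, $6 );
    }
    return;
}

print join( ' | ', captures_for("Chr21\tNT_113958\tSTS\t76092\t76265") ), "\n";
print join( ' | ', captures_for("Chr21\tNT_113958\t\t\t\tnumber=1") ), "\n";
```

    For the first line only $6 is undefined; for the second line only $1 is defined (it holds 1), which is why the value cannot appear in the sixth column of the same printed row.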
      Yeah, that's a great idea. Thanks.
Re: Formatting clue
by graff (Chancellor) on Feb 18, 2009 at 03:45 UTC
    If I understand the OP description and sample data correctly, I think toolic's solution with split would need to change slightly -- either here:
    my @tokens = split /\t+/; # split on one or more consecutive tabs
    or here:
    elsif ((scalar @tokens) == 6) { # if there are 6 fields (and 3 are empty)
    But of course you wouldn't want to make both changes, because that wouldn't work.
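    The difference between the two changes is easy to see by counting fields. A small sketch, assuming the second input line really has three empty fields between NT_113958 and number=1 (a guess about the OP's data, as noted above):

```perl
use strict;
use warnings;

# Assumed shape of the second input line: 2 values, 3 empty fields, number=1
my $line = "Chr21\tNT_113958\t\t\t\tnumber=1";

my @single = split /\t/,  $line;    # interior empty fields are kept
my @runs   = split /\t+/, $line;    # a run of tabs counts as one separator

printf "split on /\\t/  gives %d fields\n", scalar @single;
printf "split on /\\t+/ gives %d fields\n", scalar @runs;
```

    With this input the first split yields 6 fields and the second yields 3, so the field-count test has to match whichever split pattern is used.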

    I think it can be risky to base a solution on just two lines of sample data. In a case like this, we can hope that data lines always come in pairs, that each pair always has the same values in the first two columns, that the first of each pair always has 5 adjacent non-empty fields, that the second always has the 2 "repeated" fields, 3 empty fields and "number=\d+" in a 6th field, that there aren't extra spaces next to any of the field-delimiting tabs, and so on. Wouldn't that be nice...

    The question is, what sorts of "deviations" from those patterns do you need to worry about, and what should the script do when those sorts of things pop up (as they almost certainly will)? Just guessing:

    use strict;
    use warnings;

    my @comp;
    # open FH in some suitable way...
    while(<FH>) {
        s/^\s+//;
        s/\s+$//;
        my @flds = split( / *\t */ );  # tabs might have spaces around them
        if ( @flds == 5 ) {  # presumably first line of pair
            warn "Input line $. replaces previous first-line data: @comp\n"
                if ( @comp );
            @comp = @flds;
        }
        elsif ( @flds == 6 and $flds[5] =~ /number=(\d+)/ and
                $flds[0].$flds[1] eq $comp[0].$comp[1] ) {
            push @comp, $1;
            print join( "\t", @comp ), "\n";
            @comp = ();
        }
        else {
            warn "Input line $. ignored: $_\n";
        }
    }
Re: Formatting clue
by scott\b (Novice) on Feb 18, 2009 at 14:51 UTC

    As a rule, don't trust data. Or rather, trust but verify.

    I recommend that you put in sanity checks to confirm that the first line, the last line, and everything in between are what you expect. For example, what happens if the first column does not match the pattern /Chr\d+/ (aka chromosome 21), and you get ChrX, or Chr?, or just ? instead? What happens if, instead of a reference sequence ID (e.g. NT_113958) that you can pass on to the genome browser, you get something unique to the research organization providing the data (e.g. adhoc_1234)? Those are just two examples of gotchas that bit me when processing supposedly well-formed genetics data.

    I raise specific issues around bioinformatics, but the mindset applies to all data processing. If you have assumptions about your source data, have your program verify them.
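    A minimal sketch of what such checks might look like for the five-field lines in this thread. The accepted patterns (Chr followed by digits, X or Y; an NT_ accession; numeric columns 4 and 5 read as start and end) are assumptions made for illustration, not rules taken from the OP's data source, and check_fields is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical validator: returns a list of problems found in one record.
# Every pattern below is an illustrative assumption; adjust to the real data.
sub check_fields {
    my @f = @_;
    return ( "expected 5 fields, got " . scalar @f ) unless @f == 5;

    my @problems;
    push @problems, "unexpected chromosome '$f[0]'"
        unless $f[0] =~ /\AChr(?:\d+|X|Y)\z/;
    push @problems, "unexpected sequence id '$f[1]'"
        unless $f[1] =~ /\ANT_\d+\z/;
    if ( $f[3] !~ /\A\d+\z/ or $f[4] !~ /\A\d+\z/ ) {
        push @problems, "non-numeric coordinates '$f[3]' / '$f[4]'";
    }
    elsif ( $f[3] > $f[4] ) {
        push @problems, "start $f[3] is after end $f[4]";
    }
    return @problems;
}

for my $line ( "Chr21\tNT_113958\tSTS\t76092\t76265",
               "Chr?\tadhoc_1234\tSTS\t10\t5" ) {
    my @problems = check_fields( split /\t/, $line );
    print @problems
        ? "BAD: " . join( '; ', @problems ) . "\n"
        : "ok\n";
}
```

    Returning a list of problems rather than dying on the first one lets the caller decide whether to warn and skip the record or to stop the run.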

    Scott\b
Re: Formatting clue
by Anonymous Monk on Feb 18, 2009 at 06:54 UTC
    hi friend, Try out this one
    open(FILE,"txt2.txt") or die $!;
    while(<FILE>){
        if($_ =~/(\S+)\t(\S+)\t(\S+)\t(\S+)\t(\S+)/ || $_ =~/number=(\S+)/){
            print "$1\t$2\t$3\t$4\t$5\t$6\t";
        }
    }
      Does that produce the desired output? If warnings are turned on, does it run without warnings?