ghosh24 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody..Here's an interesting problem to solve. I have a text file like this:
>first
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGGGTGATTCGGTACGAATGATTCG
GTTACCAGAACTTACCGAAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTTGTTTTT
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA
>firsta
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGG----------------------
-----------------AAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTT------
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA

Both >first and >firsta contain the same characters except the part with hyphens. Now is it possible to write a perl script that would extract the text starting after >firsta and before the start of - for each line? Also, would it be possible to extract the unmatched text from >first?

Please note that both >first and >firsta are in the same text file and other similar text files which I am using might contain more lines like these. Thanks a lot in advance..

Replies are listed 'Best First'.
Re: An interesting Perl problem to extract file content
by mjscott2702 (Pilgrim) on Dec 08, 2010 at 10:59 UTC
    Now is it possible to write a perl script that would extract the text starting after >firsta and before the start of - for each line?

    Yes, it is.

    Also, would it be possible to extract the unmatched text from >first?

    Yes, it would.

    Hint - if you show what you have tried, you are more likely to get a helpful response.

      hi thanks..now posted my code..
Re: An interesting Perl problem to extract file content
by salva (Canon) on Dec 08, 2010 at 11:01 UTC
    Here's an interesting problem to solve

    That's not interesting, actually it is quite boring!

    If you show us what you have already tried in order to solve it, and post questions about the specific issues you are having expressing something in Perl, we may be able to help you... but don't expect us to do your (home)work.

      hi thanks for replying me..It's not homework..please be assured..I am just a Biologist interested in Bioinformatics...so not studying in any school right now.. thanks
Re: An interesting Perl problem to extract file content
by Anonymous Monk on Dec 08, 2010 at 13:32 UTC
Re: An interesting Perl problem to extract file content
by locked_user sundialsvc4 (Abbot) on Dec 08, 2010 at 14:02 UTC

    Tasks like this usually involve simple programs, and a lot of regular-expressions.   (So if you have not yet studied Perl regular expressions, do so now.)

    The program will read the file line-by-line, chomp each line, and (presumably) concatenate strings until a meaningful group of lines has been read.   (This can be easy or hard.   If the file always contains, say, a >first line that is always followed by five lines of DNA, it’s easy.   But you should always validate the input data in case either your program, or the data file, or both, has a bug in it.)

    As you will learn in your studies of Perl regexes, regexes are a power-tool that is literally built for the task of ripping strings apart.   So, if you have even the slightest bit of uncertainty of what I am talking about right now, start there, and let the molecules fend for themselves for a few days more.
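
    The read loop described above could be sketched roughly like this. It is only a minimal sketch under some assumptions: every record starts with a ">name" header line, sequence lines contain only A, C, G, T and hyphens, and the helper name read_records() and the in-memory filehandle are made up for illustration (they are not from the thread). Because it keys on whatever follows ">", it does not depend on the headers being called first and firsta.

```perl
use strict;
use warnings;

# Parse FASTA-like text into records: each ">name" header starts a record,
# and the lines that follow are collected until the next header.
# read_records() is a hypothetical helper name used for this sketch.
sub read_records {
    my ($fh) = @_;
    my ( @order, %lines_for, $current );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ( $line =~ /^>(\S+)/ ) {            # header line, any name
            $current = $1;
            push @order, $current;
            $lines_for{$current} = [];
            next;
        }
        die "Sequence data before any header at line $.\n"
            unless defined $current;
        die "Unexpected characters at line $.: $line\n"
            unless $line =~ /^[ACGT-]+$/i;     # validate the input
        push @{ $lines_for{$current} }, $line;
    }
    return ( \@order, \%lines_for );
}

# Demo on the small example from this thread, read from an in-memory string.
my $text = ">first\nACCGG\nATGTTG\nGCCTAA\n>firsta\nACCGG\nA--TTG\n--CTAA\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my ( $order, $lines_for ) = read_records($fh);
close $fh;

for my $name (@$order) {
    print "$name: @{ $lines_for->{$name} }\n";
}
```

    Keeping the lines of each record in an array, rather than joining them into one long string, makes it easy to compare line N of one record with line N of the other later on.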

      hii..thanks for your reply..the prob is that there can be more or less lines after both >first and >firsta. No fixed number of lines or strings...

      >firsta is going to have lesser number of molecules than >first as there is unmatched portion in >firsta

      But >first and >firsta have same number of lines..

      for eg..If no. of lines for >first is 8, then it is same for >firsta

      Seems like I am totally lost in regex of Perl...studying for last few days n I am frustrated now :(

      any suggestion?
Re: An interesting Perl problem to extract file content
by TomDLux (Vicar) on Dec 08, 2010 at 16:10 UTC

    Are you guaranteed the two strings will be the same length? Are you guaranteed they will initially match? When they disagree, which is "the standard" and which is "in error"?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      hi thanks for your reply..here's another small example..

      >first

      ACCGG

      ATGTTG

      GCCTAA

      >firsta

      ACCGG

      A--TTG

      --CTAA

      Here..everything in >first and >firsta is similar except the ---- part. So Perl should extract : TG GC from >first
        No..strings won't be always of same length or line..
Re: An interesting Perl problem to extract file content
by Anonymous Monk on Dec 08, 2010 at 21:52 UTC
    Hiii..guyz..thanks for replying..It's not homework..please be assured..I am just a Biologist interested in Bioinformatics...so not studying in any school right now.. I am really new in Perl..but I am trying..I have written some code for it..but unfortunately it just extracts the matched part..The code I wrote..
    while ( $dna = <DNA> ) {
        chomp $dna;
        if ( $dna =~ /([ACGT]+)/ ) {
            print "$1\n";    # only the first run of ACGT on the line
        }
    }
    I do not know how to get the unmatched part..thanks for any help!!
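
    One possible way to get at the unmatched part, sketched on a single pair of lines from the small example in this thread. It assumes the two lines are aligned and equally long, so that a hyphen at some offset in the >firsta line corresponds to the character at the same offset in the >first line; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# One aligned pair of lines from the small example in this thread.
my $full   = 'ATGTTG';    # line from >first
my $gapped = 'A--TTG';    # same line from >firsta

# Text before the first hyphen (the whole line if it has no hyphen).
my ($before) = $gapped =~ /^([^-]*)/;
print "before gap: $before\n";

# For each run of hyphens, @- and @+ give its start and end offsets,
# so substr() can pull the same stretch out of the full line.
my @missing;
while ( $gapped =~ /-+/g ) {
    my ( $start, $end ) = ( $-[0], $+[0] );
    push @missing, substr( $full, $start, $end - $start );
}
print "missing from gapped line: @missing\n";
```

    For this pair it prints "A" as the text before the gap and "TG" as the stretch of >first that >firsta is missing.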
Re: An interesting Perl problem to extract file content
by 7stud (Deacon) on Dec 09, 2010 at 03:57 UTC

    Another way:

    1) Slurp the whole file:

    $/ = undef; my $whole_file = <$INFILE>;

    2) Split the file into two chunks:

    my ($first_chunk, $second_chunk) = split />firsta\n/, $whole_file;

    3) Split each chunk into an array of lines.

    4) Split the corresponding lines into characters:

     my @letters = split //, $line;

    5) Use a for loop to examine each character in the second line. If the second line contains a '-' get the character at the same index position in the first line.
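
    Put together, the five steps above might look like the following. This is only a sketch: it runs on the small >first / >firsta example from this thread held in a string, uses an in-memory filehandle in place of a real file, and assumes both records have the same number of lines. The variable names other than $INFILE and $whole_file are made up for illustration.

```perl
use strict;
use warnings;

# The small example from this thread, held in a string for the demo.
my $data = ">first\nACCGG\nATGTTG\nGCCTAA\n>firsta\nACCGG\nA--TTG\n--CTAA\n";

# 1) Slurp the whole file (an in-memory filehandle stands in for it here).
open my $INFILE, '<', \$data or die "Cannot open: $!";
my $whole_file = do { local $/; <$INFILE> };
close $INFILE;

# 2) Split into two chunks at the second header.
my ( $chunk1, $chunk2 ) = split />firsta\n/, $whole_file;
$chunk1 =~ s/\A>first\n//;    # drop the first header as well

# 3) Split each chunk into an array of lines.
my @lines1 = split /\n/, $chunk1;
my @lines2 = split /\n/, $chunk2;

# 4) and 5) Walk both lines character by character; wherever the second
# line has '-', collect the character at the same index in the first line.
my @extracted;
for my $i ( 0 .. $#lines2 ) {
    next unless defined $lines1[$i];    # no matching line in the first record
    my @first  = split //, $lines1[$i];
    my @second = split //, $lines2[$i];
    my $found  = '';
    for my $j ( 0 .. $#second ) {
        next unless $second[$j] eq '-';
        $found .= $first[$j] if $j <= $#first;
    }
    push @extracted, $found if length $found;
}
print "@extracted\n";
```

    On this input it prints "TG GC". Note that the sketch collects one string per line, so two separate gaps on the same line would be run together.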

      hii..thanks..but my text files actually do not contain the same >first and >firsta..what to do to take any name?