joomanji has asked for the wisdom of the Perl Monks concerning the following question:

Guys, I really appreciate all your help, especially Ovid and QM! First, I would like to make clear that this is not my homework. :) haha

I have tried the two scripts that Ovid and QM provided, but they don't seem to work for me, so I should give you all a clearer view of the input and the output I need. I will describe the input format first. The output of the trf program is a text file.

Input:
Tandem Repeats Finder Program writen by: Gary Benson
Department of Biomathematical Sciences
Mount Sinai School of Medicine
Version 3.21

Sequence: contig001
Parameters: 2 7 7 80 10 50 50
506 540 6 2.0 6 100 0 70 42 42 0 14 1.45 AAACCC AAACCAAACCC
664 691 3 9.3 3 100 0 56 32 35 32 0 1.58 CAG CAGCAGCAGCAG
2642 2668 3 9.0 3 100 0 54 0 33 33 33 1.58 CTG CTGCTGCTGCTG

Sequence: contig002
Parameters: 2 7 7 80 10 50 50
128 188 3 20.3 3 80 6 70 34 29 31 4 1.79 GCA GCAGCAGCAGCAA
313 357 3 15.0 3 81 9 56 35 33 28 2 1.70 AGC AGCAGCAGCAGC

Let me explain the trf output (which is my input). The software first prints a few lines about the author, followed by the results. The identifier for the first line of each block is "Sequence:", because the sequence name always differs depending on the input. Second, the block shows the parameters used by the program; these are the same throughout the whole output. Third comes the information about the repeats. The first sequence has three separate lines of repeat information, and the second has two. Every line has 13 numbers and 2 alphabetic fields, which are the repeat unit and the consensus sequence.

Output:
contig001 506 540 6 2.0 6 100 0 70 42 42 0 14 1.45 AAACCC AAACCAAACCC
contig001 664 691 3 9.3 3 100 0 56 32 35 32 0 1.58 CAG CAGCAGCAGCAG
contig001 2642 2668 3 9.0 3 100 0 54 0 33 33 33 1.58 CTG CTGCTGCTGCTG
contig002 128 188 3 20.3 3 80 6 70 34 29 31 4 1.79 GCA GCAGCAGCAGCAA
contig002 313 357 3 15.0 3 81 9 56 35 33 28 2 1.70 AGC AGCAGCAGCAGC

OK, this is the output I need. For the above example, I want only 5 lines of output, so that I can import them into Excel or even a MySQL database. Again, all of your help is highly appreciated!

P.S. Some of the output may contain blocks with no results, such as:

Sequence: contig003
Parameters: 2 7 7 80 10 50 50
Sequence: contig004
Parameters: 2 7 7 80 10 50 50
128 188 3 20.3 3 80 6 70 34 29 31 4 1.79 GCA GCAGCAGCAGCAATAGCAGCAG

Replies are listed 'Best First'.
Re: Perl Parser, need help
by QM (Parson) on May 14, 2005 at 04:21 UTC
    It would have been nice if you'd specified exactly what you wanted, plus the example. But guessing at the format (untested):
    #!/your/perl/here
    use strict;
    use warnings;
    my $sequence;

    while (my $line = <>)
    {
        # Sequence keyword
        if ($line =~ /^Sequence:\s+(contig\d+)/)
        {
            $sequence = $1; # save the sequence
            next;
        }

        # Parameter keyword
        next if ($line =~ /^Parameters:\s+(.*)$/);

        # print any other non-blank lines
        if ($line !~ /^\s*$/)
        {
            print "$sequence $line";
            next;
        }
    }
    I don't know what your spec is for line breaks, or whether blank lines have meaning, or what to do for lines that don't begin with Parameter or Sequence keywords.

    Please use <code> tags to show code and output, otherwise we have to check the page source to see the actual format. (And I still might have got the input format wrong.)

    Update: Corrected Parameters: error as pointed out by Ovid. The Parameters: line is just skipped.

    Update 2: Adding missing semicolon to Parameter line.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Bleh. I think yours is a bit nicer. I read all of the data into memory as a quick hack. I should have iterated over the lines. Still, I'm wondering what you're doing with parameters. The OP appears to be discarding that information. Did I miss something?

      Cheers,
      Ovid

      New address of my CGI Course.

        Still, I'm wondering what you're doing with parameters. The OP appears to be discarding that information. Did I miss something?
        No, I think I missed something. I didn't check the page source closely enough. I thought the Parameters: 2 7 7 80 10 50 50 was part of the line following it.

        I'll go update it now to correct that.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      #!/your/perl/here
      use strict;
      use warnings;
      my $sequence;

      while (my $line = <>)
      {
          # Sequence keyword
          if ($line =~ /^Sequence:\s+(contig\d+)/)
          {
              $sequence = $1; # save the sequence
              next;
          }

          # Parameter keyword
          #next if ($line =~ /^Parameters:\s+(.*)$/)

          # print any other non-blank lines
          if ($line != /^\s*$/)
          {
              print "$sequence $line";
              next;
          }
      }

      QM..

      I have tried your script. Before I commented out the Parameters line, the script had a compilation error at line 24. After commenting that line out, I managed to run the script, though with a warning: isn't numeric in numeric ne (!=) at trf.pl line 19. I still managed to get the output after editing it in Excel. The funny thing is that results starting with 1 (e.g. 1 45 7 6.4 7 97 0 72 35 44 4 15 1.67 CCTAAAC CCTAAACCCTAAACCCTAAACCCTAAACCCTAAGCCCTAAGCCCT) don't show up in the parsed result. I wonder what's going on with this. But anyway, thank you so much!
        I did mention that it was untested.

        That line is missing a terminating semicolon, fixed.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of
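
Both symptoms reported above (the "isn't numeric in numeric ne (!=)" warning and the missing rows that start with 1) are consistent with the `!=` on line 19 of the script as it was run. `!=` is a numeric comparison, so the bare match `/^\s*$/` is applied to `$_` instead of `$line`, and its result is then compared with the leading number of the line; a line whose first number is 1 compares equal and is dropped. The negated binding operator `!~` appears to be what was intended. A minimal sketch of that fix follows. The helper name `reformat_trf`, the looser `(\S+)` capture, and the choice to skip everything before the first `Sequence:` line are assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of a TRF report and returns the
# repeat lines, each prefixed with the name of the sequence it belongs to.
sub reformat_trf {
    my @lines = @_;
    my ($sequence, @out);
    for my $line (@lines) {
        chomp $line;

        # Sequence keyword: remember the name for the lines that follow
        if ($line =~ /^Sequence:\s+(\S+)/) {
            $sequence = $1;
            next;
        }

        # Parameters keyword: skipped
        next if $line =~ /^Parameters:/;

        # Keep any other non-blank line. Note !~ (negated match), not !=.
        # Lines before the first "Sequence:" (the banner) are ignored.
        if (defined $sequence && $line !~ /^\s*$/) {
            push @out, "$sequence $line";
        }
    }
    return @out;
}

# Sample input built from lines quoted in this thread; pairing the
# "1 45 ..." row with contig001 is for illustration only.
my @sample = (
    "Sequence: contig001\n",
    "Parameters: 2 7 7 80 10 50 50\n",
    "1 45 7 6.4 7 97 0 72 35 44 4 15 1.67 CCTAAAC CCTAAACCCTAAACCCTAAACCCTAAACCCTAAGCCCTAAGCCCT\n",
    "\n",
    "506 540 6 2.0 6 100 0 70 42 42 0 14 1.45 AAACCC AAACCAAACCC\n",
);
print "$_\n" for reformat_trf(@sample);
```

With `!~` the row starting with 1 is kept alongside the others. To run this on a real report, the sample list could be replaced with `<>` so the lines come from the file named on the command line.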

Re: Perl Parser, need help
by Ovid (Cardinal) on May 14, 2005 at 04:11 UTC

    Since this is science related, I want to help, but help from you would be good. First, I would recommend listing the code you used to solve your first problem. Second, I would more clearly state the steps necessary to transform your input to your output.

    I took a wild swing at solving your problem, but here are the assumptions I made (and they could be erroneous).

    • This isn't homework (I hope it's not!)
    • You appear to be discarding the parameter information so I did, too.
    • The input and output formats were vague, so I guessed after looking at the raw source of your node.
    • Data will always start with a "Sequence: ..." (and sequences will not overlap)
    • The "contig$digits" information will prepend every subsequent line until the next "Sequence".

    My stab at it:
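
The code for this reply is not shown here, so what follows is not Ovid's original. It is only a minimal sketch consistent with the assumptions listed above and with the read-everything-into-memory approach mentioned elsewhere in the thread. The helper name `prefix_repeats` and the record-splitting regex are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the whole report as one string, splits it into
# per-sequence records, and returns the repeat lines prefixed with the name.
sub prefix_repeats {
    my ($report) = @_;
    my @out;

    # A record runs from one "Sequence:" line up to the next one (or the end
    # of the input). Anything before the first "Sequence:" is never matched.
    while ($report =~ /^Sequence:[ \t]+(\S+)[^\n]*\n?(.*?)(?=^Sequence:|\z)/msg) {
        my ($name, $body) = ($1, $2);
        for my $line (split /\n/, $body) {
            next if $line =~ /^Parameters:/;   # parameter information is discarded
            next if $line !~ /\S/;             # blank lines carry no data
            push @out, "$name $line";
        }
    }
    return @out;
}

# The "blank result" case from the question: contig003 has no repeat lines.
my $sample = <<'END';
Sequence: contig003
Parameters: 2 7 7 80 10 50 50
Sequence: contig004
Parameters: 2 7 7 80 10 50 50
128 188 3 20.3 3 80 6 70 34 29 31 4 1.79 GCA GCAGCAGCAGCAATAGCAGCAG
END

print "$_\n" for prefix_repeats($sample);
```

For a real file the string could be filled with `my $report = do { local $/; <> };`, which slurps the input in one read. That is fine for small reports, though iterating line by line scales better.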

Re: Perl Parser, need help
by davidrw (Prior) on May 14, 2005 at 14:34 UTC
    The above approaches are probably more robust, but here's a quick & dirty alternative:
    # DOS quoting:
    perl -ne "$contig=$1 if /^Sequence:\s+(\S+)/; print \"$contig $_\" if $_!~/^(\S+:)/ && /\S/" dat.txt
    # *nix quoting:
    perl -ne '$contig=$1 if /^Sequence:\s+(\S+)/; print "$contig $_" if $_!~/^(\S+:)/ && /\S/' dat.txt
    Either just redirect the output to a file (or pipe it elsewhere), or you could use the -i argument (man perlrun) as well.
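
Since the stated goal is to load the result into Excel or MySQL, the same filter could emit tab-separated fields rather than one space-separated line. The following is only a sketch: the helper name `trf_to_tsv` and the check for exactly 15 fields per repeat line (13 numbers, the repeat unit, and the consensus sequence, as described in the question) are assumptions, not something the one-liner above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one tab-separated row per repeat line,
# with the sequence name as the first column.
sub trf_to_tsv {
    my @lines = @_;
    my ($contig, @rows);
    for my $line (@lines) {
        if ($line =~ /^Sequence:\s+(\S+)/) {
            $contig = $1;
            next;
        }
        next if $line =~ /^\S+:/;          # other keyword lines, e.g. "Parameters:"

        my @fields = split ' ', $line;     # splits on whitespace, drops the newline
        # Assumed format: exactly 15 fields; this also skips banner and blank lines.
        next unless defined $contig && @fields == 15;

        push @rows, join("\t", $contig, @fields);
    }
    return @rows;
}

my @sample = (
    "Tandem Repeats Finder Program writen by: Gary Benson\n",
    "Sequence: contig002\n",
    "Parameters: 2 7 7 80 10 50 50\n",
    "128 188 3 20.3 3 80 6 70 34 29 31 4 1.79 GCA GCAGCAGCAGCAA\n",
);
print "$_\n" for trf_to_tsv(@sample);
```

With the sample list replaced by `<>` and the script saved under some name (say `trf2tsv.pl`, a made-up name), redirecting its output to a file gives a tab-delimited text file. Excel can open such a file, and tab is the default field separator for MySQL's `LOAD DATA`.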
      Maybe it's just too late in my caffeine cycle, but I didn't think you could simply escape double-quotes in DOS, can you? That would make it too easy, and I'm pretty sure I remember otherwise.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        I too was shocked to see it work, and yesterday was a caffeine-less day for me, so it must be true! Yesterday was on win2k. I just tried it with success on win98 as well!
        C:\temp>echo "blah \" ad"
        "blah \" ad"
        C:\temp>echo blah \" ad
        blah \" ad