An interesting Perl problem to extract file content

ghosh24 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: An interesting Perl problem to extract file content by mjscott2702 (Pilgrim) on Dec 08, 2010 at 10:59 UTC
"Now is it possible to write a perl script that would extract the text starting after >firsta and before the start of - for each line?" Yes, it is. "Also, would it be possible to extract the unmatched text from >first?" Yes, it would. Hint: if you show what you have tried, you are more likely to get a helpful response.
Re^2: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 21:53 UTC
Hi, thanks. I have now posted my code.
Re: An interesting Perl problem to extract file content by salva (Canon) on Dec 08, 2010 at 11:01 UTC
"Here's an interesting problem to solve" That's not interesting; actually, it is quite boring! If you show us what you have already tried in order to solve it, and post questions about the specific issues you have expressing something in Perl, we may be able to help you... but don't expect us to do your (home)work.
Re^2: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 21:54 UTC
Hi, thanks for replying. It's not homework, please be assured. I am just a biologist interested in bioinformatics, so I am not studying in any school right now. Thanks.
Re: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 13:32 UTC
bioperl site:perlmonks.org
?node_id=3989;BIT=bioperl;HIT=pars
Subroutine to parse BLAST
Compare fasta files with different headers
Bioperl Sequence Retrieval
Bioinformatic task
Bioinformatics: Slow Parsing of a Fasta File
bioperl parsing blast
Re: An interesting Perl problem to extract file content by locked_user sundialsvc4 (Abbot) on Dec 08, 2010 at 14:02 UTC
Tasks like this usually involve simple programs, and a lot of regular-expressions. (So if you have not yet studied Perl regular expressions, do so now.) The program will read the file line-by-line, `chomp` each line, and (presumably) concatenate strings until a meaningful group of lines has been read. (This can be easy or hard. If the file always contains, say, a `>first` line that is always followed by five lines of DNA, it’s easy. But you should always validate the input data in case either your program, or the data file, or both, has a bug in it.) As you will learn in your studies of Perl regexes, regexes are a power-tool that is literally built for the task of ripping strings apart. So, if you have even the slightest bit of uncertainty of what I am talking about right now, start there, and let the molecules fend for themselves for a few days more.
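The line-by-line approach described above might look something like the following. This is only a minimal sketch: the helper name `read_records` is made up for illustration, the sample data is the small example posted elsewhere in this thread, and the validation assumes that sequence lines contain only the letters A, C, G, T and the gap character `-`.

```perl
use strict;
use warnings;

# Hypothetical helper: read records from a filehandle, grouping the lines
# that follow each ">name" header. Returns a list of [name, \@lines] pairs.
sub read_records {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;        # skip blank lines
        if ( $line =~ /^>(\S+)/ ) {      # any header name starts a new record
            push @records, [ $1, [] ];
            next;
        }
        die "Data before first header at line $.\n" unless @records;
        # Validate the input (assumption: only bases and gaps are legal)
        die "Unexpected characters at line $.: $line\n"
            unless $line =~ /^[ACGT-]+$/;
        push @{ $records[-1][1] }, $line;
    }
    return @records;
}

# Sample input from this thread, read through an in-memory filehandle
my $data = ">first\nACCGG\nATGTTG\nGCCTAA\n>firsta\nACCGG\nA--TTG\n--CTAA\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

for my $rec (@records) {
    my ( $name, $lines ) = @$rec;
    printf "%s: %d lines\n", $name, scalar @$lines;
}
```

Because the header is matched with `/^>(\S+)/` rather than a fixed string, the same loop should cope with any record name and any number of lines per record.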
Re^2: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 22:09 UTC
Hi, thanks for your reply. The problem is that there can be more or fewer lines after both >first and >firsta. There is no fixed number of lines or strings. >firsta is going to have fewer molecules than >first, as there is an unmatched portion in >firsta. But >first and >firsta have the same number of lines: for example, if the number of lines for >first is 8, then it is the same for >firsta. It seems like I am totally lost in Perl regexes. I have been studying for the last few days and I am frustrated now :( Any suggestions?
Re: An interesting Perl problem to extract file content by TomDLux (Vicar) on Dec 08, 2010 at 16:10 UTC
Are you guaranteed the two strings will be the same length? Are you guaranteed they will initially match? When they disagree, which is "the standard" and which is "in error"? As Occam said: Entia non sunt multiplicanda praeter necessitatem. (Entities must not be multiplied beyond necessity.)
Re^2: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 21:58 UTC
Hi, thanks for your reply. Here's another small example:
>first
ACCGG
ATGTTG
GCCTAA
>firsta
ACCGG
A--TTG
--CTAA
Here, everything in >first and >firsta is the same except the ---- part. So Perl should extract: TG GC from >first
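One way to get the expected result for this example is to find each run of `-` in a >firsta line and take the characters at the same offsets from the matching >first line. The sketch below is only an illustration: `unmatched_runs` is a made-up helper name, and it assumes each pair of aligned lines has the same length, which the poster has not confirmed for real data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the runs of characters in $full that sit
# at the positions where $gapped has '-' characters.
# Assumption: both lines are the same length (checked below).
sub unmatched_runs {
    my ( $full, $gapped ) = @_;
    die "Length mismatch: '$full' vs '$gapped'\n"
        unless length($full) == length($gapped);
    my @runs;
    while ( $gapped =~ /(-+)/g ) {
        # $-[1] is the start offset of the run of dashes just matched
        push @runs, substr( $full, $-[1], length($1) );
    }
    return @runs;
}

# The lines from the example above
my @first  = qw(ACCGG ATGTTG GCCTAA);
my @firsta = qw(ACCGG A--TTG --CTAA);

my @found;
for my $i ( 0 .. $#first ) {
    push @found, unmatched_runs( $first[$i], $firsta[$i] );
}
print "@found\n";    # prints "TG GC"
```

Using the `@-` offsets keeps separate gaps on one line as separate results, which may or may not be what is wanted for longer sequences.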
Re^3: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 22:00 UTC
No, the strings won't always be of the same length or number of lines.
Re: An interesting Perl problem to extract file content by Anonymous Monk on Dec 08, 2010 at 21:52 UTC
Hi guys, thanks for replying. It's not homework, please be assured. I am just a biologist interested in bioinformatics, so I am not studying in any school right now. I am really new to Perl, but I am trying. I have written some code for it, but unfortunately it just extracts the matched part. The code I wrote: `while ( $dna = <DNA> ) { if ( $dna =~ /[ACGT]+/ ) { print "$&\n"; } }` I do not know how to get the unmatched part. Thanks for any help!
Re: An interesting Perl problem to extract file content by 7stud (Deacon) on Dec 09, 2010 at 03:57 UTC
Another way:
1) Slurp the whole file: `$/ = undef; my $whole_file = <$INFILE>;`
2) Split the file into two chunks: `split />firsta\n/, $whole_file;`
3) Split each chunk into an array of lines.
4) Split the corresponding lines into characters: `my @letters = split //, $line;`
5) Use a for loop to examine each character in the line from the second chunk. If the character is a '-', get the character at the same index position in the corresponding line from the first chunk.
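Put together, the five steps could be sketched roughly as follows. This is an illustration, not a finished script: it reads the thread's small example from an in-memory filehandle instead of a real file, it assumes the headers are literally `>first` and `>firsta`, and it uses `local $/` inside a `do` block rather than a bare `$/ = undef` so the change does not affect later reads.

```perl
use strict;
use warnings;

# Sample data from this thread; a real script would open a file instead.
my $data = ">first\nACCGG\nATGTTG\nGCCTAA\n>firsta\nACCGG\nA--TTG\n--CTAA\n";
open my $INFILE, '<', \$data or die "Cannot open in-memory file: $!";

# 1) Slurp the whole file
my $whole_file = do { local $/; <$INFILE> };
close $INFILE;

# 2) Split into two chunks at the second header
my ( $chunk1, $chunk2 ) = split />firsta\n/, $whole_file;
$chunk1 =~ s/\A>first\n//;    # drop the first header line

# 3) Split each chunk into an array of lines
my @lines1 = split /\n/, $chunk1;
my @lines2 = split /\n/, $chunk2;
die "Line counts differ\n" unless @lines1 == @lines2;

# 4) and 5) Walk the second chunk's characters; where there is a '-',
# take the character at the same index from the first chunk's line.
my @extracted;
for my $i ( 0 .. $#lines1 ) {
    my @letters1 = split //, $lines1[$i];
    my @letters2 = split //, $lines2[$i];
    my $picked   = '';
    for my $j ( 0 .. $#letters2 ) {
        next unless $letters2[$j] eq '-';
        # Guard against the first line being shorter than the second
        $picked .= $letters1[$j] if defined $letters1[$j];
    }
    push @extracted, $picked if length $picked;
}
print "@extracted\n";    # prints "TG GC"
```

Note that this version joins all gap characters on a line into one string, so two separate gaps on the same line would be reported together.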
Re^2: An interesting Perl problem to extract file content by Anonymous Monk on Dec 09, 2010 at 07:12 UTC
Hi, thanks, but my text files do not actually all use the same >first and >firsta names. What should I do to accept any name?