Re^3: split a file into records and process it

Then I'd suggest that, rather than having a single loop that reads lines and then uses conditionals to decide what to do with each type of line, you use a loop that terminates on eof and reads the individual lines of each record inline. This makes for a more robust parser, with less confusing conditional code and less line-to-line state.

This doesn't do the final extraction of the required parts from the individual lines of the records, which is easily added, but it serves to demonstrate the technique:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my %records;
until( eof( DATA ) ) {
    chomp( my $exon = <DATA> );
    push @{ $records{ $exon } }, {};

    my $seqs = 1;
    my $line = <DATA>;
    if( $line =~ m[(\d+) different hits] ) {
        $seqs = $1;
        chomp( $records{ $exon }[ -1 ]{ gene_id } = <DATA> );
    }
    else {
        chomp( $records{ $exon }[ -1 ]{ gene_id } = $line );
    }

    chomp( $records{ $exon }[ -1 ]{ Nm_id } = <DATA> );

    chomp( $records{ $exon }[ -1 ]{ snoRNA_key } = <DATA> );

    for( 1 .. $seqs ) {
        chomp( my $query = <DATA> );
        scalar (<DATA>);
        chomp( my $sbjct = <DATA> );
        push @{ $records{ $exon }[ -1 ]{ seqs } }, { $query => $sbjct };
    }

    chomp( $records{ $exon }[ -1 ]{ gene_name } = <DATA> );

    chomp( $records{ $exon }[ -1 ]{ web_link  } = <DATA> );
}

pp \%records;

__DATA__
3
GI:91982771
NM_001040105.1
snoRNA 10
Query  4     TGGAGTCAAT  13
             ||||||||||
Sbjct  4854  TGGAGTCAAT  4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
31
4 different hits
GI:153945877
NM_002458.1
snoRNA 25, 26 and 27
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  6818  CTGGAGTCGAGTG  6806
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  8489  CTGGAGTCGAGTG  8477
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  10589  CTGGAGTCGAGTG  10577
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  12260  CTGGAGTCGAGTG  12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877
4
GI:150418008
NM_206862.2
snoRNA 25, 26 and 27
Query  1     ACCTGGAGTCGAG  13
             |||||||||||||
Sbjct  4775  ACCTGGAGTCGAG  4763
Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008

Output:

C:\test>junk55
{
  3  => [
          {
            Nm_id      => "NM_001040105.1",
            gene_id    => "GI:91982771",
            gene_name  => "Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.",
            seqs       => [
                            {
                              "Query  4     TGGAGTCAAT  13" => "Sbjct  4854  TGGAGTCAAT  4845",
                            },
                          ],
            snoRNA_key => "snoRNA 10",
            web_link   => "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771",
          },
          {
            Nm_id      => "NM_001100162.1",
            gene_id    => "GI:154448895",
            gene_name  => "Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.",
            seqs       => [
                            {
                              "Query  2    CCTGGAGTCGAGTG  15" => "Sbjct  146  CCTGGAGTCGAGTG  133",
                            },
                          ],
            snoRNA_key => "snoRNA 25, 26 and 27",
            web_link   => "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895",
          },
        ],
  4  => [
          {
            Nm_id      => "NM_206862.2",
            gene_id    => "GI:150418008",
            gene_name  => "Homo sapiens transforming, acidic coiled-coil containing protein 2 (TACC2), transcript variant 1, mRNA.",
            seqs       => [
                            {
                              "Query  1     ACCTGGAGTCGAG  13" => "Sbjct  4775  ACCTGGAGTCGAG  4763",
                            },
                          ],
            snoRNA_key => "snoRNA 25, 26 and 27",
            web_link   => "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=10&list_uids=150418008",
          },
        ],
  31 => [
          {
            Nm_id      => "NM_002458.1",
            gene_id    => "GI:153945877",
            gene_name  => "Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.",
            seqs       => [
                            {
                              "Query  3     CTGGAGTCGAGTG  15" => "Sbjct  6818  CTGGAGTCGAGTG  6806",
                            },
                            {
                              "Query  3     CTGGAGTCGAGTG  15" => "Sbjct  8489  CTGGAGTCGAGTG  8477",
                            },
                            {
                              "Query  3      CTGGAGTCGAGTG  15" => "Sbjct  10589  CTGGAGTCGAGTG  10577",
                            },
                            {
                              "Query  3      CTGGAGTCGAGTG  15" => "Sbjct  12260  CTGGAGTCGAGTG  12248",
                            },
                          ],
            snoRNA_key => "snoRNA 25, 26 and 27",
            web_link   => "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877",
          },
        ],
}
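For the "final extraction" step that the code above leaves out, something along these lines might do. This is only a sketch: the helper names `parse_alignment_line` and `parse_gene_id` are made up for illustration, and it assumes every Query/Sbjct line has the four whitespace-separated fields seen in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: split one 'Query' or 'Sbjct' line into its parts.
# Assumes the layout "<label> <start> <sequence> <end>" from the sample data.
sub parse_alignment_line {
    my( $line ) = @_;
    # split ' ' ignores leading whitespace and collapses runs of spaces
    my( $label, $start, $seq, $end ) = split ' ', $line;
    return { label => $label, start => $start, seq => $seq, end => $end };
}

# Hypothetical helper: pull the numeric part out of a "GI:nnn" line.
# Returns undef if the line does not look like a GI line.
sub parse_gene_id {
    my( $line ) = @_;
    my( $id ) = $line =~ m[GI:(\d+)];
    return $id;
}

my $sbjct = parse_alignment_line( 'Sbjct  4854  TGGAGTCAAT  4845' );
printf "%s %d..%d %s\n", @{ $sbjct }{ qw( label start end seq ) };
print parse_gene_id( 'GI:91982771' ), "\n";
```

Inside the read loop, the chomp'd `$query` and `$sbjct` strings could be passed through `parse_alignment_line` before being pushed onto `seqs`.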

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"


Re^4: split a file into records and process it by biohisham (Priest) on Mar 25, 2010 at 15:23 UTC
Right, the character limit is frustrating. As for how I would sort by snoRNA: since snoRNAs 25, 26 and 27 are the same sequence-wise, I named that particular field "snoRNA 25, 26 and 27" accordingly, so it would sort as "25, 26 and 27". So consider:

6
GI:50845406
NM_031444.2
snoRNA3 -9 Box D except for snoRNA4
Query  3     CTGGAGTCAAGGCT  16
             ||||||||||||||
Sbjct  1297  CTGGAGTCAAGGCT  1284
Homo sapiens chromosome 22 open reading frame 13 (C22orf13), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UEDK8KR6016&log%24=nuclalign&blast_rank=13&list_uids=50845406
5
GI:38327560
NM_006282.2
snoRNA3 -9 Box D except for snoRNA4
Query  5     GGAGTCAAGGCTAC  18
             ||||||||||||||
Sbjct  5129  GGAGTCAAGGCTAC  5116
Homo sapiens serine/threonine kinase 4 (STK4), mRNA
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UEDK8KR6016&log%24=nuclalign&blast_rank=14&list_uids=38327560
3
GI:91982771
NM_001040105.1
snoRNA 10
Query  4     TGGAGTCAAT  13
             ||||||||||
Sbjct  4854  TGGAGTCAAT  4845
Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDU305DZ01N&log%24=nuclalign&blast_rank=97&list_uids=91982771
3
GI:154448895
NM_001100162.1
snoRNA 25, 26 and 27
Query  2    CCTGGAGTCGAGTG  15
            ||||||||||||||
Sbjct  146  CCTGGAGTCGAGTG  133
Homo sapiens exportin 7 (XPO7), transcript variant 3, mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=2&list_uids=154448895
31
GI:153945877
NM_002458.1
snoRNA 25, 26 and 27
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  6818  CTGGAGTCGAGTG  6806
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  8489  CTGGAGTCGAGTG  8477
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  10589  CTGGAGTCGAGTG  10577
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  12260  CTGGAGTCGAGTG  12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877

I would want to sort according to:

snoRNA 10
snoRNA3 -9 Box D except for snoRNA4
snoRNA 25, 26 and 27

So I would get the GI, NM, seqs (query and subject), geneName, exon# and weblink for each of the snoRNAs, in a structure like:

'snoRNA 25,26 and 27'=>[
    {
        GI=>'GI:15444889',
        NM=>'NM_001100162.1',
        exon=>'3',
        seq=>[
            Query  2    CCTGGAGTCGAGTG  15
                        ||||||||||||||
            Sbjct  146  CCTGGAGTCGAGTG  133
        ],
        geneName =>'Homo sapiens mucin 17, cell surface associated (MUC17), mRNA.',
        weblink=>'http://......'
    },
    {
        GI=>'GI:153945877',
        NM=>'NM_002458.1',
        exon=>'31',
        seq=>[
            #more than one seq
        ]
    },
],
'snoRNA3 -9 Box D except for snoRNA4'=>[
    #more than one record once again
    .......
]

UPDATE: Here's my adaptation of your earlier code. I would just need to change the order in which a record is arranged, bringing the "snoRNA" line to the top, before the "exon" line, and ensure that spaces are avoided at all costs. It works the same way as the code you modified from Re: split a file into records and process it ...

use strict;
use Data::Dump qw[ pp ];

my %records;
until( eof( DATA ) ) {
    chomp( my $snoRNA = <DATA> );
    push @{ $records{ $snoRNA } }, {};

    my $seqs = 1;
    my $line = <DATA>;
    if( $line =~ m[(\d+) different hits] ) {
        $seqs = $1;
        chomp( $records{ $snoRNA }[ -1 ]{ exon } = <DATA> );
    }
    else {
        chomp( $records{ $snoRNA }[ -1 ]{ exon } = $line );
    }

    chomp( $records{ $snoRNA }[ -1 ]{ GeneID } = <DATA> );

    chomp( $records{ $snoRNA }[ -1 ]{ NM_ID } = <DATA> );

    for( 1 .. $seqs ) {
        chomp( my $query = <DATA> );
        scalar (<DATA>);
        chomp( my $sbjct = <DATA> );
        push @{ $records{ $snoRNA }[ -1 ]{ seqs } }, { $query => $sbjct };
    }

    chomp( $records{ $snoRNA }[ -1 ]{ gene_name } = <DATA> );

    chomp( $records{ $snoRNA }[ -1 ]{ web_link } = <DATA> );
}

pp \%records;

__DATA__
snoRNA 25, 26 and 27
2
GI:142387131
NM_006299.3
Query  2    CCTGGAGTCGAGT  14
            |||||||||||||
Sbjct  371  CCTGGAGTCGAGT  359
Homo sapiens zinc finger protein 193 (ZNF193), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=11&list_uids=142387131
snoRNA 25, 26 and 27
1
NM_001005236.3
GI:256773198
Query  3    CTGGAGTCGAGTGTCT  18
            |||||| |||||||||
Sbjct  168  CTGGAGACGAGTGTCT  153
Homo sapiens olfactory receptor, family 1, subfamily L, member 1 (OR1L1), mRNA.
http://www.ncbi.nlm.nih.gov/
snoRNA 25, 26 and 27
4 different hits
31
GI:153945877
NM_002458.1
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  6818  CTGGAGTCGAGTG  6806
Query  3     CTGGAGTCGAGTG  15
             |||||||||||||
Sbjct  8489  CTGGAGTCGAGTG  8477
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  10589  CTGGAGTCGAGTG  10577
Query  3      CTGGAGTCGAGTG  15
              |||||||||||||
Sbjct  12260  CTGGAGTCGAGTG  12248
Homo sapiens mucin 5B, oligomeric mucus/gel-forming (MUC5B), mRNA.
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=nucleotide&dopt=GenBank&RID=UDW41RSS01S&log%24=nuclalign&blast_rank=9&list_uids=153945877

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re^5: split a file into records and process it by BrowserUk (Patriarch) on Mar 25, 2010 at 15:58 UTC
Okay. You want the snoRNA text used as the primary key. Output like this?

Read more... (5 kB)

If so, then the changes required from Re: split a file into records and process it are minimal:

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my %records;
until( eof( DATA ) ) {
    my %record;

    ## put the exon number inside the record
    ( $record{ exon } ) = ( <DATA> =~ m[(\d+)] );

    my $seqs = 1;
    my $line = <DATA>;
    if( $line =~ m[(\d+) different hits] ) {
        $seqs = $1;
        $line = <DATA>;
    }
    ( $record{ gene_id } ) = ( $line =~ m[GI:(\d+)] );

    ( $record{ Nm_id } ) = ( <DATA> =~ m[(NM_\d[\d]+)] );

    ## save the snoRNA text...
    chomp( my $snoRNA_key = <DATA> );

    for( 1 .. $seqs ) {
        my $query = [ split ' ', <DATA> ];
        shift @$query;
        scalar (<DATA>);
        my $sbjct = [ split ' ', <DATA> ];
        shift @$sbjct;
        push @{ $record{ seqs } }, { query => $query, sbjct => $sbjct };
    }

    chomp( $record{ gene_name } = <DATA> );

    chomp( $record{ web_link  } = <DATA> );

    ## And use it as the primary key in the main hash
    push @{ $records{ $snoRNA_key } }, \%record;
}

pp \%records;

__DATA__
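Once the records are keyed by the snoRNA text, walking them in a chosen order is a matter of sorting the keys. The sketch below is an illustration only: the `%records` hash is a small hand-built stand-in for the parser's result (values copied from the sample data), and ordering by the first number in each key is an assumed rule, with `first_number` a made-up helper. If a fixed order is wanted instead, an explicit list of keys can replace the sort.

```perl
use strict;
use warnings;

# Hand-built stand-in for the %records hash built by the parser,
# trimmed to a few fields per record.
my %records = (
    'snoRNA 25, 26 and 27' => [
        { exon => 3,  gene_id => '154448895', Nm_id => 'NM_001100162' },
        { exon => 31, gene_id => '153945877', Nm_id => 'NM_002458'    },
    ],
    'snoRNA 10' => [
        { exon => 3,  gene_id => '91982771',  Nm_id => 'NM_001040105' },
    ],
);

# Assumed ordering rule: the first number that appears in the key
# (0 if the key contains no digits at all).
sub first_number {
    my( $n ) = $_[0] =~ m[(\d+)];
    return defined $n ? $n : 0;
}

# Build one tab-separated line per record, grouped by sorted key.
my @report;
for my $key ( sort { first_number( $a ) <=> first_number( $b ) or $a cmp $b } keys %records ) {
    for my $rec ( @{ $records{ $key } } ) {
        push @report, join "\t", $key, @{ $rec }{ qw( exon gene_id Nm_id ) };
    }
}
print "$_\n" for @report;
```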