gdnew has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks. I have a database that consists of records separated by // and \n.
I have a pretty terrible program that relies on lots of flags, and I can't work out where to check one of them (the siteflag). Part of my input and output files looks like this.
Input file:

TITLE An excitatory scorpion toxin with a distinctive feature: an
      additional alpha helix at the C terminus and its implications
      for interaction with insect sodium channels
/interaction_site="Q8, N9, Y10, N11, C12, F17, W38, R58, V59 and K62
      form the putative bioactive surface in mature toxin
      (Zilberberg et al., 1997)."
/channel="Sodium channel"
/target_cell="Insect specific (Excitatory)"
/c_end="Free"
//
TITLE Cloning and Sequencing of an Excitatory Insect-Selective
      Neurotoxin BmKIT cDNA from Buthus martensii Karsch
/interaction_site="Sequential deletions of C-terminal residues
      suggested Ile73 and Ile74 for toxicity. {Oren et al., 1999}"
/channel="Sodium channel"
/c_end="Free"
//

Output:

References:TITLE  "An excitatory scorpion toxin with a distinctive feature: an additional alpha helix at the C terminus and its implications for interaction with insect sodium channels"
Interaction_site  "Q8, N9, Y10, N11, C12, F17, W38, R58, V59 and K62 form the putative bioactive surface in mature toxin (Zilberberg et al., 1997)."
Channel  "Sodium channel"
Target_cell  "Insect specific (Excitatory)"
C_end  "Free"
References:TITLE  "Cloning and Sequencing of an Excitatory Insect-Selective Neurotoxin BmKIT cDNA from Buthus martensii Karsch"
Interaction_site  "Sequential deletions of C-terminal residues suggested Ile73 and Ile74 for toxicity. {Oren et al., 1999}"
Channel  "Sodium channel"
C_end  "Free"
The title, interaction_site and c_end are fixed elements (they appear in every record). The rest are optional.
For every record in the file I check the lines one by one, modify some of the input and print it to the output file.
The title and interaction site may consist of nothing (" "), a single line, or multiple lines.
Therefore I must use flags to keep track of the input.
The problem is that there are quite a lot of optional elements after the interaction site (they do not appear in every record). I include only two of them here. My code looks like this:
compile : perl prog.pl input.db result

#! /usr/local/bin/perl -w

#initialize all the variable, initialize flags to 0 and line to ''
my $counter=1;
my $file1="$ARGV[0]";
my $result=">".$ARGV[1];
my $site='';
my $titleline='';
my $siteflag=0;
my $titleflag=0;

open(INFO1,$file1) or die "Can't open $file1.\n";   #open file1
open(OUT,$result) or die "Can't open $result.\n";   #open result

#the input files has a separator :\r\n in each line
foreach(<INFO1>) {
    if(/\s*TITLE\s*(.*)\r/){
        ######## check the title
        $titleflag=1;
        $titleline=$1;
    }
    elsif(/\s*\/interaction_site=(.*)\r/){
        ######## handle the title
        print OUT qq(References:TITLE\t "$titleline"\n);
        $titleflag=0;
        $titleline='';
        ######## check the site
        $site=$1;
        $siteflag=1;
    }
    elsif(/\s*(.*)\r/ && $titleflag==1){
        $titleline.=" ";    # add a white space
        $titleline.=$1;     #concatenate the title with previous line
    }
    elsif(/\s*\/channel=(.*)\r/){
        if(check2($1)){
            print OUT "Channel\t $1\n";
        }
    }
    elsif(/\s*\/target_cell=(.*)\r/){
        if(check2($1)){
            print OUT "Target_cell\t $1\n";
        }
    }
    elsif(/\s*\/c_end=(.*)\r/){
        ######## handle interaction site
        $siteflag=0;
        $site='';
        ######## check c_end
        if(check2($1)){
            print OUT "C_end\t $1\n";
        }# end if
    }#end elsif
    ####elsif(/\s*(.*)\r && $siteflag==1){
    ####    $site.=" ";    # add a white space
    ####    $site.=$1;     #concatenatewith previous site
    ####    print "Site $site\n";
    #### }
} # end foreach

sub check2 {
    #check whether item = empty quotes
    if($1 =~ /" "/){ return 0;}
    else{ return 1;}
}
The last block, the one commented out with ####, is the code that needs to be modified. If I leave it in that location, it only prints the interaction site when the site spans more than one line.
Where should I put this code so that the interaction_site is printed whether it consists of " ", a single line or multiple lines? Thanks so much...

Replies are listed 'Best First'.
(crazyinsomniac) Re: about where to check the flag
by crazyinsomniac (Prior) on Feb 07, 2002 at 09:23 UTC
    I was too dizzy after looking at your code (could've been the tequila) , so I offer a simpler strategy which I employed on more than 1 occasion:
    #!/usr/bin/perl -wT
    use strict;
    use CGI;

    my %defaultRecord = (
        interaction_site => undef,
        TITLE            => undef,
        channel          => undef,
        target_cell      => undef,
        c_end            => undef,
    ,);

    my $blankRecord = new CGI(\%defaultRecord);

    $blankRecord->param(
        -name  => 'channel',
        -value => 'Sodium channel',
    ,);

    open(SAVERECORDHERE,'>','savedrecord.dat') or die "crapola $!";
    $blankRecord->save(SAVERECORDHERE);
    close(SAVERECORDHERE);
    I will now add a quote from the CGI pod (my fav, quoting pod that is):

    SAVING THE STATE OF THE SCRIPT TO A FILE:

        $query->save(FILEHANDLE)

    This will write the current state of the form to the provided filehandle. You can read it back in by providing a filehandle to the new() method. Note that the filehandle can be a file, a pipe, or whatever!

    The format of the saved file is:

        NAME1=VALUE1
        NAME1=VALUE1'
        NAME2=VALUE2
        NAME3=VALUE3
        =

    Both name and value are URL escaped. Multi-valued CGI parameters are represented as repeated names. A session record is delimited by a single = symbol. You can write out multiple records and read them back in with several calls to new. You can do this across several sessions by opening the file in append mode, allowing you to create primitive guest books, or to keep a history of users' queries. Here's a short example of creating multiple session records:
        use CGI;

        open (OUT,">>test.out") || die;
        $records = 5;
        foreach (0..$records) {
            my $q = new CGI;
            $q->param(-name=>'counter',-value=>$_);
            $q->save(OUT);
        }
        close OUT;

        # reopen for reading
        open (IN,"test.out") || die;
        while (!eof(IN)) {
            my $q = new CGI(IN);
            print $q->param('counter'),"\n";
        }

    Not only does the above make life simpler, it's built on the tried'and'true CGI.pm.

    Now you can concentrate on finishing your app, instead of parsing flat-files ... also, an alternative to the above CGI thingy might be to use windows ini style records, something like

        [recordorsomething]
        key = value
        k0ey = valuee
        [recordothersomething]
        k = v

    for which there is also a module on cpan (Config::INI)

    What I also like to do, instead of using a flat file, is to add DB_File to the mix. Together with CGI.pm it beats a flat file and, as always, makes for a solution that is easy to parse, quick to write, and has the security of familiarity.

    Happy Coding!

    update:
    It has been brought to my attention, that gdnew is using a very peculiar dataformat, sorta like:

        COMMERCIAL SUPPLIERS
        SEQUENCE
        /exon="1-120"
        /intron=" "
        //

    mentioned in strange quotes.

    Now my question is for you, gdnew: where did you get the idea to use such a bizarre format?

     
    ______crazyinsomniac_____________________________
    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: about where to check the flag
by dreadpiratepeter (Priest) on Feb 07, 2002 at 13:27 UTC
    I'm not sure I understand the last part of your question. But regardless, you can simplify your processing by realizing that you have a delimited input file: records are separated by // and fields (flags) are separated by /. So with judicious use of split you have:
    #!/usr/local/bin/perl
    use strict;

    # pull everything into a string
    my $str = join("",<DATA>);
    # dump the newlines
    $str =~ s/\n/ /g;
    # loop through the records (// delimited)
    foreach (split(m!//!,$str)) {
        last unless /\S/;                       # skip that pesky last blank record
        my ($title,@flags) = split(m!/!);       # break out the fields (/ delimited)
        $title =~ s/\s+/ /g;                    # kill extra whitespace
        print "References:$title\n";            # print the title
        foreach (@flags) {                      # loop through the fields
            my ($key,$value) = split(/=/);      # split into pairs
            $value =~ s/\s+/ /g;                # kill extra whitespace
            print ucfirst($key)."\t$value\n";   # print each one
        }
        print "\n";
    }
    __DATA__
    TITLE An excitatory scorpion toxin with a distinctive feature: an
          additional alpha helix at the C terminus and its implications
          for interaction with insect sodium channels
    /interaction_site="Q8, N9, Y10, N11, C12, F17, W38, R58, V59 and K62
          form the putative bioactive surface in mature toxin
          (Zilberberg et al., 1997)."
    /channel="Sodium channel"
    /target_cell="Insect specific (Excitatory)"
    /c_end="Free"
    //
    TITLE Cloning and Sequencing of an Excitatory Insect-Selective
          Neurotoxin BmKIT cDNA from Buthus martensii Karsch
    /interaction_site="Sequential deletions of C-terminal residues
          suggested Ile73 and Ile74 for toxicity. {Oren et al., 1999}"
    /channel="Sodium channel"
    /c_end="Free"
    //


    This approach works well if the files are small. The initial step, reading the whole file and stuffing it into a string, breaks down if the file is huge. In that case it can be modified to read one line at a time until it has read a full record, then stuff the record into a string and parse it. Like so:
    #!/usr/local/bin/perl
    use strict;

    my $str;    # holds the records

    # loop through the data
    while (<DATA>) {
        chomp;                                          # kill newlines
        if (m!//!) {                                    # we have a record
            my ($title,@flags) = split(m!/!,$str);      # break out the fields (/ delimited)
            $title =~ s/\s+/ /g;                        # kill extra whitespace
            print "References:$title\n";                # print the title
            foreach (@flags) {                          # loop through the fields
                my ($key,$value) = split(/=/);          # split into pairs
                $value =~ s/\s+/ /g;                    # kill extra whitespace
                print ucfirst($key)."\t$value\n";       # print each one
            }
            print "\n";
            $str = "";                                  # zero the input buffer
        }
        else {
            $str .= " " . $_;                           # accumulate data
        }
    }
    __DATA__
    TITLE An excitatory scorpion toxin with a distinctive feature: an
          additional alpha helix at the C terminus and its implications
          for interaction with insect sodium channels
    /interaction_site="Q8, N9, Y10, N11, C12, F17, W38, R58, V59 and K62
          form the putative bioactive surface in mature toxin
          (Zilberberg et al., 1997)."
    /channel="Sodium channel"
    /target_cell="Insect specific (Excitatory)"
    /c_end="Free"
    //
    TITLE Cloning and Sequencing of an Excitatory Insect-Selective
          Neurotoxin BmKIT cDNA from Buthus martensii Karsch
    /interaction_site="Sequential deletions of C-terminal residues
          suggested Ile73 and Ile74 for toxicity. {Oren et al., 1999}"
    /channel="Sodium channel"
    /c_end="Free"
    //


    This should also print the tags regardless of the number of lines. hope it helps.
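
    The record-at-a-time idea can also be sketched with Perl's input record separator $/, so that each read returns one whole record and no accumulation buffer is needed. This is only a sketch under assumptions: no field value contains the sequence //, every field has the form /key="value", and the helper names parse_record and read_records are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one "//"-terminated record into an ordered
# list of [label, value] pairs.
sub parse_record {
    my ($record) = @_;
    $record =~ s{//\s*\z}{};     # drop the trailing record terminator
    $record =~ s/\s+/ /g;        # fold CR/LF and runs of blanks into one space
    my @fields;

    # The title is everything between TITLE and the first /key= field.
    if ($record =~ m{\bTITLE\s*(.*?)\s*(?=/\w+=|\z)}) {
        push @fields, [ 'References:TITLE', qq("$1") ];
    }
    # Assumption: each remaining field is /key="value" with no embedded quotes.
    while ($record =~ m{/(\w+)="([^"]*)"}g) {
        push @fields, [ ucfirst $1, qq("$2") ];
    }
    return @fields;
}

# Hypothetical helper: read every record from an open filehandle.
sub read_records {
    my ($fh) = @_;
    local $/ = "//";             # each read stops right after a "//"
    my @records;
    while (my $chunk = <$fh>) {
        next unless $chunk =~ /TITLE/;   # skip the blank tail after the last "//"
        push @records, [ parse_record($chunk) ];
    }
    return @records;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!\n";
    for my $record (read_records($fh)) {
        print "$_->[0]\t $_->[1]\n" for @$record;
        print "\n";
    }
}
```

    Because whitespace is folded before matching, a field prints the same way whether it holds " ", one line, or several lines.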

    -pete
    Entropy is not what is used to be.
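
On the original question of where to check the siteflag: one way to avoid a flag per field is to keep a single "current field" buffer and flush it whenever the next field, or the // terminator, begins. The sketch below assumes that continuation lines never start with /key= or TITLE. The name convert_lines and the closure $flush are illustrative, not taken from gdnew's program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the input lines, returns the output lines.
sub convert_lines {
    my @lines = @_;
    my @out;
    my ($label, $value);         # the field currently being accumulated

    # Emit the pending field, if there is one, then forget it.
    my $flush = sub {
        return unless defined $label;
        # Titles are unquoted in the input, so quote them on output.
        $value = qq("$value") if $label eq 'References:TITLE';
        push @out, "$label\t $value\n";
        undef $label;
    };

    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g; # trim blanks, including the CR/LF line ending
        if ($line eq '//') {                          # end of record
            $flush->();
        }
        elsif ($line =~ /^TITLE\b\s*(.*)$/) {         # a title starts
            $flush->();
            ($label, $value) = ('References:TITLE', $1);
        }
        elsif ($line =~ m{^/(\w+)=(.*)$}) {           # a /key=value field starts
            $flush->();
            ($label, $value) = (ucfirst $1, $2);
        }
        elsif (defined $label && length $line) {      # continuation line
            $value = length $value ? "$value $line" : $line;
        }
    }
    $flush->();                  # in case the last record lacks "//"
    return @out;
}

# Usage (file names are illustrative): perl convert.pl input.db > result
print convert_lines(<>) if @ARGV;
```

Every field passes through $flush exactly once, so that closure is also the single place where a filter such as the original check2 test for " " could be applied.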