I need to figure out how to complete this coding assignment and I am very lost. Please help. Here are the requirements.
As usual, a day in the life of a bioinformatician calls for making the large task small. We have data from the UniGene database for six different mammals, and the company you are working for needs tissue expression data for a given gene "on demand". Of course one could open the directory, look for the gene file, and then read through it to view the tissue expression, but this would become tedious very quickly. You are to write a program with command line flags. If you need to see how to use command line flags, see the solutions for assignment #2 and assignment #3.
The data for these six organisms can be found here (DO NOT COPY THESE FILES!). If you are reading this, do not copy these files, because they take up too much room on the server. Instead, create a variable and set it to the directory:
my $uniGene = '/data/PROGRAMMING/assignment4';
Now you can access the data using this convention:
my $infile = "$uniGene/$org/$gene"; # $org is the name of the organism we will search, $gene is the gene file.
Since we left the trailing / off of $uniGene, we can separate the variables with a /.
I always declare my directories in this fashion. Believe me, it will save you headaches later on.
Take a look at one of the files (note the file ending; it looks like something which should be stored in a variable):
less /data/PROGRAMMING/assignment4/Homo_sapiens/TIMM9.unigene
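One possible way to follow the path convention above is to keep the base directory and the file ending in variables and join the pieces in a single place. This is only a sketch: the names `$suffix` and `buildPath` are my own and are not part of the assignment.

```perl
use strict;
use warnings;

# Base directory from the assignment; note there is no trailing slash.
my $uniGene = '/data/PROGRAMMING/assignment4';

# The shared file ending, kept in one place ($suffix is a hypothetical name).
my $suffix = 'unigene';

# Hypothetical helper: build the full path to one gene file.
sub buildPath {
    my ($org, $gene) = @_;
    return "$uniGene/$org/$gene.$suffix";
}

print buildPath('Homo_sapiens', 'TIMM9'), "\n";
```

Run as is, this prints the same path used in the `less` command above.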
Your program should have two command line flags. Name the FIRST FLAG 'host'; it tells the program which directory to look in for data. One thing this program should do is take common names as well as scientific names, so:
Homo_sapiens or Homo sapiens or Human or Humans
Bos_tarus or Bos tarus or Cow or Cows
Equus_caballus or Equus caballus or Horse or Horses
Mus_musculus or Mus musculus or Mouse or mice
Ovis_aries or Ovis aries or Sheep or Sheeps
Rattus_norvegicus or Rattus norvegicus or Rat or Rats
This will allow a little flexibility in the flag, but if the directory does not exist, the user should be warned (see subroutine 3, below) and shown the directories which do exist. Tell the user that the search is case sensitive. We will learn later how to do case-insensitive matching.
Name the SECOND FLAG 'gene'. It will take a gene name like 'PWRN1', 'ESF1', 'PVRL1', etc. This flag will be used to see if the gene exists in the given host directory. If it does, it will be used for the data; if not, tell the user it does not exist and exit (see subroutine 4).
We will begin to modularize code, so if I tell you to write a subroutine, and you do not write a subroutine for part of the code, you will lose 5 points each time you fail to write a subroutine! Also, name the subroutine exactly how I name it. Finally, conform to the way I have you write the subroutine, and follow the outline for the subroutines. You should have a total of four subroutines for this assignment. Feel free to write additional subroutines if you feel it will help.
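A minimal sketch of how the two flags and the name aliases might be handled, using the core Getopt::Long module and a hash that maps every accepted spelling to its directory name. The names `%spellings`, `%hostAlias` and `resolveHost` are hypothetical, and the directory names are copied exactly as the assignment lists them.

```perl
use strict;
use warnings;
use Getopt::Long;

# Directory name => the other spellings the assignment accepts for it.
my %spellings = (
    Homo_sapiens      => ['Homo sapiens',      'Human', 'Humans'],
    Bos_tarus         => ['Bos tarus',         'Cow',   'Cows'],
    Equus_caballus    => ['Equus caballus',    'Horse', 'Horses'],
    Mus_musculus      => ['Mus musculus',      'Mouse', 'mice'],
    Ovis_aries        => ['Ovis aries',        'Sheep', 'Sheeps'],
    Rattus_norvegicus => ['Rattus norvegicus', 'Rat',   'Rats'],
);

# Flatten into a lookup table: any accepted spelling => directory name.
my %hostAlias;
for my $dir (keys %spellings) {
    for my $name ($dir, @{ $spellings{$dir} }) {
        $hostAlias{$name} = $dir;
    }
}

# Hypothetical helper: returns the directory name, or undef if unknown.
# The lookup is case sensitive, as the assignment requires.
sub resolveHost {
    my ($name) = @_;
    return $hostAlias{$name};
}

my ($host, $gene);
GetOptions('host=s' => \$host, 'gene=s' => \$gene)
    or die "Usage: $0 --host HOST --gene GENE\n";

if (defined $host) {
    my $dir = resolveHost($host);
    print defined $dir ? "Would search $dir\n" : "Unknown host '$host'\n";
}
```

A host name containing a space has to be quoted on the command line, for example `--host "Homo sapiens"`. In the real program the "unknown host" branch is where directoriesWhichExist would be called.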
Subroutines
1). Write a subroutine (call it getGeneData, called in scalar context) that receives two arguments: 1). A gene name. 2). A host name. This subroutine opens the file for the host and gene, extracts the list of tissues in which this gene is expressed and returns a reference to a sorted array of the tissues. Remember at this point the directory has been checked to make sure it exists, so you don't have to worry about it failing at this point, but you should still use the proper file opening check! Process the file line by line. Hint: In order to get the tissue(s), use this:
if(/^EXPRESS\s+(.*)/){
my $tissues = $1;
}
Don't worry right now about what's happening with this code. It's a regular expression, and whatever matches inside the parentheses is captured and placed in $1. Do understand that the scalar $tissues now contains all the tissues. You should know how to get those into an array and then subsequently sort the array in alphabetical order.
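A possible sketch of getGeneData built around the hint above. It assumes that the tissues on the EXPRESS line are separated by `|` characters and that every gene file ends in `.unigene`; check both against a real file before relying on them.

```perl
use strict;
use warnings;

my $uniGene = '/data/PROGRAMMING/assignment4';

# Returns a reference to a sorted array of tissues for one gene in one host.
sub getGeneData {
    my ($gene, $host) = @_;
    my $infile = "$uniGene/$host/$gene.unigene";

    # Proper file opening check, even though the directory was validated.
    open(my $fh, '<', $infile) or die "Cannot open $infile: $!\n";

    my @tissues;
    while (<$fh>) {
        chomp;
        if (/^EXPRESS\s+(.*)/) {
            my $tissues = $1;
            $tissues =~ s/\s+$//;    # drop trailing whitespace
            # Assumption: '|' separates the tissues on this line.
            push @tissues, split /\s*\|\s*/, $tissues;
        }
    }
    close $fh;

    return [ sort @tissues ];
}

# Example call (only meaningful where the data directory exists):
# my $tissuesRef = getGeneData('TGM1', 'Homo_sapiens');
```

The default `sort` compares strings, which gives alphabetical order as long as the tissue names share the same letter case.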
2). Write another subroutine (call it printOutput, called in void context) which receives three arguments: 1). An array reference which was returned from getGeneData. 2). The gene name searched. 3). The host name given at the CLI. This subroutine should print the tissue expression data for the gene. The output should have the format seen below (OUTPUT FORMAT).
3). Write another subroutine (call it directoriesWhichExist, called in void context) which receives 0 arguments. If the user asks for a directory that does not exist, this subroutine is called and prints out the directories which do exist, like we see above. If this subroutine is called, it exits the program.
4). Write the last subroutine (call it isValidGeneName, called in scalar context, since its return value is tested) which receives two arguments: 1). A gene name. 2). A host name. This subroutine checks that the given gene name exists for that host: if it does, it returns 1; otherwise it returns 0. You should then use this subroutine as follows:
if ( isValidGeneName($geneName, $host) ){
print "Found Gene Name for $host\n";
}
else{
print "This Gene Name does not exist for $host, exiting now\n";
exit;
}
This is a very useful programming convention b/c the subroutine can be used in decision statements, like we did above.
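The two checking subroutines (3 and 4) could be sketched roughly as follows. The helper `listHostDirs` is my own addition, split out so the directory listing can be used without exiting, and the `.unigene` ending is assumed for every gene file.

```perl
use strict;
use warnings;

my $uniGene = '/data/PROGRAMMING/assignment4';

# Hypothetical helper: sorted names of the subdirectories under $uniGene.
sub listHostDirs {
    opendir(my $dh, $uniGene) or die "Cannot open directory $uniGene: $!\n";
    my @dirs = sort grep { !/^\./ && -d "$uniGene/$_" } readdir $dh;
    closedir $dh;
    return @dirs;
}

# Takes no arguments; warns the user, lists the valid directories, exits.
sub directoriesWhichExist {
    print "That directory does not exist. Note that the search is case sensitive.\n";
    print "Directories which do exist:\n";
    print "  $_\n" for listHostDirs();
    exit;
}

# Returns 1 if the gene file exists for this host, otherwise 0.
sub isValidGeneName {
    my ($gene, $host) = @_;
    my $infile = "$uniGene/$host/$gene.unigene";
    return -e $infile ? 1 : 0;
}
```

Because isValidGeneName returns 1 or 0, it drops straight into the `if` statement shown above. The host directory itself can be checked the same way with `-d "$uniGene/$host"` before deciding whether to call directoriesWhichExist.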
OUTPUT FORMAT: (example is for the Human TGM1 gene):
In Homo sapiens, There are 41 tissues that TGM1 is expressed in:
1. adipose tissue
2. adult
3. bladder
4. bladder carcinoma
5. brain
6. breast (mammary gland) tumor
7. cervical tumor
8. cervix
9. colorectal tumor
10. embryoid body
11. embryonic tissue
12. esophageal tumor
13. esophagus
14. eye
15. fetus
16. germ cell tumor
17. head and neck tumor
18. intestine
19. kidney
20. kidney tumor
21. larynx
22. lung
23. mammary gland
24. mouth
25. muscle
26. neonate
27. non-neoplasia
28. normal
29. ovarian tumor
30. ovary
31. pancreas
32. pancreatic tumor
33. pharynx
34. placenta
35. skin
36. skin tumor
37. thymus
38. trachea
39. umbilical cord
40. uterine tumor
41. uterus
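A minimal sketch of printOutput that produces the format above. It assumes the host argument is the directory-style name (such as Homo_sapiens) and turns the underscore into a space for display. The gene name and tissue list in the sample call are made up for illustration and are not real data.

```perl
use strict;
use warnings;

# Prints the numbered tissue list for one gene; returns nothing useful.
sub printOutput {
    my ($tissuesRef, $gene, $host) = @_;

    # Show "Homo sapiens" rather than "Homo_sapiens" in the header.
    (my $hostName = $host) =~ tr/_/ /;

    my $count = scalar @$tissuesRef;
    print "In $hostName, There are $count tissues that $gene is expressed in:\n";

    my $n = 1;
    for my $tissue (@$tissuesRef) {
        print "$n. $tissue\n";
        $n++;
    }
}

# Sample call with made-up data (EXAMPLE is not a real gene name).
printOutput(['adult', 'brain', 'skin'], 'EXAMPLE', 'Homo_sapiens');
```

In the finished program the first argument would be the array reference returned by getGeneData.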
Any help would be amazing. I know this is a lot, but I am staring at a blank black page with no idea how to start.
If you reply in a private message that would be helpful too.
Thank you,
Joe