in reply to How would you do this?
Bows... Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files whose format is shown further down, and parsing each of those files with the command below. Bows... Thanks a lot again.
    $cmd = "zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\' \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.*\/\/g\' | sort | uniq >VARIABLES";
    print "$cmd\n";
    system($cmd);
Contents of VARIABLES would be everything between "!dataset_" and " = .*", i.e. the unique field names (title, description, type and so on) with the values stripped off.
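For what it's worth, the same extraction can be sketched as a plain shell function, which sidesteps the layers of backslash escaping inside the Perl string. This is only a sketch under assumptions: `extract_dataset_keys` is a made-up name, it assumes every field line looks like `!dataset_<key> = <value>`, and unlike the one-liner above it also trims the whitespace in front of the `=`.

```shell
# Hypothetical helper (name is an assumption, not from the original post):
# print the distinct "!dataset_*" field names found in all .gz files of a directory.
extract_dataset_keys() {
    dir="${1:-.}"
    # gzip -dc is the portable spelling of zcat.
    gzip -dc "$dir"/*.gz |
        # Keep only "!dataset_<key> = <value>" lines and print just <key>.
        # Lines without "=" (e.g. "!dataset_table_begin") are dropped.
        sed -n 's/^!dataset_\([^ =]*\)[[:space:]]*=.*/\1/p' |
        # sort -u does the job of "sort | uniq" in one step.
        sort -u
}

# Possible usage, run in the download directory:
#   extract_dataset_keys . > VARIABLES
```

Run against the sample file shown in this post, this should list one field name per line (channel_count, description, feature_count and so on), without the "!dataset_" prefix.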
Contents of compressed files:
    ^DATABASE = Geo
    !Database_name = Gene Expression Omnibus (GEO)
    !Database_institute = NCBI NLM NIH
    !Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
    !Database_email = geo@ncbi.nlm.nih.gov
    !Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
    ^DATASET = GDS100
    !dataset_title = UV exposure time course (ecoli_8.0)
    !dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
    !dataset_type = gene expression array-based
    !dataset_pubmed_id = 11333217
    !dataset_platform = GPL18
    !dataset_platform_organism = Escherichia coli
    !dataset_platform_technology_type = spotted DNA/cDNA
    !dataset_feature_count = 5764
    !dataset_sample_organism = Escherichia coli
    !dataset_sample_type = RNA
    !dataset_channel_count = 2
    !dataset_sample_count = 8
    !dataset_value_type = log ratio
    !dataset_reference_series = GSE9
    !dataset_order = none
    !dataset_update_date = Apr 06 2003
    ^SUBSET = GDS100_1
    !subset_dataset_id = GDS100
    !subset_description = irradiated
    !subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
    !subset_type = protocol
    ^SUBSET = GDS100_2
    !subset_dataset_id = GDS100
    !subset_description = not irradiated
    !subset_sample_id = GSM542,GSM543,GSM549
    !subset_type = protocol
    ^SUBSET = GDS100_3
    !subset_dataset_id = GDS100
    !subset_description = 5 minute
    !subset_sample_id = GSM547
    !subset_type = time
    ^SUBSET = GDS100_4
    !subset_dataset_id = GDS100
    !subset_description = 10 minute
    !subset_sample_id = GSM544
    !subset_type = time
    ^SUBSET = GDS100_5
    !subset_dataset_id = GDS100
    !subset_description = 20 minute
    !subset_sample_id = GSM545,GSM542
    !subset_type = time
    ^SUBSET = GDS100_6
    !subset_dataset_id = GDS100
    !subset_description = 40 minute
    !subset_sample_id = GSM546
    !subset_type = time
    ^SUBSET = GDS100_7
    !subset_dataset_id = GDS100
    !subset_description = 60 minute
    !subset_sample_id = GSM548,GSM543
    !subset_type = time
    ^SUBSET = GDS100_8
    !subset_dataset_id = GDS100
    !subset_description = 0 minute
    !subset_sample_id = GSM549
    !subset_type = time
    ^DATASET = GDS100
    #ID_REF = Platform reference identifier
    #IDENTIFIER = identifier
    #GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
    #GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
    #GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
    #GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    !dataset_table_begin
    ID_REF IDENTIFIER GSM549 GSM542 GSM543 GSM547 GSM544 GSM545 GSM546 GSM548
    1 EMPTY 0.211 0.240 0.306 0.098 0.101 0.208 0.167 0.190
    2 EMPTY 0.045 0.097 0.142 0.107 0.074 0.202 0.019 0.266
    3 EMPTY 0.191 0.243 0.312 0.023 0.158 0.261 0.255 0.128
    4 EMPTY -0.013 -0.041 0.112 -0.028 0.175 0.111 0.139 0.137
    etc