Bows...
Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files in the format shown further below, and parsing each of those files with the following command. Thanks a lot again.
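# Collect the unique "!dataset_" field names from all compressed files.
# -i is dropped because perl is reading from a pipe here, not editing a file.
my $cmd =
q{zcat *.gz | grep '^!dataset_.*=' | perl -pe 's/^!dataset_|\s*=.*//g' | sort -u > VARIABLES};
print "$cmd\n";
system($cmd) == 0 or die "Command failed: $?\n";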
The contents of VARIABLES would be everything between
"!dataset_" and " = .*", i.e. one field name per line (title, description, type, and so on).
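For comparison, the same extraction can be sketched in pure Perl, without the grep/sort/uniq pipeline. This is only a minimal sketch: the helper name dataset_field_names and the @sample array are made up for illustration, and the sample lines are copied from the file contents in this post.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original command): returns the
# sorted, unique field names found on "!dataset_<name> = <value>" lines.
sub dataset_field_names {
    my (@lines) = @_;
    my %seen;
    for my $line (@lines) {
        # Capture the text between "!dataset_" and the "=" separator.
        # Lines without "=", such as "!dataset_table_begin", do not match.
        $seen{$1} = 1 if $line =~ /^!dataset_(\w+)\s*=/;
    }
    my @names = sort keys %seen;
    return @names;
}

# A few lines taken from the sample file shown in this post.
my @sample = (
    '^DATASET = GDS100',
    '!dataset_title = UV exposure time course (ecoli_8.0)',
    '!dataset_type = gene expression array-based',
    '!dataset_platform = GPL18',
    '!subset_type = protocol',
    '!dataset_table_begin',
);

print "$_\n" for dataset_field_names(@sample);
```

To run this over the real files, the lines would have to come from the decompressed data, for example by reading from a zcat pipe or through the core IO::Uncompress::Gunzip module. That part is left out here because it depends on how the files are downloaded and named.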
Contents of compressed files:
^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS100
!dataset_title = UV exposure time course (ecoli_8.0)
!dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
!dataset_type = gene expression array-based
!dataset_pubmed_id = 11333217
!dataset_platform = GPL18
!dataset_platform_organism = Escherichia coli
!dataset_platform_technology_type = spotted DNA/cDNA
!dataset_feature_count = 5764
!dataset_sample_organism = Escherichia coli
!dataset_sample_type = RNA
!dataset_channel_count = 2
!dataset_sample_count = 8
!dataset_value_type = log ratio
!dataset_reference_series = GSE9
!dataset_order = none
!dataset_update_date = Apr 06 2003
^SUBSET = GDS100_1
!subset_dataset_id = GDS100
!subset_description = irradiated
!subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
!subset_type = protocol
^SUBSET = GDS100_2
!subset_dataset_id = GDS100
!subset_description = not irradiated
!subset_sample_id = GSM542,GSM543,GSM549
!subset_type = protocol
^SUBSET = GDS100_3
!subset_dataset_id = GDS100
!subset_description = 5 minute
!subset_sample_id = GSM547
!subset_type = time
^SUBSET = GDS100_4
!subset_dataset_id = GDS100
!subset_description = 10 minute
!subset_sample_id = GSM544
!subset_type = time
^SUBSET = GDS100_5
!subset_dataset_id = GDS100
!subset_description = 20 minute
!subset_sample_id = GSM545,GSM542
!subset_type = time
^SUBSET = GDS100_6
!subset_dataset_id = GDS100
!subset_description = 40 minute
!subset_sample_id = GSM546
!subset_type = time
^SUBSET = GDS100_7
!subset_dataset_id = GDS100
!subset_description = 60 minute
!subset_sample_id = GSM548,GSM543
!subset_type = time
^SUBSET = GDS100_8
!subset_dataset_id = GDS100
!subset_description = 0 minute
!subset_sample_id = GSM549
!subset_type = time
^DATASET = GDS100
#ID_REF = Platform reference identifier
#IDENTIFIER = identifier
#GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
#GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
!dataset_table_begin
ID_REF  IDENTIFIER  GSM549  GSM542  GSM543  GSM547  GSM544  GSM545  GSM546  GSM548
1       EMPTY       0.211   0.240   0.306   0.098   0.101   0.208   0.167   0.190
2       EMPTY       0.045   0.097   0.142   0.107   0.074   0.202   0.019   0.266
3       EMPTY       0.191   0.243   0.312   0.023   0.158   0.261   0.255   0.128
4       EMPTY       -0.013  -0.041  0.112   -0.028  0.175   0.111   0.139   0.137
etc