Re: How would you do this?

Bows... Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files whose contents look like the sample further down, and I am parsing each of those files with the command below. Bows. Thanks a lot again.

$cmd =
"zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\'  \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.*\/\/g\' | sort | uniq  >VARIABLES";
print "$cmd\n";
system($cmd);

The VARIABLES file should contain the text between "!dataset_" and " = ..." on each matching line (for example title, description, type).
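For what it's worth, the same extraction can be sketched in pure Perl without the grep/sort/uniq chain. This is only a minimal sketch under some assumptions: `dataset_variables` is a hypothetical helper name, the lines are passed in from memory rather than read from zcat, and the regex drops the whitespace before the "=" (the shell version would keep a trailing space on each name).

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted, unique names that follow
# "!dataset_" on lines of the form "!dataset_NAME = VALUE".
sub dataset_variables {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # Capture the name; skip lines without "!dataset_NAME =".
        next unless $line =~ /^!dataset_(\w+)\s*=/;
        $seen{$1} = 1;
    }
    my @names = sort keys %seen;
    return @names;
}

# A few lines in the style of the sample data, including a duplicate,
# a non-dataset line, and a "!dataset_" line with no "=".
my @sample = (
    '^DATASET = GDS100',
    '!dataset_title = UV exposure time course (ecoli_8.0)',
    '!dataset_type = gene expression array-based',
    '!dataset_type = gene expression array-based',
    '!subset_type = protocol',
    '!dataset_table_begin',
);

print "$_\n" for dataset_variables(@sample);
```

Keeping the names as hash keys gives the uniqueness for free, so only the final sort is needed.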

Contents of compressed files:

^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS100
!dataset_title = UV exposure time course (ecoli_8.0)
!dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
!dataset_type = gene expression array-based
!dataset_pubmed_id = 11333217
!dataset_platform = GPL18
!dataset_platform_organism = Escherichia coli
!dataset_platform_technology_type = spotted DNA/cDNA
!dataset_feature_count = 5764
!dataset_sample_organism = Escherichia coli
!dataset_sample_type = RNA
!dataset_channel_count = 2
!dataset_sample_count = 8
!dataset_value_type = log ratio
!dataset_reference_series = GSE9
!dataset_order = none
!dataset_update_date = Apr 06 2003
^SUBSET = GDS100_1
!subset_dataset_id = GDS100
!subset_description = irradiated
!subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
!subset_type = protocol
^SUBSET = GDS100_2
!subset_dataset_id = GDS100
!subset_description = not irradiated
!subset_sample_id = GSM542,GSM543,GSM549
!subset_type = protocol
^SUBSET = GDS100_3
!subset_dataset_id = GDS100
!subset_description = 5 minute
!subset_sample_id = GSM547
!subset_type = time
^SUBSET = GDS100_4
!subset_dataset_id = GDS100
!subset_description = 10 minute
!subset_sample_id = GSM544
!subset_type = time
^SUBSET = GDS100_5
!subset_dataset_id = GDS100
!subset_description = 20 minute
!subset_sample_id = GSM545,GSM542
!subset_type = time
^SUBSET = GDS100_6
!subset_dataset_id = GDS100
!subset_description = 40 minute
!subset_sample_id = GSM546
!subset_type = time
^SUBSET = GDS100_7
!subset_dataset_id = GDS100
!subset_description = 60 minute
!subset_sample_id = GSM548,GSM543
!subset_type = time
^SUBSET = GDS100_8
!subset_dataset_id = GDS100
!subset_description = 0 minute
!subset_sample_id = GSM549
!subset_type = time
^DATASET = GDS100
#ID_REF = Platform reference identifier
#IDENTIFIER = identifier
#GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
#GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
!dataset_table_begin
ID_REF    IDENTIFIER    GSM549    GSM542    GSM543    GSM547    GSM544    GSM545    GSM546    GSM548
1    EMPTY    0.211    0.240    0.306    0.098    0.101    0.208    0.167    0.190
2    EMPTY    0.045    0.097    0.142    0.107    0.074    0.202    0.019    0.266
3    EMPTY    0.191    0.243    0.312    0.023    0.158    0.261    0.255    0.128
4    EMPTY    -0.013    -0.041    0.112    -0.028    0.175    0.111    0.139    0.137
etc

Re^2: How would you do this? by BrowserUk (Patriarch) on Jun 13, 2010 at 13:34 UTC
Untested, but this might get you close:

#! perl -slw
use strict;

open ZCAT, '|-', 'zcat *.gz' or die;

my %uniq;
while( <ZCAT> ) {
    my( $varname ) = m[!dataset_([^=]+)=] or next;
    $uniq{ $varname } = 1;
}
close ZCAT;

print "uniq names sorted";
print for sort keys %uniq;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy
Re^3: How would you do this? by david_lyon (Sexton) on Jun 13, 2010 at 15:47 UTC
Thanks BrowserUk. The parsing works nicely, thanks. Only one issue: how do I stop this line from spitting the contents of the files out to the terminal, or is it something in my setup? It doesn't pass the data to ZCAT, it just prints it to the terminal. `open ZCAT, '|-', 'zcat *.gz' or die;` Thanks once again.
Re^4: How would you do this? by BrowserUk (Patriarch) on Jun 13, 2010 at 15:51 UTC
Sorry, that should be: `open ZCAT, '-|', 'zcat *.gz' or die;` Interesting how you tested the parsing without knowing how to correct that?
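To spell out the difference in that correction: with mode '-|' Perl reads from the command's standard output, whereas '|-' opens the command for writing to its standard input, which is why the decompressed data went straight to the terminal. Below is a minimal sketch of the reading direction, under some assumptions: `read_from_command` is a hypothetical helper name, it uses a lexical filehandle and the list form of open (no shell involved), and the running perl binary (`$^X`) stands in for zcat so the example does not depend on any .gz files.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its output lines.
# '-|' means "read from the command's stdout"; '|-' would instead
# open the command for writing to its stdin.
sub read_from_command {
    my @command = @_;
    open my $fh, '-|', @command
        or die "Cannot run '@command': $!";
    chomp( my @lines = <$fh> );
    close $fh
        or die "'@command' failed: $! (exit status $?)";
    return @lines;
}

# $^X is the path of the running perl, used here as a stand-in for zcat.
my @lines = read_from_command( $^X, '-e', 'print "one\n"; print "two\n"' );
print "$_\n" for @lines;
```

One caveat with the list form: since no shell is involved, a wildcard such as `*.gz` is not expanded, so a real call would need something like `read_from_command( 'zcat', glob('*.gz') )`.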
Re^5: How would you do this? by david_lyon (Sexton) on Jun 13, 2010 at 16:24 UTC