in reply to How would you do this?

Bows... Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files in the format shown below, and parsing each of them with the command below. Bows. Thanks a lot again.

$cmd = "zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\' \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.*\/\/g\' | sort | uniq >VARIABLES";
print "$cmd\n";
system($cmd);

Contents of VARIABLES would be everything between "!dataset_" and " = .*"
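
For comparison, the same extraction can be written as a plain shell pipeline, which avoids the backslash escaping needed inside the Perl string. This is only a minimal sketch: the function name `extract_dataset_vars` is hypothetical, the patterns are anchored to the start of the line (the original greps are not), and the whitespace before the `=` is trimmed (the original substitution leaves it in place).

```shell
#!/bin/sh
# Hypothetical helper: reads SOFT-format text on stdin and prints the unique
# names that follow "!dataset_" on lines of the form "!dataset_NAME = value".
extract_dataset_vars() {
    grep '^!dataset_[^=]*=' |
        sed -e 's/^!dataset_//' -e 's/[[:space:]]*=.*$//' |
        sort -u
}

# Small inline sample standing in for the decompressed files.
sample='^DATASET = GDS100
!dataset_title = UV exposure time course (ecoli_8.0)
!dataset_type = gene expression array-based
!subset_type = protocol
!dataset_type = gene expression array-based
!dataset_table_begin'

# Lines without "=" (such as !dataset_table_begin) and non-dataset lines are skipped.
printf '%s\n' "$sample" | extract_dataset_vars
```

Against the real files this would be run as `zcat *.gz | extract_dataset_vars > VARIABLES`.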

Contents of compressed files:

^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS100
!dataset_title = UV exposure time course (ecoli_8.0)
!dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
!dataset_type = gene expression array-based
!dataset_pubmed_id = 11333217
!dataset_platform = GPL18
!dataset_platform_organism = Escherichia coli
!dataset_platform_technology_type = spotted DNA/cDNA
!dataset_feature_count = 5764
!dataset_sample_organism = Escherichia coli
!dataset_sample_type = RNA
!dataset_channel_count = 2
!dataset_sample_count = 8
!dataset_value_type = log ratio
!dataset_reference_series = GSE9
!dataset_order = none
!dataset_update_date = Apr 06 2003
^SUBSET = GDS100_1
!subset_dataset_id = GDS100
!subset_description = irradiated
!subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
!subset_type = protocol
^SUBSET = GDS100_2
!subset_dataset_id = GDS100
!subset_description = not irradiated
!subset_sample_id = GSM542,GSM543,GSM549
!subset_type = protocol
^SUBSET = GDS100_3
!subset_dataset_id = GDS100
!subset_description = 5 minute
!subset_sample_id = GSM547
!subset_type = time
^SUBSET = GDS100_4
!subset_dataset_id = GDS100
!subset_description = 10 minute
!subset_sample_id = GSM544
!subset_type = time
^SUBSET = GDS100_5
!subset_dataset_id = GDS100
!subset_description = 20 minute
!subset_sample_id = GSM545,GSM542
!subset_type = time
^SUBSET = GDS100_6
!subset_dataset_id = GDS100
!subset_description = 40 minute
!subset_sample_id = GSM546
!subset_type = time
^SUBSET = GDS100_7
!subset_dataset_id = GDS100
!subset_description = 60 minute
!subset_sample_id = GSM548,GSM543
!subset_type = time
^SUBSET = GDS100_8
!subset_dataset_id = GDS100
!subset_description = 0 minute
!subset_sample_id = GSM549
!subset_type = time
^DATASET = GDS100
#ID_REF = Platform reference identifier
#IDENTIFIER = identifier
#GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
#GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
!dataset_table_begin
ID_REF  IDENTIFIER  GSM549  GSM542  GSM543  GSM547  GSM544  GSM545  GSM546  GSM548
1       EMPTY       0.211   0.240   0.306   0.098   0.101   0.208   0.167   0.190
2       EMPTY       0.045   0.097   0.142   0.107   0.074   0.202   0.019   0.266
3       EMPTY       0.191   0.243   0.312   0.023   0.158   0.261   0.255   0.128
4       EMPTY       -0.013  -0.041  0.112   -0.028  0.175   0.111   0.139   0.137
etc

Re^2: How would you do this?
by BrowserUk (Patriarch) on Jun 13, 2010 at 13:34 UTC

    Untested, but this might get you close:

    #! perl -slw
    use strict;

    open ZCAT, '|-', 'zcat *.gz' or die;

    my %uniq;
    while( <ZCAT> ) {
        my( $varname ) = m[!dataset_([^=]+)=] or next;
        $uniq{ $varname } = 1;
    }
    close ZCAT;

    print "uniq names sorted";
    print for sort keys %uniq;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks BrowserUk. The parsing works nicely, thanks. Only one issue: how do I stop this line from spitting out the contents of the files to the terminal, or is it something in my setup? It doesn't pass the data to ZCAT; it just spits it out to the terminal.
      open ZCAT, '|-', 'zcat *.gz' or die;
      Thanks once again

        Sorry, that should be '-|' (read from the command's output) rather than '|-' (write to the command's input):

        open ZCAT, '-|', 'zcat *.gz' or die;

        Interesting how you tested the parsing without knowing how to correct that?

