david_lyon has asked for the wisdom of the Perl Monks concerning the following question:

Bows.... I am downloading these files: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/*.gz and parsing them with the following ugly, ugly command in my Perl code. Does anyone know of a more elegant Perl way to do this? Bows and thank you!
$cmd = "zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\' \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.*\/\/g\' | sort | uniq >VARIABLES";
print "$cmd\n";
system($cmd);

Replies are listed 'Best First'.
Re: How would you do this?
by JavaFan (Canon) on Jun 13, 2010 at 02:08 UTC
    I'd start by removing all the unnecessary backslashes, then look at the command string again, and see what it actually does.

    After that, I may rewrite it into pure Perl, but that sounds too much like work - if it works as is, why bother?

    But lose the redundant backslashes. That alone would remove more than half of the ugliness factor.

    Oh, and for the next time, come up with a better title.
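
For illustration, this is roughly what the command string could look like once the redundant backslashes are gone. It is a sketch, not the OP's code. It assumes zcat, grep -P, sort and uniq are on the PATH, uses q{} so that Perl leaves the string alone, and also drops the backslashes in front of !, _ and = inside the patterns, since those characters match themselves either way. The guard around system is an added assumption so that nothing runs when no .gz files are present.

```perl
use strict;
use warnings;

# q{} is a single-quoted string, so pipes, quotes and slashes need no escaping.
my $cmd = q{zcat *.gz | grep -P '!dataset' | grep -P '=' }
        . q{| perl -pi -e 's/^!dataset_|=.*//g' | sort | uniq >VARIABLES};

print "$cmd\n";

# Assumption: only run the pipeline when there is something to read.
my @gz = glob '*.gz';
if (@gz) {
    system($cmd) == 0 or warn "pipeline failed with status $?\n";
}
```

The printed command is the same pipeline the shell would have received from the original string, minus the escapes that were no-ops.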

Re: How would you do this?
by Utilitarian (Vicar) on Jun 13, 2010 at 09:15 UTC
    Hi David, if I understand you correctly, you want to unzip the files and then pick out the unique name=value entries from all lines starting with !dataset_, i.e. !dataset_$name=$value

    IO::Uncompress::Gunzip would allow you to access the data within your Perl program. You could then treat the archives as ordinary files, check whether each line starts with the correct marker and, if it does, extract the required data.

    Hope that helps

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
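
Utilitarian's IO::Uncompress::Gunzip suggestion might look roughly like the sketch below. The helper name dataset_names_from_gz is hypothetical, and the code assumes the GDS .gz files sit in the current directory. Unlike the OP's pipeline, it anchors the match at the start of the line and trims the whitespace before the "=" sign.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Return the sorted, unique names found after "!dataset_" in the given .gz files.
# (dataset_names_from_gz is a hypothetical helper, not part of any module.)
sub dataset_names_from_gz {
    my @files = @_;
    my %seen;
    for my $file (@files) {
        # MultiStream => 1 keeps reading if a file holds several gzip members.
        my $gz = IO::Uncompress::Gunzip->new( $file, MultiStream => 1 )
            or die "Cannot open $file: $GunzipError\n";
        while ( defined( my $line = $gz->getline ) ) {
            # Keep only "!dataset_<name> = <value>" lines and capture <name>.
            next unless $line =~ /^!dataset_([^=]+?)\s*=/;
            $seen{$1} = 1;
        }
        $gz->close;
    }
    return sort keys %seen;
}

# Assumed usage: every .gz file in the current directory is a GDS SOFT file.
print "$_\n" for dataset_names_from_gz( glob '*.gz' );
```

Because everything stays inside the Perl process, there is no need for the intermediate VARIABLES file unless something else reads it.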
Re: How would you do this?
by Khen1950fx (Canon) on Jun 13, 2010 at 04:24 UTC
    Your script doesn't do anything. First, you didn't declare $cmd. That should be my $cmd = . Second, "zcat *gz": What do you want it to do? Did you have an option in mind? More information there would be helpful. Third, since you're running system commands, you'll need to use perl's "system" command. The print command would come after that. Hence,
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $cmd = system("zcat -l /path/to/.gz/files");
    print Dumper($cmd);

      Second, "zcat *gz": What do you want it to do?

      $ zcat foo.gz bar

      you'll need to use perl's "system" command

      He is using system. If anything, he shouldn't be using system as it's forcing him to use a temporary file (assuming he's simply creating VARIABLES in order to read it in).

      First, you didn't declare $cmd. That should be my $cmd = .
      Uhm, no. This is Perl, remember. You don't have to declare anything. Besides, the OP was showing a code snippet. He might as well have your precious "my $cmd" somewhere in the code he (rightfully so) did not post.
      Second, "zcat *gz": What do you want it to do? Did you have an option in mind? More information there would be helpful.
      First, you misquote. It's zcat *.gz. Second, it's quite obvious that he wants to run zcat on all the gzipped files in the current directory. No need to further explain that.
      Third, since you're running system commands, you'll need to use perl's "system" command.
      That's neither correct, nor is your remark useful. system is not the only way to run external commands (backticks, exec and open do as well), but the OP is using system.
      The print command would come after that.
      Really? The OP is just printing the command to be executed. There's no need to first execute the command, then print it.
      my $cmd = system("zcat -l /path/to/.gz/files"); print Dumper($cmd);
      Now, that's a program that does little that is useful. Unlike the code of the OP, which actually does something with the content of the uncompressed files, yours just dumps it to the screen. And then you use Data::Dumper to print out the value of an integer.

      I'd say all your suggestions are utter rubbish and worthless.
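
The point about system returning an integer can be seen in a small sketch. It assumes nothing about zcat: $^X (the running perl) stands in for an arbitrary external command, purely so the example runs anywhere. Backticks capture output in a similar way to the read pipe shown here, but they go through the shell.

```perl
use strict;
use warnings;

# Assumption: $^X (the running perl) stands in for any external command.
my @child = ( $^X, '-e', 'print qq{hello\n}' );

# system() lets the child write straight to our STDOUT and returns its wait status.
my $status = system(@child);
print "system returned: $status\n";

# A read pipe ('-|') hands the child's output to the program instead.
open( my $pipe, '-|', @child ) or die "Cannot start child: $!";
my @captured = <$pipe>;
close $pipe or warn "child exited with status $?\n";
print "captured: $_" for @captured;
```

So dumping the result of system only ever shows the status, never the text the command printed.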

Re: How would you do this?
by david_lyon (Sexton) on Jun 13, 2010 at 13:27 UTC

    Bows... Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files in the format shown below and parsing each of them with the command below. Bows, and thanks a lot again.

    $cmd = "zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\' \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.*\/\/g\' | sort | uniq >VARIABLES";
    print "$cmd\n";
    system($cmd);

    Contents of VARIABLES would be everything between "!dataset_" and " = .*"

    Contents of compressed files:

    ^DATABASE = Geo
    !Database_name = Gene Expression Omnibus (GEO)
    !Database_institute = NCBI NLM NIH
    !Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
    !Database_email = geo@ncbi.nlm.nih.gov
    !Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
    ^DATASET = GDS100
    !dataset_title = UV exposure time course (ecoli_8.0)
    !dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
    !dataset_type = gene expression array-based
    !dataset_pubmed_id = 11333217
    !dataset_platform = GPL18
    !dataset_platform_organism = Escherichia coli
    !dataset_platform_technology_type = spotted DNA/cDNA
    !dataset_feature_count = 5764
    !dataset_sample_organism = Escherichia coli
    !dataset_sample_type = RNA
    !dataset_channel_count = 2
    !dataset_sample_count = 8
    !dataset_value_type = log ratio
    !dataset_reference_series = GSE9
    !dataset_order = none
    !dataset_update_date = Apr 06 2003
    ^SUBSET = GDS100_1
    !subset_dataset_id = GDS100
    !subset_description = irradiated
    !subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
    !subset_type = protocol
    ^SUBSET = GDS100_2
    !subset_dataset_id = GDS100
    !subset_description = not irradiated
    !subset_sample_id = GSM542,GSM543,GSM549
    !subset_type = protocol
    ^SUBSET = GDS100_3
    !subset_dataset_id = GDS100
    !subset_description = 5 minute
    !subset_sample_id = GSM547
    !subset_type = time
    ^SUBSET = GDS100_4
    !subset_dataset_id = GDS100
    !subset_description = 10 minute
    !subset_sample_id = GSM544
    !subset_type = time
    ^SUBSET = GDS100_5
    !subset_dataset_id = GDS100
    !subset_description = 20 minute
    !subset_sample_id = GSM545,GSM542
    !subset_type = time
    ^SUBSET = GDS100_6
    !subset_dataset_id = GDS100
    !subset_description = 40 minute
    !subset_sample_id = GSM546
    !subset_type = time
    ^SUBSET = GDS100_7
    !subset_dataset_id = GDS100
    !subset_description = 60 minute
    !subset_sample_id = GSM548,GSM543
    !subset_type = time
    ^SUBSET = GDS100_8
    !subset_dataset_id = GDS100
    !subset_description = 0 minute
    !subset_sample_id = GSM549
    !subset_type = time
    ^DATASET = GDS100
    #ID_REF = Platform reference identifier
    #IDENTIFIER = identifier
    #GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
    #GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
    #GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
    #GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    #GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
    !dataset_table_begin
    ID_REF  IDENTIFIER  GSM549  GSM542  GSM543  GSM547  GSM544  GSM545  GSM546  GSM548
    1       EMPTY       0.211   0.240   0.306   0.098   0.101   0.208   0.167   0.190
    2       EMPTY       0.045   0.097   0.142   0.107   0.074   0.202   0.019   0.266
    3       EMPTY       0.191   0.243   0.312   0.023   0.158   0.261   0.255   0.128
    4       EMPTY       -0.013  -0.041  0.112   -0.028  0.175   0.111   0.139   0.137
    etc

      Untested, but this might get you close:

      #! perl -slw
      use strict;

      open ZCAT, '|-', 'zcat *.gz' or die;

      my %uniq;
      while( <ZCAT> ) {
          my( $varname ) = m[!dataset_([^=]+)=] or next;
          $uniq{ $varname } = 1;
      }
      close ZCAT;

      print "uniq names sorted";
      print for sort keys %uniq;

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Thanks BrowserUk. The parsing works nicely... thanks. Only one issue: how do I stop this line from spitting out the contents of the files to the terminal, or is it something in my setup? It doesn't pass the data to ZCAT; it just spits it out to the terminal.
        open ZCAT, '|-', 'zcat *.gz' or die;
        Thanks once again
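
The mode string is the likely culprit rather than the setup. In Perl's three-argument open, '|-' opens a pipe for writing to the command, so zcat keeps the terminal as its output; '-|' opens the pipe for reading what the command prints. A minimal sketch of the read-pipe version follows. It assumes zcat is on the PATH, unique_dataset_names is a hypothetical helper, and (unlike the regex in the posted code) it anchors at the start of the line and trims the whitespace before "=".

```perl
use strict;
use warnings;

# Collect the unique names that follow "!dataset_" from any readable handle.
# (unique_dataset_names is a hypothetical helper introduced for this sketch.)
sub unique_dataset_names {
    my ($fh) = @_;
    my %uniq;
    while ( my $line = <$fh> ) {
        # Capture the name between "!dataset_" and "=", minus padding.
        my ($name) = $line =~ /^!dataset_([^=]+?)\s*=/ or next;
        $uniq{$name} = 1;
    }
    return sort keys %uniq;
}

# '-|' runs the command and lets us READ its output;
# '|-' would let us WRITE to the command's input instead.
my @gz = glob '*.gz';
if (@gz) {
    # List form passes the file names directly, without going through the shell.
    open( my $zcat, '-|', 'zcat', @gz ) or die "Cannot start zcat: $!";
    print "$_\n" for unique_dataset_names($zcat);
    close $zcat or warn "zcat exited with status $?\n";
}
```

Keeping the parsing in a helper that takes a filehandle means the same loop works on a pipe, a plain file, or an in-memory string.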