How would you do this?

david_lyon has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How would you do this? by JavaFan (Canon) on Jun 13, 2010 at 02:08 UTC
I'd start by removing all the unnecessary backslashes, then look at the command string again and see what it actually does. After that, I might rewrite it in pure Perl, but that sounds too much like work: if it works as is, why bother? But lose the redundant backslashes. That alone would remove more than half of the ugliness factor. Oh, and next time, come up with a better title.
Re: How would you do this? by Utilitarian (Vicar) on Jun 13, 2010 at 09:15 UTC
Hi David, If I understand you correctly, you want to unzip the files and then split them so that you have a unique name=value for all lines starting with !dataset_, i.e. `!dataset_$name=$value`. IO::Uncompress::Gunzip would let you access the data from within your Perl program; you could then treat the files as ordinary files, check that each line starts with the correct marker and, if it does, extract the required data. Hope that helps.

`print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."`
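The IO::Uncompress::Gunzip suggestion could be sketched roughly as below. This is a minimal illustration, not the poster's code: the `dataset_names` helper, the temporary sample file and its contents are all assumptions made up for the demo, and the regex assumes the `!dataset_NAME = VALUE` layout the thread describes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return the sorted, unique NAME parts of all
# "!dataset_NAME = VALUE" lines found in the given gzipped files.
sub dataset_names {
    my @files = @_;
    my %seen;
    for my $file (@files) {
        my $gz = IO::Uncompress::Gunzip->new($file)
            or die "Cannot open $file: $GunzipError\n";
        while ( defined( my $line = $gz->getline ) ) {
            # Capture the text between "!dataset_" and the "=" sign;
            # lines without "=" (e.g. "!dataset_table_begin") are skipped.
            next unless $line =~ /^!dataset_([^=\s]+)\s*=/;
            $seen{$1} = 1;
        }
        $gz->close;
    }
    return sort keys %seen;
}

# Demo on a tiny made-up sample written to a temporary directory.
my $dir    = tempdir( CLEANUP => 1 );
my $sample = "$dir/sample.gz";
my $text   = join "\n",
    '^DATASET = GDS100',
    '!dataset_title = UV exposure time course (ecoli_8.0)',
    '!dataset_type = gene expression array-based',
    '!dataset_type = gene expression array-based',
    '!subset_type = time',
    '';
gzip( \$text => $sample ) or die "gzip failed: $GzipError\n";

my @names = dataset_names($sample);
print "$_\n" for @names;
```

Against real downloads, something like `dataset_names( glob '*.gz' )` would take the place of the temporary sample file, with no external `zcat` and no intermediate file.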
Re: How would you do this? by Khen1950fx (Canon) on Jun 13, 2010 at 04:24 UTC
Your script doesn't do anything. First, you didn't declare $cmd. That should be `my $cmd =`. Second, `"zcat *gz"`: What do you want it to do? Did you have an option in mind? More information there would be helpful. Third, since you're running system commands, you'll need to use perl's "system" command. The print command would come after that. Hence, `#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my $cmd = system("zcat -l /path/to/.gz/files"); print Dumper($cmd);`
Re^2: How would you do this? by ikegami (Patriarch) on Jun 13, 2010 at 05:42 UTC
"Second, `zcat *gz`: What do you want it to do?"

`$ zcat foo.gz bar`

"you'll need to use perl's 'system' command"

He is using `system`. If anything, he shouldn't be using `system`, as it's forcing him to use a temporary file (assuming he's simply creating `VARIABLES` in order to read it in).
Re^2: How would you do this? by JavaFan (Canon) on Jun 13, 2010 at 16:10 UTC
"First, you didn't declare $cmd. That should be `my $cmd =`."

Uhm, no. This is Perl, remember. You don't have to declare anything. Besides, the OP was showing a code snippet. He might as well have your precious "my $cmd" somewhere in the code he (rightfully so) did not post.

"Second, `zcat *gz`: What do you want it to do? Did you have an option in mind? More information there would be helpful."

First, you misquote. It's `zcat *.gz`. Second, it's quite obvious that he wants to run `zcat` on all the gzipped files in the current directory. No need to further explain that.

"Third, since you're running system commands, you'll need to use perl's 'system' command."

That's neither correct, nor is your remark useful. `system` is not the only way to run external commands (backticks, `exec` and `open` do as well), but the OP is using `system`.

"The print command would come after that."

Really? The OP is just printing the command to be executed. There's no need to first execute the command and then print it.

`my $cmd = system("zcat -l /path/to/.gz/files"); print Dumper($cmd);`

Now, that's a program that does little of use. Unlike the code of the OP, which actually does something with the content of the uncompressed file, yours just dumps it to the screen. And then you use `Data::Dumper` to print out the value of an integer. I'd say all your suggestions are utter rubbish and worthless.
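The point about `system` returning an integer rather than the command's output can be illustrated with a small sketch. It is only an illustration under stated assumptions: `$^X` (the running perl) stands in for an external command such as `zcat` so that the example runs without any gzipped files, and the child's one-line output is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: $^X (the path of the running perl) stands in for an
# external command such as zcat.
my @cmd = ( $^X, '-e', 'print "line from child\n"' );

# system() runs the command and returns its wait status, not its output.
# The child's output goes straight to our STDOUT.
my $status    = system(@cmd);
my $exit_code = $status >> 8;    # the exit code lives in the high byte

# A piped open (like backticks) captures the child's output instead.
open( my $fh, '-|', @cmd ) or die "Cannot run child: $!\n";
my @lines = <$fh>;
close $fh or die "Child failed: status $?\n";

print "system() returned $status (exit code $exit_code)\n";
print "captured: $_" for @lines;
```

Dumping the return value of `system` therefore shows a status number, while the text the command produced has already gone to the screen.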
Re: How would you do this? by david_lyon (Sexton) on Jun 13, 2010 at 13:27 UTC
Bows... Thanks for everyone's help. Just to add a bit more information: I am downloading compressed files that have the format shown below, and parsing each of those files like this:

`$cmd = "zcat *.gz \| grep -P \'\\!dataset\' \| grep -P \'\\=\' \| perl -pi -e s\'\/^\\!dataset\\_\|\\=\.\/\/g\' | sort | uniq >VARIABLES"; print "$cmd\n"; system($cmd);`

Bows. Thanks a lot again.

The contents of VARIABLES would be everything between "!dataset_" and " = .*".

Contents of compressed files:

^DATABASE = Geo
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/projects/geo
!Database_email = geo@ncbi.nlm.nih.gov
!Database_ref = Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6
^DATASET = GDS100
!dataset_title = UV exposure time course (ecoli_8.0)
!dataset_description = Time course of UV-responsive genes and their role in cellular recovery. lexA SOS-deficient strains analyzed.
!dataset_type = gene expression array-based
!dataset_pubmed_id = 11333217
!dataset_platform = GPL18
!dataset_platform_organism = Escherichia coli
!dataset_platform_technology_type = spotted DNA/cDNA
!dataset_feature_count = 5764
!dataset_sample_organism = Escherichia coli
!dataset_sample_type = RNA
!dataset_channel_count = 2
!dataset_sample_count = 8
!dataset_value_type = log ratio
!dataset_reference_series = GSE9
!dataset_order = none
!dataset_update_date = Apr 06 2003
^SUBSET = GDS100_1
!subset_dataset_id = GDS100
!subset_description = irradiated
!subset_sample_id = GSM544,GSM545,GSM546,GSM547,GSM548
!subset_type = protocol
^SUBSET = GDS100_2
!subset_dataset_id = GDS100
!subset_description = not irradiated
!subset_sample_id = GSM542,GSM543,GSM549
!subset_type = protocol
^SUBSET = GDS100_3
!subset_dataset_id = GDS100
!subset_description = 5 minute
!subset_sample_id = GSM547
!subset_type = time
^SUBSET = GDS100_4
!subset_dataset_id = GDS100
!subset_description = 10 minute
!subset_sample_id = GSM544
!subset_type = time
^SUBSET = GDS100_5
!subset_dataset_id = GDS100
!subset_description = 20 minute
!subset_sample_id = GSM545,GSM542
!subset_type = time
^SUBSET = GDS100_6
!subset_dataset_id = GDS100
!subset_description = 40 minute
!subset_sample_id = GSM546
!subset_type = time
^SUBSET = GDS100_7
!subset_dataset_id = GDS100
!subset_description = 60 minute
!subset_sample_id = GSM548,GSM543
!subset_type = time
^SUBSET = GDS100_8
!subset_dataset_id = GDS100
!subset_description = 0 minute
!subset_sample_id = GSM549
!subset_type = time
^DATASET = GDS100
#ID_REF = Platform reference identifier
#IDENTIFIER = identifier
#GSM549 = Value for GSM549: lexA vs. wt, before UV treatment, MG1655; src: 0' wt, before UV treatment, 25 ug total RNA, 2 ug pdN6; src: 0' lexA, before UV 25 ug total RNA, 2 ug pdN6
#GSM542 = Value for GSM542: lexA 20' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM543 = Value for GSM543: lexA 60' after NOuv vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after NOuv, 25 ug total RNA, 2 ug pdN6
#GSM547 = Value for GSM547: lexA 5' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 5 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM544 = Value for GSM544: lexA 10' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 10 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM545 = Value for GSM545: lexA 20' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 20 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM546 = Value for GSM546: lexA 40' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 40 min after UV treatment, 25 ug total RNA, 2 ug pdN6
#GSM548 = Value for GSM548: lexA 60' after UV vs. 0', MG1655; src: 0', before UV treatment, 25 ug total RNA, 2 ug pdN6; src: lexA 60 min after UV treatment, 25 ug total RNA, 2 ug pdN6
!dataset_table_begin
ID_REF  IDENTIFIER  GSM549  GSM542  GSM543  GSM547  GSM544  GSM545  GSM546  GSM548
1       EMPTY       0.211   0.240   0.306   0.098   0.101   0.208   0.167   0.190
2       EMPTY       0.045   0.097   0.142   0.107   0.074   0.202   0.019   0.266
3       EMPTY       0.191   0.243   0.312   0.023   0.158   0.261   0.255   0.128
4       EMPTY       -0.013  -0.041  0.112   -0.028  0.175   0.111   0.139   0.137
etc
Re^2: How would you do this? by BrowserUk (Patriarch) on Jun 13, 2010 at 13:34 UTC
Untested, but this might get you close: `#! perl -slw use strict; open ZCAT, '|-', 'zcat *.gz' or die; my %uniq; while( <ZCAT> ) { my( $varname ) = m[!dataset_([^=]+)=] or next; $uniq{ $varname } = 1; } close ZCAT; print "uniq names sorted"; print for sort keys %uniq;`

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy
Re^3: How would you do this? by david_lyon (Sexton) on Jun 13, 2010 at 15:47 UTC
Thanks BrowserUk. The parsing works nicely... thanks. Only one issue: how do I stop this line from spitting out the contents of the files to the terminal, or is it something in my setup? It doesn't pass the data to ZCAT; it just spits it out to the terminal. `open ZCAT, '|-', 'zcat *.gz' or die;` Thanks once again.
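A likely explanation for the symptom described above is the direction of the pipe: in Perl's three-argument `open`, the mode `'|-'` opens the command for writing (the filehandle feeds the command's STDIN, and the command's output goes wherever STDOUT points), whereas `'-|'` opens it for reading. The sketch below shows the read direction. It is an illustration only: `$^X` stands in for `zcat *.gz`, the two sample lines are made up, and the regex is a slightly tightened variant of the one posted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: a child perl printing two made-up lines stands in for
# "zcat *.gz" so the sketch runs without any gzipped files.
my @producer = ( $^X, '-e',
    'print "!dataset_title = demo\n", "!subset_type = time\n"' );

# '-|' : the command's STDOUT is connected to our filehandle (we read).
# '|-' : our filehandle is connected to the command's STDIN (we write),
#        and the command's own output goes to our STDOUT.
open( my $zcat, '-|', @producer ) or die "Cannot start producer: $!\n";

my %uniq;
while ( my $line = <$zcat> ) {
    # Keep only the name between "!dataset_" and the "=" sign.
    my ($varname) = $line =~ m[!dataset_([^=\s]+)\s*=] or next;
    $uniq{$varname} = 1;
}
close $zcat or die "Producer failed: status $?\n";

my @names = sort keys %uniq;
print "$_\n" for @names;
```

With the real command, `open( my $zcat, '-|', 'zcat *.gz' )` would pass the single string to the shell, so the `*.gz` glob is expanded there.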
Re^4: How would you do this? by BrowserUk (Patriarch) on Jun 13, 2010 at 15:51 UTC
Re^5: How would you do this? by david_lyon (Sexton) on Jun 13, 2010 at 16:24 UTC