Data format - delete parts of string and replace strings that match characters partly with numbers.

Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

The dataset I just received is horribly formatted, and my limited Perl knowledge is not enough to do what I would like, so I am asking you monks for aid. The dataset has five columns: sample ID, mother ID, father ID, sex, and attribute. The first problem is that some of the IDs contain two or three different IDs in one observation, since it was uncertain which was the true one. Thus, the usual ID of 1293 might be 1293&1295 or 1293&1295&1305. For these, I would like Perl to go through and, if it finds a "&" symbol, delete it along with everything after it in that ID, leaving only the first ID. So far I have only found how to delete the whole line if it matches, not just part of the line:

$ perl -ni -e 'print unless /&/' filename

The second problem is that the attribute column needs to be either 0 for missing, 1 for HCR, or 2 for LCR. Right now the format is either 13HCR-NIH-0 or 13LCR-NIH-0, where the numbers are arbitrary. What I would like Perl to do is, if it detects the string "HCR" in column 5 of a line, change the entire string to 1, and likewise change "LCR" to 2. For this I have tried using find and replace:

$ perl -p -i.bak -e 's/13HCR-NIH-0/1/g' filename

but this is way too time consuming as there are too many permutations of the digits.

Thank you for any advice/help.

Replies are listed 'Best First'.
Re: Data format - delete parts of string and replace strings that match characters partly with numbers. by BrowserUk (Patriarch) on Apr 01, 2011 at 17:53 UTC
For the first part, this should do the job: `perl -pe"s[&[0-9&]+][]g" infile >outfile`

For the second part, this might work (it assumes the file is comma-delimited with the attribute in the last column, and needs a minor change if the file is space- or tab-delimited): `perl -pe"s[,[^,]*HCR.*$][,1]; s[,[^,]*LCR.*$][,2]" infile >outfile`
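To see what that approach does to individual records, here is a minimal sketch that applies the same idea inside a small script. The helper name `clean_line` and the sample records are made up for illustration, and it assumes comma-delimited input with the attribute in the last field.

```perl
use strict;
use warnings;

# Hypothetical helper: apply both substitutions to one comma-delimited record.
sub clean_line {
    my ($line) = @_;
    # Drop every "&ID" continuation, keeping only the first ID in each field.
    $line =~ s/&[0-9&]+//g;
    # Recode the last field: anything containing HCR becomes 1, LCR becomes 2.
    # [^,\n]* keeps a trailing newline (if any) out of the match.
    $line =~ s/,[^,]*HCR[^,\n]*$/,1/;
    $line =~ s/,[^,]*LCR[^,\n]*$/,2/;
    return $line;
}

# Made-up sample records, for illustration only.
for my $line ('1293&1295&1305,2001,3001,1,13HCR-NIH-0',
              '1400,2002&2003,3002,2,7LCR-NIH-0') {
    print clean_line($line), "\n";
}
```

Note that this sketch leaves a record untouched when its last field contains neither HCR nor LCR, so the "0 for missing" case would still need its own rule.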
Re: Data format - delete parts of string and replace strings that match characters partly with numbers. by jellisii2 (Hermit) on Apr 01, 2011 at 17:53 UTC
> Thus, the usual ID of 1293 might be 1293&1295 or 1293&1295&1305. Thus, for this, I would like perl to go through and if it finds a "&" symbol, delete it along with all other strings after it, therefore only leaving the first ID.

`print( (split /&/, $string)[0] );`

> The second problem is that for the attribute column, it needs to be either 0 for missing, 1 for HCR, or 2 for LCR. Right now the format is either 13HCR-NIH-0 or 13LCR-NIH-0. The numbers in there are arbitrary. What I would like perl to do is if it detects the string "HCR" in a line in column 5, change the entire string to 1, and same for "LCR" and 2. For this I have tried using the find and replace

`if ($att =~ /HCR/) { $var = 1; } elsif ($att =~ /LCR/) { $var = 2; } else { $var = 0; }`

Unless there's a reason for doing one-liners (golfing, for example), I try my best to avoid them.
Re: Data format - delete parts of string and replace strings that match characters partly with numbers. by locked_user sundialsvc4 (Abbot) on Apr 01, 2011 at 18:21 UTC
One possibility, for a file like that, is to `split` the data into columns, work with the columns, and then `join` them back into a record. That way you can deal with the individual pieces using regular expressions that are less intimidating (and less fragile). Your code should be bristling with data-integrity checks. Explicitly check that the array, after splitting, always contains exactly five elements, and, if there are any other “assertions” that you can make about what “a not-munged record in this file should look like,” you should add code in your program to explicitly check those, too. If it stumbles into any “none of the above ... this should not be happening ...” cases, it should `die` or (Carp) `croak`. In this way, if the program runs to completion as expected, it means something useful: it means not only that the program did what it was supposed to do, but that the incoming file is actually good, and/or that all of your assumptions about what it actually contains were (so far...) correct assumptions.
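A minimal sketch of that split / check / join approach might look like the following. The names `clean_record` and `$SEP` are made up for illustration. The comma delimiter, the numeric-ID check, and the rule that an empty or `0` attribute means "missing" are all assumptions, not facts about the actual file.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Assumption: fields are comma-delimited; change $SEP for tabs or spaces.
my $SEP = ',';

# Hypothetical helper: validate and clean one five-column record.
sub clean_record {
    my ($line) = @_;
    chomp $line;

    # Limit of -1 keeps trailing empty fields so the count check is honest.
    my @f = split /\Q$SEP\E/, $line, -1;
    croak "expected 5 fields, got " . scalar(@f) . " in '$line'"
        unless @f == 5;

    # Columns 1-3 are IDs: keep only the part before the first "&".
    for my $id (@f[0 .. 2]) {
        $id =~ s/&.*//;
        # Assumption: a clean ID is purely numeric.
        croak "unexpected ID '$id' in '$line'" unless $id =~ /^\d+$/;
    }

    # Column 5 is the attribute: recode to 1 (HCR), 2 (LCR) or 0 (missing).
    if    ($f[4] =~ /HCR/)             { $f[4] = 1 }
    elsif ($f[4] =~ /LCR/)             { $f[4] = 2 }
    elsif ($f[4] eq '' or $f[4] eq '0') { $f[4] = 0 }   # assumed "missing"
    else  { croak "unexpected attribute '$f[4]' in '$line'" }

    return join $SEP, @f;
}

# Made-up sample records, for illustration only.
for my $line ('1293&1295,2001,3001&3002&3003,1,13HCR-NIH-0',
              '1500,0,0,1,') {
    print clean_record($line), "\n";
}
```

To run it over a real file, the sample loop could be swapped for something like `while (my $line = <>) { print clean_record($line), "\n" }`, so that the first record breaking an assumption stops the run with a message naming the offending line.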