in reply to Removing white space from the file
G'day GSperlbio,
"i need to remove all the white spaces and numbers and retain only the characters to the same file without writing to another file.."
Before embarking on destructive modifications, I'd recommend that you make a backup of the original data. You can then use the backup as a read-only source and continually and safely overwrite the original. The basic operations would look something like this:
copy uuu.txt to backup_uuu.txt
read data from backup_uuu.txt
modify data
write to uuu.txt
check uuu.txt
if uuu.txt looks good:
    modifications done!
else:
    read data from backup_uuu.txt
    modify data (in some improved way)
    write to uuu.txt
    check uuu.txt
    ... (repeating until uuu.txt looks good)
delete backup_uuu.txt (or keep for historical purposes)
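As a rough illustration only, one pass of those steps might look like the following in Perl. The sample data, the scratch directory and the helper name rewrite_from_backup are my own assumptions for the demonstration; the "modify data" step uses the whitelist substitution discussed further down, which also removes newlines and so produces a single record.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Assumption: a scratch directory and made-up sample data stand in for the real uuu.txt.
my $dir      = tempdir(CLEANUP => 1);
my $original = "$dir/uuu.txt";
my $backup   = "$dir/backup_uuu.txt";

open my $seed_fh, '>', $original or die "Can't write '$original': $!";
print {$seed_fh} "1 TCCAAGGATA GAGGGCTTTT\n61 CAAGTCTTTC\n";
close $seed_fh or die "Can't close '$original': $!";

# Step 1: copy the original to a backup.
copy($original, $backup) or die "Can't copy '$original' to '$backup': $!";

# Steps 2-4: read from the backup, modify each line, write to the original.
# (rewrite_from_backup is a hypothetical helper name.)
sub rewrite_from_backup {
    my ($from, $to) = @_;
    open my $in_fh,  '<', $from or die "Can't read '$from': $!";
    open my $out_fh, '>', $to   or die "Can't write '$to': $!";
    while (my $line = <$in_fh>) {
        $line =~ s/[^ACGT]//g;    # keep only A, C, G and T; newlines are removed too
        print {$out_fh} $line;
    }
    close $in_fh;
    close $out_fh or die "Can't close '$to': $!";
    return;
}

rewrite_from_backup($backup, $original);

# Step 5: check the result by eye before deleting the backup.
open my $check_fh, '<', $original or die "Can't read '$original': $!";
my $result = do { local $/; <$check_fh> };
close $check_fh;
print "$result\n";
```

If the output doesn't look right, adjust the modification and call rewrite_from_backup again: the backup is only ever read, so the original can be overwritten as often as needed.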
Next you need to answer some questions about the data itself. Because you've marked up the data in plain HTML, we can't tell if uuu.txt contains a single record:
1 TCCAAGGATA ... 61 GAGGGCTTTT ... 121 CAAGTCTTTC ...
or multiple records, e.g.
1 TCCAAGGATA ...
61 GAGGGCTTTT ...
121 CAAGTCTTTC ...
[For this reason, please always mark up your data within <code>...</code> tags (as you've done with the code itself).]
And, as a logical extension to this, should your output be a single record or multiple records?
Given your data appears to be sequences of nucleotide bases (interspersed with positional numbers), it could potentially be very large. This may well affect the appropriateness of any given solution. What sort of size is uuu.txt?
Is uuu.txt just a single file you need to deal with or is it an example of one of many files?
"i have tried this code to remove only the white space but i didnt get the expected result can anyone help me to improve this??"
You haven't shown the result you got nor the result you expected. This makes suggesting improvements somewhat tricky.
However, having said that, it's clear from the code you've posted that you haven't really understood what the open function does. In brief:
Take a look at "perlintro: Files and I/O" for the very basics; then follow the links in that section for more details.
On to solutions:
An appropriate solution will depend very much on how you answered the earlier questions regarding your data.
When formulating a character class (see "perlrecharclass: Bracketed Character Classes"), consider a negated whitelist rather than attempting to generate a blacklist. If you're working with just DNA, you only want to keep [ACGT]: in other words, you want to remove everything which matches [^ACGT]. That way, there's no need to enumerate every kind of whitespace (newlines, carriage returns, spaces, tabs, and so on) or the digits. [Use [^ACGU] for RNA or [^ACGTU] for both DNA and RNA.]
For a one-off solution (with a smallish file), ++vinoth.ree's one-liner may well be appropriate; although, using the whitelist:
s/[^ACGT]//g
For a very large file, you may find transliteration is more efficient than regex substitution:
y/ACGT//cd
[See "perlperf: Perl Performance and Optimization Techniques: Search and replace or tr" for details (note: y and tr are synonyms) and Benchmark to check for yourself.]
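If you'd like to try that comparison, here's a minimal sketch using the core Benchmark module. The generated test string and the sub names strip_with_regex and strip_with_tr are assumptions of mine; real timings will depend on your data and your Perl, so I haven't quoted any figures.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: made-up data roughly shaped like numbered sequence lines.
my $raw = join '',
    map { sprintf "%9d %s\n", $_ * 60 + 1, 'TCCAAGGATA GAGGGCTTTT ' x 3 } 0 .. 999;

# Both subs work on a copy and return only the A, C, G and T characters.
sub strip_with_regex { my $s = shift; $s =~ s/[^ACGT]//g; return $s }
sub strip_with_tr    { my $s = shift; $s =~ tr/ACGT//cd;  return $s }

# Sanity check: the two approaches should agree before we time them.
die "Results differ\n" unless strip_with_regex($raw) eq strip_with_tr($raw);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { strip_with_regex($raw) },
    tr    => sub { strip_with_tr($raw) },
});
```

The /c modifier complements the search list and /d deletes the matched characters, so y/ACGT//cd removes everything that isn't A, C, G or T.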
For multiple files, an actual script would be better. The code to make the changes would be much the same. You'll need to do the I/O yourself: see earlier notes about this.
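A possible shape for such a script is sketched below. It assumes the filenames are given on the command line, that a ".bak" suffix is acceptable for the backups, and that you want a single record as output (newlines are removed along with everything else); clean_file is a hypothetical name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: back up $path, then rewrite it keeping only A, C, G and T.
sub clean_file {
    my ($path) = @_;
    my $backup = "$path.bak";    # assumed naming scheme for the backup

    copy($path, $backup) or die "Can't copy '$path' to '$backup': $!";

    # Read from the backup; overwrite the original.
    open my $in_fh,  '<', $backup or die "Can't read '$backup': $!";
    open my $out_fh, '>', $path   or die "Can't write '$path': $!";
    while (my $line = <$in_fh>) {
        $line =~ tr/ACGT//cd;    # delete every character that is not A, C, G or T
        print {$out_fh} $line;
    }
    close $in_fh;
    close $out_fh or die "Can't close '$path': $!";

    return $backup;
}

# Process every file named on the command line.
clean_file($_) for @ARGV;
```

Saved under a name of your choosing (say, clean_bases.pl), it could be run as: perl clean_bases.pl uuu.txt vvv.txt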
Another solution, which I suspect is probably inappropriate, but mentioned for completeness, is Tie::File. This effectively allows direct editing of disk files but, given the data I envisage you're working with, is likely to be horribly slow.
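For completeness, a Tie::File version might look something like this. The temporary file and its contents are made up for the demonstration. Note that Tie::File works record by record, so this sketch keeps one cleaned record per line rather than joining everything into one.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Assumption: a temporary file with made-up data stands in for uuu.txt.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "1 TCCAAGGATA GAGGGCTTTT\n61 CAAGTCTTTC AAGCTTGGCA\n";
close $tmp_fh or die "Can't close '$path': $!";

# Each element of @lines is one record of the file (without its line ending);
# assigning to an element changes the file on disk.
tie my @lines, 'Tie::File', $path or die "Can't tie '$path': $!";
tr/ACGT//cd for @lines;    # edit every record in place
untie @lines;

# Read the file back to see what was written.
open my $in_fh, '<', $path or die "Can't read '$path': $!";
my $content = do { local $/; <$in_fh> };
close $in_fh;
print $content;
```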
That may be sufficient information for you to complete your task. Feel free to ask if further information or general help is required.
— Ken