GSperlbio has asked for the wisdom of the Perl Monks concerning the following question:
I have a file named uuu.txt, and its contents are a mixture of strings, numbers and white space. For example: 1 TCCAAGGATA AGTATGTAAA TACGGGGCGG GCTCTGGGAG GGGAGAGACT TTACAAAAAT 61 GAGGGCTTTT ATTTTCCATT TGGAACGTGG GACAACAGAC CACAACGCAA TTCCATTTTG 121 CAAGTCTTTC CAAGGGAGAA GCTGTTCAAC CACCCGTTTG GGGGATGAGT GAGCCGACAC
i need to remove all the white spaces and numbers and retain only the characters to the same file without writing to another file..
i have tried this code to remove only the white space but i didnt get the expected result can anyone help me to improve this??
    #!/usr/bin/perl
    use strict;
    use warnings;
    open(my $fh, ">> uuu.txt") || die "cant open the file";
    chomp $fh;
    $fh =~ s/\s//g;
    close $fh;
Replies are listed 'Best First'.
Re: Removing white space from the file
by vinoth.ree (Monsignor) on Aug 12, 2015 at 06:01 UTC
You can do it on the command line itself:
    perl -p -i -e "s/[0-9\s]//g" uuu.txt
Options:
-p processes, then prints <> line by line
-i activates in-place editing.
The regex substitution acts on the implicit variable $_, which holds the contents of the file one line at a time.
Read more in perlrun about each option.
All is well. I learn by answering your questions...
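To see what that substitution does to one line of the sample data, here is a minimal sketch that applies it to a string in memory rather than to the file. The variable names are only for illustration. Note that \s also matches the trailing newline, so the one-liner joins every line of the file into a single line.

```perl
use strict;
use warnings;

# One line of the sample data, including its trailing newline.
my $line = "1 TCCAAGGATA AGTATGTAAA TACGGGGCGG\n";

# Same substitution as in the one-liner: delete digits and all whitespace.
(my $cleaned = $line) =~ s/[0-9\s]//g;

print "$cleaned\n";
```

If the line breaks should survive, a character class of [0-9 \t] instead of [0-9\s] would leave the newlines alone.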
Re: Removing white space from the file
by kcott (Archbishop) on Aug 12, 2015 at 10:20 UTC
G'day GSperlbio,
"i need to remove all the white spaces and numbers and retain only the characters to the same file without writing to another file.."
Before embarking on destructive modifications, I'd recommend that you make a backup of the original data. You can then use the backup as a read-only source and continually and safely overwrite the original. The basic operations would look something like this:
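A minimal sketch of those operations, under a few assumptions: the backup gets a ".bak" suffix, the sub name clean_from_backup is made up for this example, and the substitution is only a placeholder for whatever cleaning you settle on.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Copy $file to "$file.bak" once, then rebuild $file from the backup.
# The ".bak" suffix is an assumption made for this sketch.
sub clean_from_backup {
    my ($file) = @_;
    my $backup = "$file.bak";

    # Keep an existing backup, so reruns never overwrite the original data.
    unless (-e $backup) {
        copy($file, $backup) or die "Cannot copy '$file' to '$backup': $!";
    }

    open my $in,  '<', $backup or die "Cannot read '$backup': $!";
    open my $out, '>', $file   or die "Cannot write '$file': $!";

    while (my $line = <$in>) {
        $line =~ s/[0-9\s]//g;    # placeholder: delete digits and whitespace
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close '$file': $!";
    return $backup;
}

# Usage:
#   clean_from_backup('uuu.txt');
```

Because the backup is only ever read, the script can be rerun with a different substitution as often as needed.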
Next you need to answer some questions about the data itself. Because you've marked up the data in plain HTML, we can't tell if uuu.txt contains a single record:
or multiple records, e.g.
[For this reason, please always mark up your data within <code>...</code> tags (as you've done with the code itself).]
And, as a logical extension to this, should your output be a single record or multiple records?
Given your data appears to be sequences of nucleotide bases (interspersed with positional numbers), it could potentially be very large. This may well affect the appropriateness of any given solution. What sort of size is uuu.txt? Is uuu.txt just a single file you need to deal with or is it an example of one of many files?
"i have tried this code to remove only the white space but i didnt get the expected result can anyone help me to improve this??"
You haven't shown the result you got nor the result you expected. This makes suggesting improvements somewhat tricky. However, having said that, it's clear from the code you've posted that you haven't really understood what the open function does. In brief:
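As a rough illustration (a sketch, with the helper name read_lines invented for this example): open gives you a filehandle, and the mode you pass decides what that handle can do. The file's data only shows up when you read from the handle.

```perl
use strict;
use warnings;

# open() returns a filehandle, not the file's contents.
# The mode decides what the handle can do:
#   '<'   read an existing file
#   '>'   write, truncating the file first
#   '>>'  write, appending to the end (what the posted code asked for)
sub read_lines {
    my ($file) = @_;

    open my $fh, '<', $file or die "Cannot open '$file' for reading: $!";
    my @lines = <$fh>;    # the data arrives only when you read from the handle
    close $fh;

    chomp @lines;         # chomp the strings you read, not the handle
    return @lines;
}

# Usage:
#   my @lines = read_lines('uuu.txt');
```

The three-argument form of open, together with $! in the error message, makes failures much easier to diagnose.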
Take a look at "perlintro: Files and I/O" for the very basics; then follow the links in that section for more details.
On to solutions: An appropriate solution will depend very much on how you answered the earlier questions regarding your data.
When formulating a character class (see "perlrecharclass: Bracketed Character Classes"), consider a negated whitelist rather than attempting to generate a blacklist. If you're working with just DNA, you only want to keep [ACGT]: in other words, you want to remove everything which matches [^ACGT] (no need to worry about whitespace matching newlines, carriage returns, spaces, tabs, and so on). (Use [^ACGU] for RNA, or [^ACGTU] for both DNA and RNA.)
For a one-off solution (with a smallish file), ++vinoth.ree's one-liner may well be appropriate; although, using the whitelist:
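A small sketch of the whitelist idea, applied to a string so the effect is easy to see. The sub name keep_bases is made up for this example, and it assumes upper-case DNA only.

```perl
use strict;
use warnings;

# Delete every character that is not an upper-case DNA base.
sub keep_bases {
    my ($text) = @_;
    $text =~ s/[^ACGT]//g;
    return $text;
}

# Digits, spaces, the carriage return and the newline all go in one pass.
print keep_bases("61 GAGGGCTTTT ATTTTCCATT\r\n"), "\n";

# As an in-place one-liner (back up the file first):
#   perl -pi -e 's/[^ACGT]//g' uuu.txt
```

The single quotes suit a Unix shell; on Windows cmd.exe the expression needs double quotes instead.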
For a very large file, you may find transliteration is more efficient than regex substitution:
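A sketch of the transliteration variant next to the regex one, so the two can be compared. The sub names are invented here, and the timing code is left commented out because the numbers depend entirely on your machine and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# c complements the search list (everything except A, C, G and T);
# d deletes those characters, because the replacement list is empty.
sub keep_bases_tr {
    my ($text) = @_;
    $text =~ tr/ACGT//cd;
    return $text;
}

# The regex equivalent, for comparison.
sub keep_bases_regex {
    my ($text) = @_;
    $text =~ s/[^ACGT]//g;
    return $text;
}

print keep_bases_tr("121 CAAGTCTTTC CAAGGGAGAA\n"), "\n";

# Uncomment to time both versions yourself:
# my $sample = "121 CAAGTCTTTC CAAGGGAGAA GCTGTTCAAC\n" x 10_000;
# cmpthese(-2, {
#     tr    => sub { keep_bases_tr($sample) },
#     regex => sub { keep_bases_regex($sample) },
# });
```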
[See "perlperf: Perl Performance and Optimization Techniques: Search and replace or tr" for details (note: y and tr are synonyms) and Benchmark to check for yourself.]
For multiple files, an actual script would be better. The code to make the changes would be much the same. You'll need to do the I/O yourself: see earlier notes about this.
Another solution, which I suspect is probably inappropriate, but mentioned for completeness, is Tie::File. This effectively allows direct editing of disk files but, given the data I envisage you're working with, is likely to be horribly slow.
That may be sufficient information for you to complete your task. Feel free to ask if further information or general help is required.
-- Ken
Re: Removing white space from the file
by Monk::Thomas (Friar) on Aug 12, 2015 at 07:59 UTC
You cannot apply a substitution to a file handle. (Well, you can, but it does not modify the actual file content.) Another problem is:
    open(my $fh, ">> uuu.txt") || die "cant open the file";
...which would open the file for appending, but NOT for reading.
You did not specify whether you want an in-place edit of the file or whether you just want to convert the content for further processing. (For an in-place edit I would actually prefer to use something like sed -i~ 's/[0-9 ]//g' uuu.txt instead of perl, because I get a backup copy of the original file for free.)
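To make the first point concrete, here is a small sketch using a throwaway temporary file as a stand-in for uuu.txt. It shows what a filehandle looks like when treated as a string, and that the data only appears once a line is read from it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file standing in for uuu.txt.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} "1 TCCA AGGA\n";
close $tmp;

open my $fh, '<', $name or die "Cannot open '$name': $!";

# In string context the handle is only a label of the form GLOB(0x...),
# so a regex applied to $fh never sees the file's data.
my $label = "$fh";

# The data arrives only when a line is read from the handle.
my $line = <$fh>;
close $fh;
$line =~ s/[0-9\s]//g;

print "handle: $label\n";
print "data:   $line\n";
```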
Re: Removing white space from the file
by Athanasius (Archbishop) on Aug 12, 2015 at 08:06 UTC
But note that this will not remove newlines. Hope that helps,
Re: Removing white space from the file
by Laurent_R (Canon) on Aug 12, 2015 at 08:00 UTC
So you would need something like this:
However, this being Perl, you have some command-line options which will do the boilerplate code (opening and closing the files, reading input line by line, and file renaming) for you behind the scenes, so that this one-liner (or the almost identical one proposed by vinoth.ree) will do everything for you:
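A rough sketch of that boilerplate, with the equivalent one-liner in the closing comment. The sub name strip_in_place and the ".tmp" suffix are assumptions made for this example.

```perl
use strict;
use warnings;

# Read $file line by line, write the cleaned lines to a temporary file,
# then rename the temporary file over the original.
sub strip_in_place {
    my ($file) = @_;
    my $tmp = "$file.tmp";    # hypothetical temporary name

    open my $in,  '<', $file or die "Cannot open '$file': $!";
    open my $out, '>', $tmp  or die "Cannot open '$tmp': $!";

    while (my $line = <$in>) {
        $line =~ s/[0-9\s]//g;    # delete digits and all whitespace
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close '$tmp': $!";

    rename $tmp, $file or die "Cannot rename '$tmp' to '$file': $!";
    return;
}

# The -p and -i switches do roughly the same work for you:
#   perl -pi -e 's/[0-9\s]//g' uuu.txt
```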
Re: Removing white space from the file
by marinersk (Priest) on Aug 12, 2015 at 12:34 UTC
While there's lots of good information in the previous answers about how to go about doing this, I tried to modify your script as little as possible. Notes follow, as the changes needed were fairly extensive.
This adjusts your script to remove the whitespace, but has not yet added code to remove the numerical characters. I think you can probably handle that on your own, based on the code you've already generated.
Notes
Good luck!
Re: Removing white space from the file
by anonymized user 468275 (Curate) on Aug 12, 2015 at 11:50 UTC
1) (assuming you have very limited disk space, but enough memory to read in the file)
- read the file in, remove spaces in memory, then write it back
2) (not enough memory, but say a 20% margin of temporary disk space)
- for each line read in, remove spaces and write to a compressed file, perhaps on /tmp if it has the space
- uncompress the new file over the old
- remove the compressed file
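Option 1 can be sketched roughly as follows, assuming the whole file fits in memory. The sub name clean_in_memory is made up for this example, and the sketch also deletes the digits, since the original question asked for that too.

```perl
use strict;
use warnings;

# Read the whole file into memory, clean it there, then write it back.
sub clean_in_memory {
    my ($file) = @_;

    open my $in, '<', $file or die "Cannot open '$file' for reading: $!";
    my $data = do { local $/; <$in> };    # undef $/ reads everything at once
    close $in;

    # Delete digits, spaces, tabs, carriage returns and newlines.
    $data =~ tr/0-9 \t\r\n//d;

    # Opening with '>' truncates the file, so nothing is safe on disk
    # between this point and the successful close.
    open my $out, '>', $file or die "Cannot open '$file' for writing: $!";
    print {$out} $data;
    close $out or die "Cannot close '$file': $!";

    return length $data;
}

# Usage:
#   my $kept = clean_in_memory('uuu.txt');
```

No second copy ever exists on disk, which is the point of this option, but it also means an interrupted run can lose the data.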
Although if there is no such resource problem, better say what the reason is!
One world, one people
Re: Removing white space from the file
by crusty_collins (Friar) on Aug 12, 2015 at 14:04 UTC