Removing white space from the file

GSperlbio has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Removing white space from the file
by vinoth.ree (Monsignor) on Aug 12, 2015 at 06:01 UTC

You can do it on the command line itself:

perl -p -i -e "s/[0-9\s]//g" uuu.txt

Options:
    -p wraps the code in a loop that reads the input (<>) line by line and prints each line after processing.
    -i activates in-place editing.
    The regex substitution acts on the implicit variable $_, which holds the current line of the file.

See perlrun for details on each option.
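
For readers who prefer a script, the one-liner can be spelled out roughly as follows. This is only a sketch: the sub name strip_in_place and the '.bak' backup extension are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: edit one file in place, keeping a ".bak" copy.
sub strip_in_place {
    my ($file) = @_;
    # Setting $^I switches on in-place editing for the <> loop (like -i.bak);
    # local restores both variables when the sub returns.
    local ($^I, @ARGV) = ('.bak', $file);
    while (<>) {
        s/[0-9\s]//g;   # drop digits and all whitespace, newlines included
        print;          # while $^I is set, print goes to the rewritten file
    }
    return;
}

# Run on the file named on the command line, if any.
strip_in_place($ARGV[0]) if @ARGV;
```

Because \s also matches the newline, the result is a single long line, exactly as with the one-liner.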

All is well. I learn by answering your questions...


Re: Removing white space from the file
by kcott (Archbishop) on Aug 12, 2015 at 10:20 UTC

G'day GSperlbio,

"i need to remove all the white spaces and numbers and retain only the characters to the same file without writing to another file.."

Before embarking on destructive modifications, I'd recommend that you make a backup of the original data. You can then use the backup as a read-only source and continually and safely overwrite the original. The basic operations would look something like this:

copy uuu.txt to backup_uuu.txt

read data from backup_uuu.txt
modify data
write to uuu.txt

check uuu.txt

if uuu.txt looks good: modifications done!

else:
read data from backup_uuu.txt
modify data (in some improved way)
write to uuu.txt

check uuu.txt

... (repeating until uuu.txt looks good)

delete backup_uuu.txt (or keep for historical purposes)
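
The outline above might look something like this in Perl. It is a minimal sketch, not a definitive implementation: the sub name clean_from_backup is made up, and the substitution shown is just one candidate for the "modify data" step.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: rebuild $orig from a read-only backup copy.
sub clean_from_backup {
    my ($orig, $backup) = @_;
    # Create the backup only once, so reruns keep reading the pristine data.
    unless (-e $backup) {
        copy($orig, $backup) or die "Cannot copy '$orig' to '$backup': $!";
    }
    open my $in,  '<', $backup or die "Cannot read '$backup': $!";
    open my $out, '>', $orig   or die "Cannot write '$orig': $!";
    while (my $line = <$in>) {
        $line =~ s/[\d\s]//g;   # the "modify data" step; adjust as needed
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close '$orig': $!";
    return;
}

# Usage: perl script.pl uuu.txt backup_uuu.txt
clean_from_backup(@ARGV) if @ARGV == 2;
```

If the result does not look right, change the substitution and run it again; the backup is never written to after the first run.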

Next you need to answer some questions about the data itself. Because you've marked up the data in plain HTML, we can't tell if uuu.txt contains a single record:

1 TCCAAGGATA ... 61 GAGGGCTTTT ... 121 CAAGTCTTTC ...

or multiple records, e.g.

1 TCCAAGGATA ... 
61 GAGGGCTTTT ... 
121 CAAGTCTTTC ...

[For this reason, please always mark up your data within <code>...</code> tags (as you've done with the code itself).]

And, as a logical extension to this, should your output be a single record or multiple records?

Given your data appears to be sequences of nucleotide bases (interspersed with positional numbers), it could potentially be very large. This may well affect the appropriateness of any given solution. What sort of size is uuu.txt?

Is uuu.txt just a single file you need to deal with or is it an example of one of many files?

"i have tried this code to remove only the white space but i didnt get the expected result can anyone help me to improve this??"

You haven't shown the result you got nor the result you expected. This makes suggesting improvements somewhat tricky.

However, having said that, it's clear from the code you've posted that you haven't really understood what the open function does. In brief:

open(my $fh, ">> uuu.txt") ...: You're opening the file in append mode, which is for writing only, not reading.
chomp $fh;: You're attempting to chomp the filehandle; you'd normally chomp the record you'd read (except you've read no records).
$fh =~ s/\s//g;: You're attempting a regex substitution on the filehandle; you'd normally do this on the data read (but, as with the last point, you've read no data).

Take a look at "perlintro: Files and I/O" for the very basics; then follow the links in that section for more details.
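
To make the three points above concrete, here is a small sketch of the usual pattern: open for reading, read each record, then chomp and substitute on the record rather than on the filehandle. The sub name read_cleaned_lines is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $file with all whitespace removed.
sub read_cleaned_lines {
    my ($file) = @_;
    # '<' opens for reading; '>>' would open for appending only.
    open my $fh, '<', $file or die "Cannot open '$file' for reading: $!";
    my @cleaned;
    while (my $line = <$fh>) {   # this is where data is actually read
        chomp $line;             # chomp the record, not the filehandle
        $line =~ s/\s//g;        # substitute on the record, not the filehandle
        push @cleaned, $line;
    }
    close $fh;
    return @cleaned;
}

# Usage: perl script.pl uuu.txt
print "$_\n" for map { read_cleaned_lines($_) } @ARGV;
```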

On to solutions:

An appropriate solution will depend very much on how you answered the earlier questions regarding your data.

When formulating a character class (see "perlrecharclass: Bracketed Character Classes"), consider a negated whitelist rather than attempting to generate a blacklist. If you're working with just DNA, you only want to keep [ACGT]: in other words, you want to remove everything which matches [^ACGT] (no need to worry about whitespace matching newlines, carriage returns, spaces, tabs, and so on). [[^ACGU] for RNA or [^ACGTU] for both DNA and RNA.]

For a one-off solution (with a smallish file), ++vinoth.ree's one-liner may well be appropriate; although, using the whitelist:

s/[^ACGT]//g

For a very large file, you may find transliteration is more efficient than regex substitution:

y/ACGT//cd

[See "perlperf: Perl Performance and Optimization Techniques: Search and replace or tr" for details (note: y and tr are synonyms) and Benchmark to check for yourself.]
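
A quick way to check this for yourself might look like the sketch below. The sub names and the sample data are made up for illustration, and the timings will depend on your Perl build and data, so no results are quoted here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two ways of keeping only A, C, G and T (hypothetical helper names).
sub clean_with_regex { my ($s) = @_; $s =~ s/[^ACGT]//g; return $s; }
sub clean_with_tr    { my ($s) = @_; $s =~ tr/ACGT//cd;  return $s; }

# Made-up sample: 1000 numbered lines of space-separated bases.
my $sample = join '', map { "$_ TCCAAGGATA AGTATGTAAA\n" } 1 .. 1000;

# Only run the (slow) comparison when asked: perl script.pl --bench
if (@ARGV and $ARGV[0] eq '--bench') {
    cmpthese(-1, {
        regex => sub { clean_with_regex($sample) },
        tr    => sub { clean_with_tr($sample) },
    });
}
```

The /c modifier complements the search list and /d deletes the matched characters, so tr/ACGT//cd removes everything that is not A, C, G or T.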

For multiple files, an actual script would be better. The code to make the changes would be much the same. You'll need to do the I/O yourself: see earlier notes about this.
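
As a rough sketch of such a script, assuming the files are named on the command line and that a temporary file next to each original is acceptable (the ".tmp" suffix and the sub name clean_file are assumptions):

```perl
use strict;
use warnings;

# Hypothetical helper: strip everything except A, C, G, T from one file.
sub clean_file {
    my ($file) = @_;
    my $tmp = "$file.tmp";   # assumed temporary name; must not already exist
    open my $in,  '<', $file or die "Cannot read '$file': $!";
    open my $out, '>', $tmp  or die "Cannot write '$tmp': $!";
    while (my $line = <$in>) {
        $line =~ tr/ACGT//cd;   # delete every character that is not a base
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close '$tmp': $!";
    # Replace the original only after the new file was written successfully.
    rename $tmp, $file or die "Cannot rename '$tmp' to '$file': $!";
    return;
}

# Usage: perl script.pl uuu.txt vvv.txt ...
clean_file($_) for @ARGV;
```

Reading line by line keeps memory use small for large files, at the cost of briefly needing disk space for the temporary copy.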

Another solution, which I suspect is probably inappropriate, but mentioned for completeness, is Tie::File. This effectively allows direct editing of disk files but, given the data I envisage you're working with, is likely to be horribly slow.

That may be sufficient information for you to complete your task. Feel free to ask if further information or general help is required.

— Ken


Re: Removing white space from the file
by Monk::Thomas (Friar) on Aug 12, 2015 at 07:59 UTC

$fh =~ s/\s//g;

You cannot usefully apply a substitution to a filehandle. (Well, you can, but it does not modify the actual file content.)

Another problem is:

open(my $fh, ">> uuu.txt")

...which would open the file for appending, but NOT for reading.

You did not specify whether you want an in-place edit of the file or whether you just want to convert the content for further processing. (For an in-place edit I would actually prefer to use something like sed -i~ 's/[0-9 ]//g' uuu.txt instead of perl, because I get a backup copy of the original file for free. Note that the expression needs quoting, since it contains a space.)


Re: Removing white space from the file
by Athanasius (Archbishop) on Aug 12, 2015 at 08:06 UTC

Hello GSperlbio,

Yet another option is to use the core Tie::File module:

#! perl
use strict;
use warnings;
use Tie::File;

my $data = 'uuu.txt';

tie my @lines, 'Tie::File', $data
    or die "Cannot tie file '$data': $!";

s{[^ACGT]}{}g for @lines;

untie @lines;

But note that this will not remove newlines.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Removing white space from the file
by Laurent_R (Canon) on Aug 12, 2015 at 08:00 UTC

So you would need something like this:

# ...
open my $in, "<", "uuu.txt" or die "cannot open the input file: $!";
open my $out, ">", "out.txt" or die "cannot open the output file: $!";
while (my $line = <$in>) {
     $line =~s/[\s\d]//g;
     print $out "$line\n";
}
close $_ for ($in, $out);
# ... renaming, etc.

Or use a variant of vinoth.ree's one-liner that keeps a backup copy (uuu.txt.bak):

perl -i.bak -p -e 's/[0-9\s]//g' uuu.txt


Re: Removing white space from the file
by marinersk (Priest) on Aug 12, 2015 at 12:34 UTC

While there's lots of good information in the previous answers about how to go about doing this, I tried to modify your script as little as possible.

Notes follow, as the changes needed were fairly extensive.

#!/usr/bin/perl
use strict;
use warnings;

# Set aside a place to store the data between read and write phases
my @OutputBuffer = ();

# Read whole file into memory, since we will eventually overwrite it
# Often called "slurp"
open(my $inputFH, "<", "uuu.txt") || die "cant open the original file";
while (my $inputLine = <$inputFH>)
{
     print $inputLine;
     # Since we're here, let's process the data
     # Often called "digest"
     chomp $inputLine;
     $inputLine =~ s/\s//g;
     push @OutputBuffer, $inputLine;
}
close $inputFH;

# Destroy old file and rewrite with new data
# Often called "spew"
open(my $outputFH, ">", "uuu.txt") || die "cant open the new file";
foreach my $outputLine (@OutputBuffer)
{
     print $outputFH "$outputLine\n";
}
close $outputFH;

# Fini
exit;

This adjusts your script to remove the whitespace, but has not yet added code to remove the numerical characters. I think you can probably handle that on your own, based on the code you've already generated.

Notes

If you wish to read in a file and then overwrite it with the modified data, the data has to live someplace in the interim.
- ~~You can't~~ It's so convoluted to try to read and write a text file in place as to be roughly equivalent to simply saying you can't do it.
- You further specified that you did not wish to write another file in the interim
- Thus, memory is the logical remaining option.
Reading the whole file into memory, processing it, and then dumping it all out to disk, is often referred to as "slurp, digest, and spew".
- This works fine on small files. Yours seems to be a small file.
- This works less and less well as the file size gets larger. Often referred to as "does not scale well".
So, the changes to your script I made above:

- Created @OutputBuffer to hold the data between the slurp and the spew.
- Corrected the syntax on your open statement so it would read the file, rather than try to append to it, during the slurp phase. Also adjusted to the three-argument form of open for clarity.
- Added the code to read the data from the file; open does not actually read any data from the file, it merely opens a channel for data to flow through. You have to actually tell Perl to read the data yourself.
- Went ahead and did the digest in the slurp phase; it was convenient, and (admittedly trivially) reduces the amount of data we have to store in memory.
- Closed and re-opened the file for write. Did not append as you did in your example, because then you'd have the old data plus the new data in it, and the file would risk growing exponentially with replicated data. Not sure why you chose >>, but guessing it was because you misunderstood how file I/O operations actually work in Perl (seeing as how you seemed to think you'd just get data by opening the file). If append is what you really want, we can chat about what you're trying to accomplish; but from what I see, a flat overwrite seems to be what you're looking for.
- Added the spew loop.

Good luck!


Re: Removing white space from the file
by anonymized user 468275 (Curate) on Aug 12, 2015 at 11:50 UTC

1) (assuming you have very limited disk-space, but enough memory to read in file)

- read the file in, remove spaces in memory then write it back

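open my $fh, '<', 'uuu.txt' or die "Cannot read uuu.txt: $!";
# s/// returns the substitution count, so return $_ explicitly from the block
my @new = map { s/\s+//g; $_ } <$fh>;
close $fh;
open $fh, '>', 'uuu.txt' or die "Cannot write uuu.txt: $!";
print $fh @new;
close $fh or die "Cannot close uuu.txt: $!";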

- for each line read in, remove spaces and write to a compressed file, perhaps on /tmp if it has the space.

- uncompress new file over old

- remove compressed file

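open my $fh, '<', 'uuu.txt' or die "Cannot read uuu.txt: $!";
open my $gh, '|-', 'bzip2 -9 -c > /tmp/uuu.txt.bz2' # or gzip if no bzip2
    or die "Cannot start bzip2: $!";
while (<$fh>) {
    s/\s+//g;
    print $gh $_ or suffer();
}
close $fh;
close $gh or suffer();
# uuu.txt is overwritten here, so keep the compressed copy if this step fails
system('bunzip2 -c /tmp/uuu.txt.bz2 > uuu.txt') == 0
    or die "bunzip2 failed; the data is still in /tmp/uuu.txt.bz2\n";
unlink '/tmp/uuu.txt.bz2';

sub suffer {
   warn "$!: /tmp/uuu.txt.bz2\n"; # e.g. $! reports disk full
   unlink '/tmp/uuu.txt.bz2';
   exit 1; # exit 256 would wrap around to status 0, i.e. success
}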

One world, one people


Re: Removing white space from the file
by crusty_collins (Friar) on Aug 12, 2015 at 14:04 UTC

use strict;
use warnings;
use File::Copy;

my $newfile = 'uuu.new';

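# lexical filehandle, three-argument open, and an error check
open(my $out, '>', $newfile) or die "Cannot open '$newfile': $!";

foreach my $line (<DATA>){

    print $line . "\n";

    # remove digits and whitespace (including the trailing newline)
    $line =~ s/[\d\s]+//g;

    print " New line : $line \n";
    print {$out} $line . "\n";
}

close $out or die "Cannot close '$newfile': $!";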

__DATA__
1 TCCAAGGATA AGTATGTAAA TACGGGGCGG GCTCTGGGAG GGGAGAGACT TTACAAAAAT
61 GAGGGCTTTT ATTTTCCATT TGGAACGTGG GACAACAGAC CACAACGCAA TTCCATTTTG
121 CAAGTCTTTC CAAGGGAGAA GCTGTTCAAC CACCCGTTTG GGGGATGAGT GAGCCGACAC
