pytheas has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I am a biologist and I haven't programmed in Perl for a long time, so my question may seem a little naive, but I'm at a total loss and any help would be appreciated. I need a Perl script to convert some binary files (.dcd, if anyone knows what they are) to ASCII, or just to extract the ASCII that they contain. I tried a plain read with no results. I searched the net and only found this code:
    use warnings;
    use strict;

    @ARGV == 2 or die "usage: $0 in_filename out_filename\n";

    # get first argument, i.e. filename
    my $in_filename = shift;
    print "You chose input <$in_filename>\n";
    my $out_filename = shift;
    print "You chose output <$out_filename>\n";

    # set infile to binary mode
    open INFILE, '<:raw', $in_filename or die "can't open $in_filename: $!";
    open OUTFILE, '>', $out_filename or die "can't open $out_filename: $!";

    # read 8 bytes at a time
    $/ = \8;
    while ( <INFILE> ) {
        print OUTFILE join( ', ', map sprintf( '0x%04x', $_ ), unpack 'S*', $_ ), "\n";
    }
But when I run it, it produces four columns of what look like hexadecimal values (0xe15d ...). The .dcd files are about 700 MB, so I can't upload one for you to see, but I managed to cut out the first 80 KB with hexedit and uploaded it here: http://www.gigasize.com/get.php?d=7qfs5f331bf Does anyone have any ideas or some plain guidelines? I really need to get this done for my work. Thanks anyway!

Replies are listed 'Best First'.
Re: binary to ascii convertion
by GrandFather (Saint) on Oct 21, 2008 at 10:16 UTC

    It seems most likely (from a little googling) that your ".dcd" files are CHARMM / X-PLOR format DCD files generated by FORTRAN. It further seems that there are several different versions and file formats associated with DCD. Unless you can find something that documents the format of the file, it will be rather difficult to pull it apart. DCD seems to be related to PDB, which is documented at http://www.wwpdb.org/docs.html. Update: PDB looks like a red herring.

    Update: http://local.wasp.uwa.edu.au/~pbourke/dataformats/fortran/ may help too. It describes how FORTRAN packs "unformatted" files.


    Perl reduces RSI - it saves typing
Re: binary to ascii convertion
by cdarke (Prior) on Oct 21, 2008 at 12:30 UTC
    or just extract the ascii that they contain

    This basic code will print an ASCII printable character, or '?' if outside the range:
    use warnings;
    use strict;

    while (<>) {
        for my $char (split '') {
            my $ord = ord($char);
            print $ord > 31 && $ord < 128 ? $char : '?'
        }
    }
    Run it from the command line like this: myscript.pl filename
    Note that this only covers ASCII; if you are using ISO Latin 1 (which you might be), then change the 128 to 256.

    Update: The code above is not ideal because it relies on new-lines to terminate each "record", which might not exist. Might be better with:
    use warnings;
    use strict;

    open (my $handle, '<', $ARGV[0]) or die "Unable to open $ARGV[0]: $!";
    binmode $handle;

    while (read ($handle, my $buffer, 80)) {
        for my $char (split '', $buffer) {
            my $ord = ord($char);
            print $ord > 31 && $ord < 128 ? $char : '?'
        }
        print "\n";
    }
    Run it in the same way as before.
Re: binary to ascii convertion
by almut (Canon) on Oct 21, 2008 at 14:26 UTC

    Not Perl... but maybe the R package Bio3D helps (directly or indirectly). At least, it contains a routine for reading .dcd files.

    (Or, in case you really have a need to do this in Perl, you could try to figure out what the R code does, and then reimplement it in Perl...  Or, read/convert the data via R into some other format that's more easily digestible in Perl...)

      The R routine is extremely interesting and effective. This is Fortran, right?
Re: binary to ascii convertion
by graff (Chancellor) on Oct 21, 2008 at 13:36 UTC
    If you are using a linux/unix box (or have unix tools ported to windows), there is a command called "strings" that extracts printable ascii content from binary files and dumps it to stdout. Try this shell command on your file:
    strings yourfilename.dcd | less
    And of course, you can redirect stdout to some other file, in case that helps you get your work done.
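If "strings" is not available, roughly the same idea can be sketched in Perl. This is only an approximation, not a reimplementation of the real tool: the sub name `printable_runs` is made up for illustration, and the sketch slurps the whole file into memory, which is fine for an 80 KB sample but probably not what you want for a 700 MB file.

```perl
use strict;
use warnings;

# Return every run of at least $min printable ASCII characters in $data.
# $min defaults to 4, the same default minimum length that strings(1) uses.
# (printable_runs is a hypothetical helper name, not part of any module.)
sub printable_runs {
    my ($data, $min) = @_;
    $min //= 4;
    return $data =~ /([\x20-\x7E]{$min,})/g;
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "can't open $file: $!\n";
    local $/;                       # slurp mode: read the whole file at once
    my $data = <$fh>;
    print "$_\n" for printable_runs($data);
}
```

Run it as: perl myscript.pl yourfilename.dcd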

    (BTW, I don't know what's up with that download link you provided. I couldn't seem to get any "dcd" data file, but I did get an invitation to spend money for some service that I presumably don't need.)

Re: binary to ascii convertion
by gone2015 (Deacon) on Oct 22, 2008 at 11:09 UTC

    The reference that GrandFather gave indicates that .dcd files are broken into records thus:

      <length0><record0><length0>
      <length1><record1><length1>
      ...
    
    each record being bracketed by its length. Quick inspection of your file indicates that the lengths are 32-bit integers, and little-endian. The lengths are for the record part only, so record0, with its overhead, occupies the first length0 + 8 bytes of the file.
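As an illustration of that framing only, here is a minimal sketch of pulling records out one at a time. It assumes exactly what is described above (32-bit little-endian lengths before and after each record) and nothing more; the sub name `read_record` is hypothetical, and other DCD variants may well need different handling.

```perl
use strict;
use warnings;

# Read one <length><record><length> unit from a binary filehandle.
# Assumes 32-bit little-endian length markers, as observed in the sample.
# Returns the record payload, or undef at a clean end of file.
sub read_record {
    my ($fh) = @_;

    my $got = read($fh, my $head, 4);
    die "read error: $!\n" unless defined $got;
    return undef if $got == 0;                 # nothing left to read
    die "truncated length marker\n" if $got < 4;
    my $len = unpack 'V', $head;               # 'V' = unsigned 32-bit little-endian

    $got = read($fh, my $payload, $len);
    die "record length $len exceeds rest of file\n"
        unless defined $got && $got == $len;

    $got = read($fh, my $tail, 4);
    die "missing trailing length marker\n"
        unless defined $got && $got == 4;
    my $check = unpack 'V', $tail;
    die "length markers disagree ($len vs $check)\n" if $check != $len;

    return $payload;
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "can't open $file: $!\n";
    my $n = 0;
    while (defined(my $record = read_record($fh))) {
        printf "record %d: %d bytes\n", $n++, length $record;
    }
}
```

Checking the trailing length against the leading one is a cheap way to notice when the guess about the framing is wrong, or when the file has been cut short.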

    The code below is a quick and dirty .dcd reader.

    Having established the record structure, the real problem appears to be that you need to know the format of each record in order to be able to unpick it.

    I had a quick go, first to extract only characters \x20-\x7F, replacing runs of other stuff by '~~' and singletons by '~'. I observed that where there were numbers, they appeared to be 32-bit, so I also took each record in 4-byte groups, and any that were not 4 characters \x20-\x7F I also rendered as integers (unsigned), and if plausible as floats -- indicating sections of characters as #999.
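A rough sketch of those two views follows. It is not the code that produced the output in this post: the sub names `printable_view` and `describe_groups` are invented for illustration, the float reading is printed for every non-text group rather than only the "plausible" ones, and the #999 section markers are left out.

```perl
use strict;
use warnings;

# View 1: keep characters \x20-\x7F, replace a single other byte by '~'
# and a run of two or more other bytes by '~~'.
sub printable_view {
    my ($record) = @_;
    (my $text = $record) =~ s/([^\x20-\x7F]+)/length($1) > 1 ? '~~' : '~'/ge;
    return $text;
}

# View 2: walk the record in 4-byte groups. A group made only of
# characters \x20-\x7F is shown as quoted text; anything else is shown as
# an unsigned little-endian 32-bit integer in hex, followed by the same
# four bytes read as a little-endian single-precision float.
sub describe_groups {
    my ($record) = @_;
    my @out;
    for my $group (unpack '(a4)*', $record) {
        if (length($group) < 4) {
            # leftover bytes that do not fill a whole group
            push @out, sprintf 'tail(%s)', unpack 'H*', $group;
        }
        elsif ($group =~ /\A[\x20-\x7F]{4}\z/) {
            push @out, "'$group'";
        }
        else {
            my $int   = unpack 'V',  $group;   # unsigned 32-bit little-endian
            my $float = unpack 'f<', $group;   # needs Perl 5.10 or later
            push @out, sprintf '0x%X(%g)', $int, $float;
        }
    }
    return @out;
}

# Small demonstration on a made-up 8-byte record.
print join(' ', describe_groups('CORD' . pack('V', 0x40568000))), "\n";
```

Whether a given group is really text, an integer or a float cannot be decided from the bytes alone, which is why both numeric readings are printed side by side.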

    The result was a bit disappointing:

         0: 'CORDe~~8~N~~X~P~~'=~~'
            #4 0x365 0x14E1438(3.78507e-38) 0xC8 0x150BA58(3.83373e-38)
            0x0 0x0 0x0 0x0 0x0 0x3D2790E3(0.0409097) 0x1 0x0 0x0 0x0
            0x0 0x0 0x0 0x0 0x0 0x18
         1: '~~REMARKS FILENAME=output/restart6_out.dcd CREATED BY NAMD      '
            '                  REMARKS DATE: 06/07/07 CREATED BY USER: fadoul'
            'o                                 '
            0x2 #160
         2: '$~~'
            0xE824
         3: 'k+~~_@~~V@~3~!-~R@~~V@~~V@~~C~FO@'
            0xF2CA2B6B(-8.00876e+30) 0x405FE6F2(3.49847) 0x0 0x40568000(3.35156)
            0x21FF33A1(1.72931e-18) 0x40529C2D(3.29078) 0x0 0x40568000(3.35156)
            0x0 0x40568000(3.35156) 0x43199BB3(153.608) 0x404F4683(3.23868)
    failed Record length 237712 > rest of file 86216 (@0x14C of file 'skata3')
    
    Such is life. Record 4 is big. It appears to be 59,428 32-bit numbers, at least going by the first 100 or so. I note that Record 2 contains the value 59,428.
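The arithmetic behind that observation can be checked in a few lines. The two constants are taken from the output above; the sub name `count_32bit_values` is made up, and treating Record 2 as a count of values in Record 4 is only a guess based on the numbers matching.

```perl
use strict;
use warnings;

my $record4_length = 237_712;   # bytes, from the "failed Record length" line
my $record2_value  = 0xE824;    # the single value shown for record 2

# Number of 32-bit values that fit in a record of $bytes bytes.
sub count_32bit_values {
    my ($bytes) = @_;
    die "length $bytes is not a multiple of 4\n" if $bytes % 4;
    return $bytes / 4;
}

printf "record 4 holds %d values; record 2 contains %d\n",
    count_32bit_values($record4_length), $record2_value;
# prints: record 4 holds 59428 values; record 2 contains 59428
```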