Re: Convert numbers from a binary file to texts

The little I could find is that the SFF format has a structure you need to unpack each time.

See the SFFStruct:

http://nl.mathworks.com/help/bioinfo/ref/sffread.html?s_tid=gn_loc_drop

Now, the first bytes seem to be a set of packed bytes that are called InfoStruct:

http://nl.mathworks.com/help/bioinfo/ref/sffinfo.html

One of the links says that a sample SFF file can be downloaded from NCBI, but I cannot find it on that page. If you have a sample, maybe we can work something out. (MathWorks does not tell what size each struct is or what delimiters it has, i.e. the unpack string is more complex than "C1".)

Failed attempts at searching for SFF: Serialize Flat File, Structured Fax File Format

What we seem to be looking for: Standard Flowgram Format (SFF)

Found something at sra-tools, which has a good help page: Building-and-Installing-from-Source

Inside the sources I see (search for "sff") that SRASFFFile_Read() already does a good job of reading the format.

As I see things, there is this hard-to-compile C code ready to be used, or the even harder route of reverse engineering the structure and implementing it in Perl...

However, there are PREBUILT binaries (Linux AND Windows) at: https://github.com/ncbi/sra-tools/wiki/Downloads

And the tools inside it are sufficient to see SFF information:

sff-dump -Z SRR996643 | sffinfo - | less
Allows you to view the data in sff.txt (human readable) format. With the 454 utility 'sffinfo' installed, this command dumps sff to stdout (-Z), pipes it to sffinfo ("-" needed for sffinfo to accept from stdin), and then to 'less' for viewing.

So unless you have a Mac... just download and use... edit: Nope, there is a prebuilt macOS package too.

... not quite so: sffinfo is not part of the package. There are references to it elsewhere though, for example on 454.com and mothuR.

I'm getting the idea this bioinformatics world is more complex than it should be...

new edit/addendum:

./ncbi-vdb/interfaces/sra/sff-file.h

/*===========================================================================
*
*                            PUBLIC DOMAIN NOTICE
*               National Center for Biotechnology Information
*
*  This software/database is a "United States Government Work" under the
*  terms of the United States Copyright Act.  It was written as part of
*  the author's official duties as a United States Government employee and
*  thus cannot be copyrighted.  This software/database is freely available
*  to the public for use. The National Library of Medicine and the U.S.
*  Government have not placed any restriction on its use or reproduction.
*
*  Although all reasonable efforts have been taken to ensure the accuracy
*  and reliability of the software and data, the NLM and the U.S.
*  Government do not and cannot warrant the performance or results that
*  may be obtained by using this software or data. The NLM and the U.S.
*  Government disclaim all warranties, express or implied, including
*  warranties of performance, merchantability or fitness for any particular
*  purpose.
*
*  Please cite the author in any work or product based on this material.
*
* ===========================================================================
*
*/
#ifndef _h_sra_sff_file_
#define _h_sra_sff_file_

#include <klib/defs.h>

#ifdef __cplusplus
extern "C" {
#endif

/* ======================================================================
 * SFF defines an 8 bit value in the file that tells of the format of the
 * data signal (flowgrams in Roche 454 SFF parlance).
 *
 * The only currently defined format is a 16 bit unsigned integer in
 * units one hundredths. This enum is to easily allow us to add other
 * formats if ever required.
 */
typedef enum SFFFormatCode
{
    SFFFormatCodeUnset = 0,
    /* values are 16 integers of hundreths of units: 0 = 0.00, 1 = 0.01, 2 = 0.02, ... */
    SFFFormatCodeUI16Hundreths,
    /* currently (SFF (00000001) yet this is the only one SFFFormatCode is defined */
    SFFFormatCodeUndefined
}       SFFFormatCode;

/* ----------------------------------------------------------------------
 * Common Header Section 
 * (Genome Sequencer Data Analysis Software Manual Section 13.3.8.1)
 */
#define SFFCommonHeader_size 31

typedef struct SFFCommonHeader_struct
{
    uint32_t magic_number;         /* four bytes ".sff" as string: with wrong endian it would be "ffs." */
    uint32_t version;              /* four bytes 0x00000001 */
    uint64_t index_offset;         /* index_offset and index_length are the offset and length of an */
    uint32_t index_length;         /* optional index of the reads in the file. If no index both are 0 */
    uint32_t number_of_reads;      /* The number of reads in the file (not individual datum) */
    uint16_t header_length;        /* length of all headers in this set.  31 + flow_length + key_length + pad to 8 byte boundary */
    uint16_t key_length;           /* length of the key sequence for these reads */
    uint16_t num_flows_per_read;   /* the number of flows for each read in this file */
    uint8_t  flowgram_format_code; /* SFFFormatCode between (SFFFormatCodeUnset..FormateCodeUndefined) exclusive */
    /* not included variable length portion of header:
        flow chars   - sequence of uint8_t, actual length is num_flows_per_read above
        key sequence - sequence of uint8_t, actual length is key_length above
        padding      - sequence of zeroed uint8_t to make total length of file header 8-byte padded
    */
} SFFCommonHeader;

/* ----------------------------------------------------------------------
 * Read Header Section 
 * (Genome Sequencer Data Analysis Software Manual Section 13.3.8.2)
 */
#define SFFReadHeader_size 16

typedef struct SFFReadHeader_struct
{
    uint16_t    header_length;          /* length in bytes of the full section including padding */
    uint16_t    name_length;            /* length of the name of this spot */
    uint32_t    number_of_bases;
    uint16_t    clip_quality_left;
    uint16_t    clip_quality_right;
    uint16_t    clip_adapter_left;
    uint16_t    clip_adapter_right;
    /* not included variable length portion of header:
        name    - sequence of uint8_t, actual length is name_length above
        padding - sequence of zeroed uint8_t to make total length of read header 8-byte padded

        read data section:

        signal - sequence of uint16_t (if flowgram_format_code == SFFFormatCodeUI16Hundreths, see enum above),
                 actual length is num_flows_per_read from file header above
        flow_index_per_base (position) - sequence of uint8_t, actual length in number_of_bases above
        bases - sequence of uint8_t, actual length in number_of_bases above
        quality_scores - sequence of uint8_t, actual length in number_of_bases above
        padding - sequence of zeroed uint8_t to make total length of read data section 8-byte padded
    */
} SFFReadHeader;

#ifdef __cplusplus
}
#endif

#endif /* _h_sra_sff_file_ */

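And thus the fixed-size first part (31 bytes) of the common header would be:

($FOURCC, $version, $index_offset, $index_length, $number_of_reads, $header_length, $key_length, $num_flows_per_read, $flowgram_format_code) = unpack("a4 N Q> N N n n n C", $data);

(Note: "a4" reads the magic as one 4-byte string; a count on "c" or "C" would instead return four separate values. As far as I can tell SFF stores its numbers big-endian, hence N, n and Q> rather than the native-order L, S and Q. Q> needs a perl built with 64-bit integer support.)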

From there we need to read some more bytes:

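$padding = 0; while( (31 + $num_flows_per_read + $key_length + $padding) % 8 ) {++$padding}

$SFFCommonHeader_struct_size = 31 + $num_flows_per_read + $key_length + $padding;

Bytes still to read at this point = $SFFCommonHeader_struct_size - 31. Going by the comment in the header file, the padding rounds the whole header (including the 31 fixed bytes) up to an 8-byte boundary, so this should also equal $header_length - 31.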
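Putting the common header steps together, here is a minimal, untested-on-real-data sketch. The sub name read_common_header is made up for illustration, the big-endian byte order is my reading of the format, and the demo bytes are synthetic (built with pack), not taken from a real SFF file.

```perl
use strict;
use warnings;

# Parse the SFF common header from a filehandle opened in binary mode.
# Assumptions: big-endian fields, and a perl with 64-bit integers ("Q>").
sub read_common_header {
    my ($fh) = @_;

    # Fixed-size part: 31 bytes.
    read($fh, my $fixed, 31) == 31 or die "short read on fixed header";
    my %h;
    @h{qw(magic version index_offset index_length number_of_reads
          header_length key_length num_flows_per_read flowgram_format_code)}
        = unpack('a4 N Q> N N n n n C', $fixed);
    die "not an SFF file" unless $h{magic} eq '.sff';

    # Variable part: flow chars, key sequence, zero padding.
    # header_length already includes the padding to an 8-byte boundary.
    my $rest_len = $h{header_length} - 31;
    read($fh, my $rest, $rest_len) == $rest_len
        or die "short read on variable header";
    $h{flow_chars}   = substr($rest, 0, $h{num_flows_per_read});
    $h{key_sequence} = substr($rest, $h{num_flows_per_read}, $h{key_length});

    return \%h;
}

# Demo on a synthetic header (made-up values).
my ($flows, $key) = ('TACG' x 2, 'TCAG');
my $len = 31 + length($flows) + length($key);
$len += (8 - $len % 8) % 8;                      # round up to a multiple of 8
my $bytes = pack('a4 N Q> N N n n n C',
                 '.sff', 1, 0, 0, 3, $len, length($key), length($flows), 1)
          . $flows . $key;
$bytes .= "\0" x ($len - length($bytes));        # zero padding

open my $fh, '<', \$bytes or die $!;             # in-memory filehandle
binmode $fh;
my $hdr = read_common_header($fh);
print "$hdr->{number_of_reads} reads, header $hdr->{header_length} bytes, ",
      "flows $hdr->{flow_chars}, key $hdr->{key_sequence}\n";
```

For a real file you would open it with '<:raw' instead of the in-memory string; after this call the filehandle sits at the first read header.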

Ok, that, so far, is just the header. Tired now, going to do other things. Maybe someone else wants to continue...
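For whoever wants to continue: going only by the SFFReadHeader comments in the header file above, one read (read header plus read data section) could be parsed roughly like this. Again a sketch under assumptions: read_one_read and pad8 are names I made up, the byte order is assumed big-endian, the flowgram format is assumed to be the uint16 hundredths one, and the demo bytes are synthetic.

```perl
use strict;
use warnings;

# Round $n up to the next multiple of 8.
sub pad8 { my ($n) = @_; return $n + (8 - $n % 8) % 8 }

# Parse one read from $fh. $num_flows is num_flows_per_read from the
# common header. Assumes big-endian fields and flowgram format code 1.
sub read_one_read {
    my ($fh, $num_flows) = @_;

    # Fixed-size part of the read header: 16 bytes.
    read($fh, my $fixed, 16) == 16 or die "short read on read header";
    my %r;
    @r{qw(header_length name_length number_of_bases
          clip_quality_left clip_quality_right
          clip_adapter_left clip_adapter_right)}
        = unpack('n n N n n n n', $fixed);

    # Name plus zero padding; header_length includes both.
    my $rest_len = $r{header_length} - 16;
    read($fh, my $rest, $rest_len) == $rest_len or die "short read on name";
    $r{name} = substr($rest, 0, $r{name_length});

    # Read data section: signal, flow index, bases, quality, padding.
    my $nb       = $r{number_of_bases};
    my $sig_len  = 2 * $num_flows;
    my $data_len = pad8($sig_len + 3 * $nb);
    read($fh, my $data, $data_len) == $data_len or die "short read on data";

    $r{signal}     = [ map { $_ / 100 } unpack('n*', substr($data, 0, $sig_len)) ];
    $r{flow_index} = [ unpack('C*', substr($data, $sig_len, $nb)) ];
    $r{bases}      = substr($data, $sig_len + $nb, $nb);
    $r{quality}    = [ unpack('C*', substr($data, $sig_len + 2 * $nb, $nb)) ];
    return \%r;
}

# Demo on one synthetic read (made-up values, 8 flows, 6 bases).
my $name = 'READ1';
my $hlen = pad8(16 + length($name));
my $head = pack('n n N n n n n', $hlen, length($name), 6, 0, 0, 0, 0) . $name;
$head .= "\0" x ($hlen - length($head));
my $body = pack('n8 C6 a6 C6',
                101, 2, 98, 5, 203, 0, 96, 104,   # signal, in hundredths
                1, 2, 2, 0, 2, 1,                 # flow index per base
                'TCTTCG',                         # bases
                30, 30, 28, 28, 32, 31);          # quality scores
$body .= "\0" x (pad8(length($body)) - length($body));

my $bytes = $head . $body;
open my $fh, '<', \$bytes or die $!;
binmode $fh;
my $r = read_one_read($fh, 8);
print "$r->{name}: $r->{bases}, signal @{ $r->{signal} }\n";
```

Calling this number_of_reads times after the common header should walk the whole file, assuming the optional index block (index_offset, index_length) does not sit between reads; I have not checked how real files lay that out.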

DISCLAIMER: I have no idea what I am doing. I did not check any code.
