I don't know about the industry, but my Master's project involves visualizing LC-MS (Liquid Column Mass Spectrometry) data. The raw data, once converted from the proprietary format, is a 130 megabyte XML file, containing 128 "scans", each of which contains roughly 200,000 pairs of (mass, intensity) pairs, called peaks (both floating point numbers). Relax, the pairs are not stored inside verbose XML tags, but the actual data is a big Base64 encoded, packed string.

I don't luckily have to use the raw data, as the proprietary conversion tool is able to produce centroided data (which, by the way, reduces the dataset size by four orders of magnitude). However, I used the raw data when I tested parsing the XML file -- a great way to see which part of your code scales up and which assumes that the data is not big.

I have to parse the XML file in two passes. On the first pass, the metadata is read in. On the second pass, byte offsets to the beginning of the actual Base64 encoded data blocks and their lengths are saved in memory. Later, one can read in individual scans from the file by simply seeking to the position and reading that many bytes. Initially, I had the parser (which, by the way, is XML::Twig) do both in one pass, but this was horribly slow for some reason. I suspect XML::Twig tried parsing the Base64 encoded data as XML, because with two passes, I can simply make it skip the data in both. Could be a user error.

Although it's not that big a file, parsing it takes half a minute, and reading in the peak data takes five or six minutes. I'm storing the peaks in piddles, so they don't take much more space in memory than on disk. Then, in the visualizing phase, it takes seven or eight minutes to thread over the 128 piddles and plot them. This is on Athlon XP 2600+ with half a gigabyte of memory. This is a bit of a pain, as the application area requires you to be able to zoom in to the data, and waiting eight minutes between zooms is not something the casual user is going to do, unless he's smoking weed.

I suspect I could squeeze that to four minutes by using C, but currently there's not much need for that, as the centroided data is good enough, and I'm already using PDL as much as I can (i.e. in tight, implicit loops).

--
print "Just Another Perl Adept\n";


In reply to Re: What is a "big job" in the industry? by vrk
in thread What is a "big job" in the industry? by punch_card_don

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.