I am using Spreadsheet::WriteExcel to create a file with 20,000 lines. The size of the file is quite large - over 100 megabytes.

However, when I open the file and save it as a regular Excel file, the file size goes down to 6 megabytes!

...

Any idea why this happens ...

The difference is simply that between the text formatting of your input file and the binary data format of .xls files.

Your input is ascii-encoded tab separated values. 1 million records each of 100 small integers takes nearly 400MB:

C:\test>perl -le"print join chr(9), map int( rand 1000 ), 1 .. 100 for + 1 .. 1e6" > junk.dat C:\test>dir junk.dat 02/01/2012 09:05 390,001,724 junk.dat

Conversely, that same data stored as binary takes just 2MB:

C:\test>perl -E"binmode STDOUT; print pack 'S', map int( rand 1000 ), +1 .. 100 for 1 .. 1e6" >junk.dat C:\test>dir junk.dat 02/01/2012 09:06 2,000,000 junk.dat

Mind you, compared with the latest file format used for such data, that ridiculous 200:1 size ratio is positively sane. The same data stored as XML:

#! perl -slw use strict; print '<file rows="1000000" columns="100" >'; for my $row ( 1 .. 1e6 ) { print qq[<row id="$row">]; print join '', map qq[<col id="$_">${ \int( rand 1000 )}</col>], 1 + .. 100; print qq[</row>]; } print '</file>'; __END__ C:\test>junkxml >junk.xml C:\test>head junk.xml <file rows="1000000" columns="100" > <row id="1"> <col id="1">897</col><col id="2">421</col>... "24">753</col><col id="25">707</col><col i... id="47">195</col><col id="48">889</col><c... ><col id="70">450</col><col id="71">779</c... ol><col id="93">769</col><col id="94">90</... </row> ... C:\test>dir junk.xml 02/01/2012 09:45 2,206,994,460 junk.xml

Takes up an absolutely ludicrous 2.2 Gigabytes or over 1000 times as much space to hold the exact same data!

Of course, the anti-binary file lobby and XML advocates will tell you that "binary files can't be manually inspected and edited" the way text files can. But really? Who is going to manually verify or edit 2 million numbers? Let alone do so whilst wading their way through the 2.2GB of pointless, repetitive, eye-crossing verbosity that is XML.

And then there is that other stand-by defence: "Disk space is cheap". Which is true. But IO transfer time and encoding/decoding costs definitely aren't.

They'll also say that text formats compress with much higher compression ratios than binary data, which they do. But that completely misses the point that even compressed with the very best algorithms, these text formats still require 100s of times more space than the uncompressed binary data.

The text version:

C:\test>dir junk.dat 02/01/2012 09:14 389,997,094 junk.dat C:\test>bzip2 -9 junk.dat C:\test>dir junk.dat.bz2 02/01/2012 09:14 131,705,924 junk.dat.bz2

And the XML version:

C:\test>dir junk.xml 02/01/2012 09:45 2,206,994,460 junk.xml C:\test>bzip2 -9 junk.xml C:\test>dir junk.xml.bz2 02/01/2012 09:45 216,299,791 junk.xml.bz2

Sure, a 10:1 compression rate is amazing, until you realise that the result still takes up 100 times as much space as the raw binary data. And it will therefore take 100 times as long to read from disk. Not to mention the 7 minutes of CPU it took to compress it; and the 2 minutes of CPU it'll take to decompress it and another 2 minutes of CPU it'll take to parse the XML; in order to get back the 2MB of data it contains that can be loaded from disk in less than 1 second.

So the answer to your question: "how I can make the original file smaller?" is, store it in binary format and save time, hassle, disk and cpu.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


In reply to Re: Spreadsheet::WriteExcel large files (text versus binary format) by BrowserUk
in thread Spreadsheet::WriteExcel large files by mrguy123

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.