http://qs1969.pair.com?node_id=310619

Item Description: a module for simple data munging

Review Synopsis:


Overview

desift is a Perl program for data munging. What is data munging? According to Dave Cross, author of ``Data Munging in Perl'', data munging has 3 phases:

  1. Read in the data
  2. Transform the data
  3. Output the data

Since desift is a data munging program, we can describe it via this framework.


Read in the data

desift reads in your data from STDIN or from files specified on the command-line after the option switches. It splits your data into a Perl array for you. You control the split via the -d switch:

  -d REGEX      Field delimiter in input-file(s). Default is "\t" (tab)

Filtering

A common phase of reading in data is filtering out what you don't want. To specify input lines that you want to skip, supply the -s option to desift:

  -s REGEX      Skip rows in input-file(s) matching REGEX.


Transform the data

In desift, the input data is transformed via a template string which may be supplied on the command-line or in a file. There are two elements to the template string: plain text and positional tags indexing into the array built from splitting your input data.

If your template string is in a file, use the -t option. If you want to supply the template string on the command-line, then use the -T option.

Here is a sample desift command using flags we have seen so far:

   ls -l | desift -d"\s+" -T"File: %9        Permissions: %1" -s"^total"

It's not a completely perfect example, because filenames with spaces will only have the part listed before the space in the filename. Here is some sample output:

   File: chessgames-dotcom        Permissions: -rwxr-xr-x
   File: desift        Permissions: -rwxrwxrwx
   File: desift.pod        Permissions: -rw-r--r--
   File: gerbold        Permissions: drwxr-xr-x+
   File: upload-cpan.pl        Permissions: -rwxr-xr-x
   File: xemacs.bat        Permissions: -rwxr-xr-x

We can learn some things from looking at this output. First of all, the word Permissions does not always start at the same column. My first attempt to fix this was to put a tab in the template string. To do so, you must manually put a tab in the string: \t or \\t or changing the string from single to double-quote does not work. Even so, the output is still not lined up:

   File: chessgames-dotcom      Permissions: -rwxr-xr-x
   File: desift Permissions: -rwxrwxrwx
   File: desift.pod     Permissions: -rw-r--r--
   File: gerbold        Permissions: drwxr-xr-x+
   File: upload-cpan.pl Permissions: -rwxr-xr-x

And this makes output hard to read. I envision two possble solutions to this problem. One possible fix is to have a template flag which takes a numeric argument indicating at which column the output should be written. Another fix is more time and compute-consuming. Sift could read in all the lines and then output them with just enough space for the columns to line up... sort of like a database does when you SELECT data.


Outputting the data

In a sense, transformation and output are one step in desift. Once a line of data is transformed, it is then output.


Assessment

desift is a cleanly written module which simplifies and abstracts the split, array-slice, print-and-join cycle of programming leading to one succinct command-line instead of a series of function calls.

In looking back at a recent project of mine, I find desift to be inadequate for what I had to do. I had a CSV-file with name, email, phone, etc. I had to filter this file for profanity and invalid email addresses and then output the new file in a tab separated format for import into a database.

desift's limitations

First of all, the input phase. Parsing CSV is not easy. I could not pass desift a regexp to do such a split and field-massage properly. Also, some of the data was in Unicode format and only Text::CSV_XS with its Binary option was robust enough for this task. Also I was dealing with 4 files of 25 million lines each, so using a c-based module such as CSV_XS was desirable for speed reasons as well.

For the transformation phase *of this project*, desift was adequate. However, what if I wanted apply the Perl lc function to a field instead of just writing it? That is a very likely operation and impossible with desift. Now, what if the template for desift were passed off to sprintf instead of its own custom sprintf-like formatter? And what if access to the split array were via a localized variable such @_split? Then we could do lc if we wished:

<code> ls -l | desift -s``^total'' -d``\s+'' \ --sprintf-string=``File: %s Permissions: %s'' --sprintf-args=``lc($_split8), $_split[0]'' </code>

Also why is there not possibility of filtering data after it is split? Perhaps we can only determine if data should be transformed or output after a test of some sort. A test such as adding two fields together, doing a SQL SELECT on a database or grepping a file, or seeing if a certain or all fields were in the line. Thus we can conclude that filtering callbacks should be available at each stage as executable Perl subroutines and not limited to regular expressions at any stage.

As mentioned earlier, the output phase is wed to the transformation phase. Thus it is up to you use I/O redirection to capture desift output. For example, it is not (and perhaps should not be?) possible to commit desift results directly to a database on a line-by-line basis. But in my experience, c-based SQL loaders supplied with databases are 8-fold faster than isolated inserts via Perl/DBI. So, complete of control over channeling output may or may not be a Bad Thing.

desift and the Future

There are a number of splitting and filtering and formatting modules available on CPAN:

splitting (a common form of input transformation):

Parse::FixedLength, Parse::FixedDelimiter, Text::xSV, Text::CSV_XS, DBD::AnyData, Spreadsheet::ParseExcel.

filtering

Core Perl provide grep which is extremely adequate in a large number of cases. Common CPAN modules for doing so are Regexp::Common::profanity_us and Email::Valid.

formatting

Core perl provides HERE documents, sprintf. And then on CPAN, there are very few options for formatting a data model (hah!). Need I list any? Template, Data::Table, Data::Presenter, HTML::Template, Text::MagicTemplate


Obtaining desift

desift is available at http://desift.sourceforge.net

It was written by James Shimada.