Hello all,
I'm writing a technical paper on the use of Perl as a data mining tool, and I want your massed wisdom.
My intention is to write a paper that tantalizes people with the power of Perl, and provides community wisdom for all of us who write data processing scripts. I am not under contract to write this, nor will I make a dime off of it. In fact, I don't even have a place to submit this to... I'll probably just post it on my website.
For those who want a real publication, I think a book such as Data Munging with Perl would do. I just want to coalesce some experience into practical advice and examples.
I've been programming Perl for a long time now. I could classify most of my work as 'data management' or perhaps 'data mining' -- bringing order and meaning to text data. Be it logfile, list, or database dump, it seems like the method of extracting data can be codified. Or at least organized. Well, maybe some principles can be gleaned. Or how about just sympathy?
Casting about for a buzzword, I'd like to call it "Bottom-Up Data Mining". Or maybe "Bottom-Up Data Analysis". Akin to the idea of bottom-up programming, bottom-up data mining is an iterative process by which data, starting out as 'unknown', proceeds along a path of extraction to a final 'known' state, where all meaningful data (for the task at hand) can be extracted with acceptable accuracy. Starting at the bottom, the Perl programmer has some text files of data to mine, and perhaps written requirements. As the script is iteratively written and refined, data from the source files resolves into greater detail.
There's nothing earth-shattering there. But the sheer volume of experience from Perlmonks can contribute a gazillion suggestions and hints. Writing a paper is my way of organizing my thoughts. It can also be a contribution to the world at large about how any programmer can learn how to mine data with Perl in one afternoon.
Subjects I've been thinking about covering in the paper include "do's and don'ts" of text processing, including what I call "The Burrito Principle" (stolen from the Pareto Principle), which basically says to aim for the most useful data first ("80% of the meat is in 20% of the burrito"). I'll mention some good ways of parsing text, including paying careful attention to the record separator, and modules on CPAN that may have already invented your wheel for you. I'm also thinking about a section that mentions programming languages that are likely to be helpful for data mining, and languages that are likely to be more trouble than they're worth for data mining (Java).
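To illustrate the record-separator point, here's a minimal sketch (the log data is made up for illustration): setting $/ to the empty string puts Perl in paragraph mode, so each read grabs a whole blank-line-delimited record instead of a single line.

```perl
use strict;
use warnings;

# Hypothetical log where each record is a blank-line-separated block.
my $data = "host: alpha\nstatus: ok\n\nhost: beta\nstatus: down\n";
open my $fh, '<', \$data or die "open: $!";

local $/ = "";    # paragraph mode: records end at blank lines
my @records;
while (my $rec = <$fh>) {
    push @records, $rec;
}
close $fh;

print scalar(@records), " records\n";   # 2 records
```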
Needless to say, the paper will have a subjective feel to it, but I'm offsetting that with code snippets to inject a dose of reality into my arguments.
So, your comments, suggestions, and offerings would be nice.
Rob
Re: Bottom-Up Data Mining with Perl
by dragonchild (Archbishop) on Mar 05, 2003 at 18:42 UTC
Right off the top of my head, some things to include would be:
- The flip-flop operator
- Special variables such as $\, $", and others.
- chomp vs. chop and where each is good
- Templates. They're not just for HTML! (Useful for reading as well as writing.)
- Data-driven parsing.
- Functional parsing (different from data-driven). tilly wrote something very cool on this topic regarding HTML-like parsing with functional programming.
- When to use a regex vs split vs unpack.
- How to use unpack! (I still don't get how to use it ...)
- The Burrito principle. (Very cool!)
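A couple of the items above can be sketched quickly (toy data, just to show the shape): the flip-flop operator for pulling a delimited block out of a stream, and chomp vs. chop.

```perl
use strict;
use warnings;

# The flip-flop operator (..) in scalar context is false until its
# left side matches, then true until its right side matches: handy
# for extracting a BEGIN..END block from a stream of lines.
my @lines = ("junk\n", "BEGIN\n", "payload 1\n", "payload 2\n", "END\n", "junk\n");
my @block;
for my $line (@lines) {
    push @block, $line if $line =~ /^BEGIN/ .. $line =~ /^END/;
}
# @block now holds BEGIN through END inclusive (4 lines)

# chomp removes a trailing $/ (newline) only if one is present;
# chop blindly removes the last character, whatever it is.
my $s = "data\n";
chomp(my $c = $s);   # "data"
my $d = "data";
chop $d;             # "dat" -- chop took a real character

print scalar(@block), " lines; '$c' vs '$d'\n";
```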
Post your paper on PM when it's done. I would love to read it!
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement. Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Could you expand on how split, pack, unpack and regexes are related? I feel there's something to what you say, but I can't at all pin it down.
split, unpack, and regexes are all ways to parse a given line of data. Each is useful in different circumstances. For example:
- split is more useful with delimited lines, such as tab-delimited or comma-delimited. (However, using a module like Text::CSV is better for delimited text. This is because of lines like "abcd,'Smith, John', blah" - the comma in the quotes is part of the item, not a delimiter.) Now, one could use a regex here, but the regex is harder to understand, and even harder to get right.
my @items = split /$delim/, $line;

# vs. a hand-rolled regex (and, as noted, this one will make mistakes):
my @items = $line =~ /([^$delim]*)(?:$delim|$)/g;
- unpack (if you understand how to use it!) is really good with fixed-width data: so many columns for the first field, so many for the second, etc. This is often data from a mainframe.
Again, you can use a regex here, but you have to work at it for it to be maintainable. (I'd put an unpack example here, if I were comfortable knowing how to work it.)
my @columns = ( 20, 10, 25, 5, 2, 2, 20 );
my $regex   = join '', map { "(.{$_})" } @columns;
$regex      = qr/^$regex$/;
my @items   = $line =~ $regex;
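For what it's worth, here is a sketch of what the unpack version might look like for the same fixed-width columns (the sample record is made up): "A20" in an unpack template means a 20-byte ASCII field with trailing spaces stripped.

```perl
use strict;
use warnings;

my @columns  = (20, 10, 25, 5, 2, 2, 20);
my $template = join '', map { "A$_" } @columns;   # "A20A10A25A5A2A2A20"

# A hypothetical mainframe-style record, padded out to the column widths
my $line = sprintf "%-20s%-10s%-25s%-5s%-2s%-2s%-20s",
    'Smith, John', 'ACCT01', '123 Main St', '4021', 'NY', '02', 'ACTIVE';

# unpack slices the line by width and strips the padding
my @items = unpack $template, $line;
print join('|', @items), "\n";
# Smith, John|ACCT01|123 Main St|4021|NY|02|ACTIVE
```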
For every example I give on different parsing needs, there is a module on CPAN that does it better, faster, and safer. I personally would never hand-parse data in production. Heck, you can use HTML::Parser to pull data out of HTML pages without even having an http server!
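As one instance of an already-invented wheel: Text::ParseWords ships with Perl and already handles the quoted-delimiter case that breaks a naive split (this snippet uses the same "Smith, John" shape as the example above):

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line  = q{abcd,"Smith, John",blah};
my @naive = split /,/, $line;            # 4 fields -- the quoted comma splits
my @items = parse_line(',', 0, $line);   # 3 fields -- quotes are honored

print scalar(@naive), " vs ", scalar(@items), "\n";   # 4 vs 3
print $items[1], "\n";                                # Smith, John
```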
Re: Bottom-Up Data Mining with Perl
by gmax (Abbot) on Mar 05, 2003 at 18:55 UTC
Here's some food for thought.
Check this article and the related meditation on this site.
My personal advice is related to database programming.
You should explain when to use Perl and when to delegate the task to a database engine. For example, comparing large lists of records and computing statistics on them is a task for a DBMS rather than a candidate for a Perl script. Of course, in such cases you should introduce the beauty of the DBI.
The opposite cases are also interesting, when people try to do in SQL things that are better left to Perl. As an example, parsing and cleaning data before storing it into a database is a perfect job for Perl, which is the best companion tool for every database administrator.
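A minimal sketch of that division of labor (the records are made up): Perl normalizes the messy input, and the actual INSERT is left to the database layer via DBI placeholders.

```perl
use strict;
use warnings;

# Messy, inconsistently formatted input rows: name, first name,
# date of birth, state.
my @raw = (
    "  Smith, John ,1968-2-5 , NY \n",
    "DOE,JANE,1975-11-30,ca\n",
);

my @clean;
for my $line (@raw) {
    chomp $line;
    my @f = split /,/, $line;
    s/^\s+|\s+$//g for @f;                        # trim whitespace
    $f[0] = ucfirst lc $f[0];                     # normalize name case
    $f[1] = ucfirst lc $f[1];
    my ($y, $m, $d) = split /-/, $f[2];
    $f[2] = sprintf "%04d-%02d-%02d", $y, $m, $d; # canonical ISO date
    $f[3] = uc $f[3];                             # state code
    push @clean, \@f;
}

# Each row is now ready for $sth->execute(@$_) against an
# INSERT ... VALUES (?, ?, ?, ?) statement handle.
print join('|', @{$clean[0]}), "\n";   # Smith|John|1968-02-05|NY
```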
_ _ _ _
(_|| | |(_|><
_|
Re: Bottom-Up Data Mining with Perl
by l2kashe (Deacon) on Mar 05, 2003 at 19:42 UTC
A lot of data processing is having your data in the right components, ones which align with your view of it.. Or rather, I would touch upon uses for arrays, hashes, AoAs, AoHs, HoAs, and HoHs, and how they work together.. Possibly use the Schwartzian Transform to illustrate the ability to use those kinds of data structures to perform complex sorting in an elegant way...
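For reference, the Schwartzian Transform in its usual map-sort-map shape (the filenames here are made up): compute each expensive sort key once, sort on the cached key, then strip the key back off.

```perl
use strict;
use warnings;

my @files = ("report_2003_03.txt", "report_2002_12.txt", "report_2003_01.txt");

my @sorted =
    map  { $_->[0] }                         # 3. recover the original value
    sort { $a->[1] <=> $b->[1] }             # 2. sort on the cached key
    map  { my ($y, $m) = /(\d{4})_(\d{2})/;
           [ $_, $y * 100 + $m ] }           # 1. pair value with its key
    @files;

print "$sorted[0]\n";   # report_2002_12.txt
```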
Off the top of my head, a decent example would be a hash of arrays which contains the month names, their numerical values, and how many days are in them:
%months = (
Jan => [1, '31'], Feb => [2, '' ], Mar => [3, '31'],
Apr => [4, '30'], May => [5, '31'], Jun => [6, '30'],
Jul => [7, '31'], Aug => [8, '31'], Sep => [9, '30'],
Oct => [10, '31'], Nov => [11, '30'], Dec => [12, '31']
);
# Get number of days in Jan
$days = $months{Jan}->[1];
It could also be written as a hash of hashes, but I think that shows what I mean.. I also avoided adding the logic to determine if it's a leap year, and appropriately set $months{Feb}[1], as I thoroughly *hate* that aspect of our calendar system..
Anyway, what I was trying to say is we now have all the info we need about the months of the year.. a third element could be the prior month, and a fourth element could be the next month.. you could extract the months in order or reversed via
# in numerical order
for my $m ( sort { $months{$a}[0] <=> $months{$b}[0] } keys %months ) {
# or reversed
for my $m ( sort { $months{$b}[0] <=> $months{$a}[0] } keys %months ) {
The possibilities are endless.. I see a lot of people who don't fully utilize data structures to represent their data and how it relates to itself.. Maybe they have a bunch of arrays and loop through one while pulling data from another (nothing wrong with this approach), as opposed to slapping the data into a single array of arrays.
Also maybe touch upon the speed factors of using refs as opposed to passing stuff around by value, and the tradeoffs of arrays vs. hashes.. If they are doing 5 extra calculations attempting to figure out which array index to get, maybe they should be using hashes, especially if it's in a tight loop etc (trivial off-the-top-of-my-head example)..
/* And the Creator, against his better judgement, wrote man.c */
my @month_names = qw( undef jan feb mar apr may jun jul aug sept oct nov dec );
my @month_days  = ( undef, 31, feb_days(), 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 );
my $months = {
    name   => { map { $_, $month_names[$_] } 1..12 },
    number => { map { $month_names[$_], $_ } 1..12 },
    days   => { map { $_, $month_days[$_], $month_names[$_], $month_days[$_] } 1..12 },
};
print "Month number 4 is ", $months->{name}->{4}, "\n";
print "Month mar is number ", $months->{number}->{mar}, "\n";
print "Days in month feb ", $months->{days}->{feb}, "\n";
print "Days in month 4 ", $months->{days}->{4}, "\n";
use Data::Dumper;
print Dumper $months;
sub feb_days { check_leap_year( get_year() ) ? 29 : 28 }
sub get_year { (localtime())[5] + 1900 }
sub check_leap_year {
my $year = shift;
my $leap_year = 0;
$leap_year = 1 if $year % 4 == 0;
$leap_year = 0 if $year % 100 == 0 and $year % 400 != 0;
return $leap_year;
}
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Bottom-Up Data Mining with Perl
by jryan (Vicar) on Mar 06, 2003 at 01:10 UTC
One good thing you might want to mention is the usefulness of grammars. Modules like Parse::RecDescent are enormously useful for parsing highly complex data. You could even talk about them in contrast to "The Burrito Principle" (which is the greatest buzzword ever!), since grammars are usually written from a top-down approach. However, even with grammars, you could still apply the Burrito Principle: it is usually best to work on the most complicated/important subrule within a set of subrules first, as others in the set will usually follow easily.
Update: Cleared up my thoughts.
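To make the top-down idea concrete without requiring Parse::RecDescent itself, here is a toy hand-rolled recursive-descent parser (the grammar and input are made up): each grammar rule becomes a subroutine that consumes input, which is exactly the shape a grammar module automates for you.

```perl
use strict;
use warnings;

# Grammar:
#   expr : term ( '+' term )*
#   term : /\d+/
my $input;

sub expr {
    my $val = term();
    while ($input =~ s/^\+//) {   # consume a '+', then another term
        $val += term();
    }
    return $val;
}

sub term {
    $input =~ s/^(\d+)// or die "expected a number at '$input'";
    return $1;
}

$input = "1+20+300";
my $result = expr();
print "$result\n";   # 321
```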
Re: Bottom-Up Data Mining with Perl
by derby (Abbot) on Mar 05, 2003 at 18:29 UTC
++ just for the "Burrito Principle" - you'll live forever in my 100% buzzword compatible life.
-derby
Inline::C
by zby (Vicar) on Mar 06, 2003 at 09:05 UTC
For huge datasets there should be some place to use Inline::C. At this node you can find an example comparing the speed of a few algorithms in Perl and one in Inline::C. | [reply] |
Re: Bottom-Up Data Mining with Perl
by jdporter (Paladin) on Mar 06, 2003 at 06:12 UTC
I agree with allolex. My wife has done some amazing data-mining for job-boards when I was searching for a job. (Of course, this was all done before she learned how to cut'n'paste, but she's a smart one! *grins*)
Re: Bottom-Up Data Mining with Perl
by rje (Deacon) on Mar 13, 2003 at 06:14 UTC