Re: Bottom-Up Data Mining with Perl
by dragonchild (Archbishop) on Mar 05, 2003 at 18:42 UTC
Right off the top of my head, some things to include would be:
- The flip-flop operator
- The ideas of $\, $", and others.
- chomp vs. chop and where each is good
- Templates. They're not just for HTML! (Useful for reading as well as writing.)
- Data-driven parsing.
- Functional parsing (different from data-driven). tilly wrote something very cool on this topic regarding HTML-like parsing with functional programming.
- When to use a regex vs split vs unpack.
- How to use unpack! (I still don't get how to use it ...)
- The Burrito principle. (Very cool!)
Post your paper on PM when it's done. I would love to read it!
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement. Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Could you expand on how split, pack, unpack and regexes are related? I feel there's something to what you say, but I can't at all pin it down.
split, unpack, and regexes are all ways to parse a given line of data. Each is useful in different circumstances. For example:
- split is more useful with delimited lines, such as tab-delimited or comma-delimited. (However, using a module like Text::CSV is better for delimited text. This is because of lines like "abcd,'Smith, John', blah" - the comma in the quotes is part of the item, not a delimiter.) Now, one could use a regex here, but the regex is harder to understand, and even harder to get right.
my @items = split /\Q$delim\E/, $line;

# vs. a hand-rolled regex (which, as noted, will still make
# mistakes on edge cases like quoted delimiters):
my @items = $line =~ /([^$delim]+)/g;
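As a self-contained sketch of the quoted-delimiter problem: the core module Text::ParseWords handles the "'Smith, John'" case from the example above (Text::CSV, from CPAN, is the more robust choice for real CSV work):

```perl
use Text::ParseWords qw(parse_line);

my $line = q{abcd,'Smith, John',blah};

# Naive split breaks the quoted field in two:
my @naive = split /,/, $line;            # 4 fields: the name is cut apart

# parse_line understands quoting, so the comma inside the
# quotes stays part of the field:
my @fields = parse_line(',', 0, $line);  # ('abcd', 'Smith, John', 'blah')
```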
- unpack (if you understand how to use it!) is really good with data that is formatted, like so many columns is the first thing, so many the second, etc. This is often data from a mainframe.
Again, you can use a regex here, but you have to work at it to keep it maintainable. (I'd put an unpack example here if I were comfortable with how to use it.)
my @columns = (20, 10, 25, 5, 2, 2, 20);
# join is needed here - a bare map assigned to a scalar would
# give the number of columns, not the pattern:
my $regex = join '', map { "(.{$_})" } @columns;
$regex = qr/^$regex$/;
my @items = $line =~ $regex;
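To fill the gap admitted above, here's a minimal unpack sketch for fixed-width records. The column widths and sample record are invented for illustration; in an unpack template, 'A20' means an ASCII field 20 characters wide with trailing whitespace stripped:

```perl
# Build a fixed-width record: 20-char name, 10-char id, 5-char amount.
my $line = sprintf "%-20s%-10s%-5s", 'Smith, John', 'X1234', '42';

# unpack slices it back apart - no regex, no arithmetic:
my @items = unpack 'A20 A10 A5', $line;
# @items is ('Smith, John', 'X1234', '42')
```

The template is the whole parser: change the widths in one place and the code still reads as a column layout, which is exactly why it beats a hand-rolled regex for mainframe-style data.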
For every example I give on different parsing needs, there is a module on CPAN that does it better, faster, and safer. I personally would never hand-parse data in production. Heck, you can use CGI to parse form data without even having an HTTP server!
Re: Bottom-Up Data Mining with Perl
by gmax (Abbot) on Mar 05, 2003 at 18:55 UTC
Here's some food for thought.
Check out this article and the related meditation on this site.
My personal advice is related to database programming.
You should explain when to use perl and when to delegate the task to a database engine. For example, comparing large lists of records and computing statistics from them is a task for a DBMS rather than a candidate for a perl script. Of course, in that case you should introduce the beauty of the DBI.
The opposite case is also interesting: people sometimes try to do in SQL things that are better left to perl. For example, parsing and cleaning data before storing it in a database is a perfect job for perl, which is the best companion tool for any database administrator.
_ _ _ _
(_|| | |(_|><
_|
Re: Bottom-Up Data Mining with Perl
by l2kashe (Deacon) on Mar 05, 2003 at 19:42 UTC
A lot of data processing is getting your data into the right structures, ones which align with your view of it.. Or rather, I would touch upon uses for arrays, hashes, AoAs, AoHs, HoAs, and HoHs, and how they work together.. Possibly use the Schwartzian Transform to illustrate how those kinds of data structures let you perform complex sorting in an elegant way...
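The Schwartzian Transform mentioned above fits in a few lines. A minimal sketch (the file list is made up for illustration): sort filenames by extension, computing each sort key once instead of once per comparison:

```perl
my @files = qw( notes.txt run.pl data.csv );

my @sorted =
    map  { $_->[1] }                    # 3. strip the cached key back off
    sort { $a->[0] cmp $b->[0] }        # 2. compare only the cached keys
    map  { [ (split /\./)[-1], $_ ] }   # 1. pair each file with its extension
    @files;
# @sorted is ('data.csv', 'run.pl', 'notes.txt')
```

The middle sort never touches the original strings; it only sees the little [key, value] arrays, which is the whole trick.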
Off the top of my head, a decent example would be a hash of arrays which contains the month names, their numerical values, and how many days are in each, a la
%months = (
    Jan => [ 1, 31], Feb => [ 2, ''], Mar => [ 3, 31],
    Apr => [ 4, 30], May => [ 5, 31], Jun => [ 6, 30],
    Jul => [ 7, 31], Aug => [ 8, 31], Sep => [ 9, 30],
    Oct => [10, 31], Nov => [11, 30], Dec => [12, 31],
);
# Get number of days in Jan
$days = $months{Jan}[1];
It could also be written as a hash of hashes, but I think that shows what I mean.. I also avoided adding the logic to determine if it's a leap year and appropriately set $months{Feb}[1], as I thoroughly *hate* that aspect of our calendar system..
Anyway, what I was trying to say is we now have all the info we need about the months of the year.. a third element could be the previous month, and a fourth element could be the next month.. you could extract the months in order or reversed via
# in numerical order - note that $a and $b are the keys (month
# names), so we look the arrays up in %months:
for ( sort { $months{$a}[0] <=> $months{$b}[0] } keys %months ) {
# or reversed
for ( sort { $months{$b}[0] <=> $months{$a}[0] } keys %months ) {
The possibilities are endless.. I see a lot of people who don't fully utilize data structures to represent their data and how it relates to itself.. Maybe they have a bunch of arrays and loop through one while pulling data from another (nothing wrong with this approach), as opposed to slapping the data into a single array of arrays.
Also maybe touch upon the speed factors of using refs as opposed to passing stuff around by value, and the tradeoffs of arrays vs hashes.. If they are doing 5 extra calculations trying to figure out which array index to use, maybe they should be using hashes, especially if it's in a tight loop, etc. (trivial off-the-top-of-my-head example)..
/* And the Creator, against his better judgement, wrote man.c */
|
# undef must sit outside the qw() - inside it, "undef," would be
# taken as a literal string, shifting every name off by one:
my @month_names = (undef, qw( jan feb mar apr may jun jul aug sept oct nov dec ));
my @month_days  = (undef, 31, feb_days(), 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
my $months = {
    name   => { map { $_, $month_names[$_] } 1..12 },
    number => { map { $month_names[$_], $_ } 1..12 },
    days   => { map { $_, $month_days[$_], $month_names[$_], $month_days[$_] } 1..12 },
};
print "Month number 4 is ", $months->{name}->{4}, "\n";
print "Month mar is number ", $months->{number}->{mar}, "\n";
print "Days in month feb ", $months->{days}->{feb}, "\n";
print "Days in month 4 ", $months->{days}->{4}, "\n";
use Data::Dumper;
print Dumper $months;
sub feb_days { check_leap_year( get_year() ) ? 29 : 28 }
sub get_year { (localtime())[5] + 1900 }
sub check_leap_year {
my $year = shift;
my $leap_year = 0;
$leap_year = 1 if $year % 4 == 0;
$leap_year = 0 if $year % 100 == 0 and $year % 400 != 0;
return $leap_year;
}
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Bottom-Up Data Mining with Perl
by jryan (Vicar) on Mar 06, 2003 at 01:10 UTC
One good thing you might want to mention is the usefulness of grammars. Modules like Parse::RecDescent are enormously useful for parsing highly complex data. You could even discuss them in contrast to "The Burrito Principle" (which is the greatest buzzword ever!), since grammars are usually written from a top-down approach. Even with grammars, though, you can still apply the Burrito Principle: it is usually best to work on the most complicated/important subrule within a set of subrules first, as the others in the set will usually follow easily.
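As a small sketch of the top-down flavor (using core Perl's recursive regexes, available since 5.10, rather than Parse::RecDescent itself, so it runs anywhere): a single named rule can refer to itself, much like a grammar rule would. The nested-parentheses "grammar" below is invented for illustration:

```perl
# A top-down rule for balanced, nested parenthesized lists.
# (?<list> ... ) names the rule; (?&list) recurses into it.
my $list = qr/
    (?<list>
        \(                         # opening paren
        (?: [^()]+ | (?&list) )*   # atoms, or a nested list
        \)                         # closing paren
    )
/x;

my $ok = '(a (b c) d)' =~ /\A$list\z/ ? 1 : 0;   # balanced: matches
my $bad = '(a (b'      =~ /\A$list\z/ ? 1 : 0;   # unbalanced: fails
```

A real Parse::RecDescent grammar would spell the same structure out as named productions, with the added benefit of actions attached to each rule.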
Update: Cleared up my thoughts.
Re: Bottom-Up Data Mining with Perl
by derby (Abbot) on Mar 05, 2003 at 18:29 UTC
++ just for the "Burrito Principle" - you'll live forever in my 100% buzzword compatible life.
-derby
Inline::C
by zby (Vicar) on Mar 06, 2003 at 09:05 UTC
For huge datasets there should be some place to use Inline::C. At this node you can find an example comparing the speed of a few algorithms in Perl and one in Inline::C.
Re: Bottom-Up Data Mining with Perl
by jdporter (Paladin) on Mar 06, 2003 at 06:12 UTC
I agree with allolex. My wife did some amazing data mining on job boards when I was searching for a job. (Of course, this was all done before she learned how to cut'n'paste, but she's a smart one! *grins*)
Re: Bottom-Up Data Mining with Perl
by rje (Deacon) on Mar 13, 2003 at 06:14 UTC