Re: Bottom-Up Data Mining with Perl
by dragonchild (Archbishop) on Mar 05, 2003 at 18:42 UTC
Right off the top of my head, some things to include would be:
- The flip-flop operator
- The ideas of $\, $", and others.
- chomp vs. chop and where each is good
- Templates. They're not just for HTML! (Useful for reading as well as writing.)
- Data-driven parsing.
- Functional parsing (different from data-driven). tilly wrote something very cool on this topic regarding HTML-like parsing with functional programming.
- When to use a regex vs split vs unpack.
- How to use unpack! (I still don't get how to use it ...)
- The Burrito principle. (Very cool!)
Post your paper on PM when it's done. I would love to read it!
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement. Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Could you expand on how split, pack, unpack and regexes are related? I feel there's something to what you say, but I can't at all pin it down.
split, unpack, and regexes are all ways to parse a given line of data. Each is useful in different circumstances. For example:
- split is more useful with delimited lines, such as tab-delimited or comma-delimited. (However, using a module like Text::CSV is better for delimited text. This is because of lines like "abcd,'Smith, John', blah" - the comma in the quotes is part of the item, not a delimiter.) Now, one could use a regex here, but the regex is harder to understand, and even harder to get right.
my @items = split /\Q$delim\E/, $line;

# vs. a hand-rolled regex (which, as noted, will still make
# mistakes on edge cases like quoted delimiters):
my @items = $line =~ /([^$delim]+)/g;
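As a self-contained sketch of the quoted-delimiter problem: the core module Text::ParseWords handles the "'Smith, John'" case from the example above (Text::CSV, from CPAN, is the more robust choice for real CSV work):

```perl
use Text::ParseWords qw(parse_line);

my $line = q{abcd,'Smith, John',blah};

# Naive split breaks the quoted field in two:
my @naive = split /,/, $line;            # 4 fields: the name is cut apart

# parse_line understands quoting, so the comma inside the
# quotes stays part of the field:
my @fields = parse_line(',', 0, $line);  # ('abcd', 'Smith, John', 'blah')
```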
- unpack (if you understand how to use it!) is really good with data that is formatted, like so many columns is the first thing, so many the second, etc. This is often data from a mainframe.
Again, you can use a regex here, but you have to work at it to keep it maintainable. (I'd put an unpack example here if I were comfortable with how to use it.)
my @columns = (20, 10, 25, 5, 2, 2, 20);
# join is needed here - a bare map assigned to a scalar would
# give the number of columns, not the pattern:
my $regex = join '', map { "(.{$_})" } @columns;
$regex = qr/^$regex$/;
my @items = $line =~ $regex;
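To fill the gap admitted above, here's a minimal unpack sketch for fixed-width records. The column widths and sample record are invented for illustration; in an unpack template, 'A20' means an ASCII field 20 characters wide with trailing whitespace stripped:

```perl
# Build a fixed-width record: 20-char name, 10-char id, 5-char amount.
my $line = sprintf "%-20s%-10s%-5s", 'Smith, John', 'X1234', '42';

# unpack slices it back apart - no regex, no arithmetic:
my @items = unpack 'A20 A10 A5', $line;
# @items is ('Smith, John', 'X1234', '42')
```

The template is the whole parser: change the widths in one place and the code still reads as a column layout, which is exactly why it beats a hand-rolled regex for mainframe-style data.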
For every example I give on different parsing needs, there is a module on CPAN that does it better, faster, and safer. I personally would never hand-parse data in production. Heck, you can use CGI to parse form data without even having an HTTP server!
Re: Bottom-Up Data Mining with Perl
by gmax (Abbot) on Mar 05, 2003 at 18:55 UTC
Here's some food for thought.
Check out this article and the related meditation on this site.
My personal advice is related to database programming.
You should explain when to use perl and when to delegate the task to a database engine. For example, comparing large lists of records and computing statistics from them is a task for a DBMS rather than a candidate for a perl script. Of course, in that case you should introduce the beauty of the DBI.
The opposite case is also interesting: people sometimes try to do in SQL things that are better left to perl. For example, parsing and cleaning data before storing it in a database is a perfect job for perl, which is the best companion tool for any database administrator.
_ _ _ _
(_|| | |(_|><
_|
Re: Bottom-Up Data Mining with Perl
by l2kashe (Deacon) on Mar 05, 2003 at 19:42 UTC
A lot of data processing is getting your data into the right structures, ones which align with your view of it.. Or rather, I would touch upon uses for arrays, hashes, AoAs, AoHs, HoAs, and HoHs, and how they work together.. Possibly use the Schwartzian Transform to illustrate how those kinds of data structures let you perform complex sorting in an elegant way...
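The Schwartzian Transform mentioned above fits in a few lines. A minimal sketch (the file list is made up for illustration): sort filenames by extension, computing each sort key once instead of once per comparison:

```perl
my @files = qw( notes.txt run.pl data.csv );

my @sorted =
    map  { $_->[1] }                    # 3. strip the cached key back off
    sort { $a->[0] cmp $b->[0] }        # 2. compare only the cached keys
    map  { [ (split /\./)[-1], $_ ] }   # 1. pair each file with its extension
    @files;
# @sorted is ('data.csv', 'run.pl', 'notes.txt')
```

The middle sort never touches the original strings; it only sees the little [key, value] arrays, which is the whole trick.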
Off the top of my head, a decent example would be a hash of arrays which contains the month names, their numerical values, and how many days are in each, a la
%months = (
    Jan => [ 1, 31], Feb => [ 2, ''], Mar => [ 3, 31],
    Apr => [ 4, 30], May => [ 5, 31], Jun => [ 6, 30],
    Jul => [ 7, 31], Aug => [ 8, 31], Sep => [ 9, 30],
    Oct => [10, 31], Nov => [11, 30], Dec => [12, 31],
);
# Get number of days in Jan
$days = $months{Jan}[1];
It could also be written as a hash of hashes, but I think that shows what I mean.. I also avoided adding the logic to determine if it's a leap year and appropriately set $months{Feb}[1], as I thoroughly *hate* that aspect of our calendar system..
Anyway, what I was trying to say is we now have all the info we need about the months of the year.. a third element could be the previous month, and a fourth element could be the next month.. you could extract the months in order or reversed via
# in numerical order - note that $a and $b are the keys (month
# names), so we look the arrays up in %months:
for ( sort { $months{$a}[0] <=> $months{$b}[0] } keys %months ) {
# or reversed
for ( sort { $months{$b}[0] <=> $months{$a}[0] } keys %months ) {
The possibilities are endless.. I see a lot of people who don't fully utilize data structures to represent their data and how it relates to itself.. Maybe they have a bunch of arrays and loop through one while pulling data from another (nothing wrong with this approach), as opposed to slapping the data into a single array of arrays.
Also maybe touch upon the speed factors of using refs as opposed to passing stuff around by value, and the tradeoffs of arrays vs hashes.. If they are doing 5 extra calculations trying to figure out which array index to use, maybe they should be using hashes, especially if it's in a tight loop, etc. (trivial off-the-top-of-my-head example)..
/* And the Creator, against his better judgement, wrote man.c */
|
# undef must sit outside the qw() - inside it, "undef," would be
# taken as a literal string, shifting every name off by one:
my @month_names = (undef, qw( jan feb mar apr may jun jul aug sept oct nov dec ));
my @month_days  = (undef, 31, feb_days(), 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
my $months = {
    name   => { map { $_, $month_names[$_] } 1..12 },
    number => { map { $month_names[$_], $_ } 1..12 },
    days   => { map { $_, $month_days[$_], $month_names[$_], $month_days[$_] } 1..12 },
};
print "Month number 4 is ", $months->{name}->{4}, "\n";
print "Month mar is number ", $months->{number}->{mar}, "\n";
print "Days in month feb ", $months->{days}->{feb}, "\n";
print "Days in month 4 ", $months->{days}->{4}, "\n";
use Data::Dumper;
print Dumper $months;
sub feb_days { check_leap_year( get_year() ) ? 29 : 28 }
sub get_year { (localtime())[5] + 1900 }
sub check_leap_year {
my $year = shift;
my $leap_year = 0;
$leap_year = 1 if $year % 4 == 0;
$leap_year = 0 if $year % 100 == 0 and $year % 400 != 0;
return $leap_year;
}
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Bottom-Up Data Mining with Perl
by jryan (Vicar) on Mar 06, 2003 at 01:10 UTC
One good thing you might want to mention is the usefulness of grammars. Modules like Parse::RecDescent are enormously useful for parsing highly complex data. You could even discuss them in contrast to "The Burrito Principle" (which is the greatest buzzword ever!), since grammars are usually written from a top-down approach. Even with grammars, though, you can still apply the Burrito Principle: it is usually best to work on the most complicated/important subrule within a set of subrules first, as the others in the set will usually follow easily.
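As a small sketch of the top-down flavor (using core Perl's recursive regexes, available since 5.10, rather than Parse::RecDescent itself, so it runs anywhere): a single named rule can refer to itself, much like a grammar rule would. The nested-parentheses "grammar" below is invented for illustration:

```perl
# A top-down rule for balanced, nested parenthesized lists.
# (?<list> ... ) names the rule; (?&list) recurses into it.
my $list = qr/
    (?<list>
        \(                         # opening paren
        (?: [^()]+ | (?&list) )*   # atoms, or a nested list
        \)                         # closing paren
    )
/x;

my $ok = '(a (b c) d)' =~ /\A$list\z/ ? 1 : 0;   # balanced: matches
my $bad = '(a (b'      =~ /\A$list\z/ ? 1 : 0;   # unbalanced: fails
```

A real Parse::RecDescent grammar would spell the same structure out as named productions, with the added benefit of actions attached to each rule.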
Update: Cleared up my thoughts.
Re: Bottom-Up Data Mining with Perl
by derby (Abbot) on Mar 05, 2003 at 18:29 UTC
++ just for the "Burrito Principle" - you'll live forever in my 100% buzzword compatible life.
-derby
Inline::C
by zby (Vicar) on Mar 06, 2003 at 09:05 UTC
For huge datasets there should be some place to use Inline::C. At this node you can find an example comparing the speed of a few algorithms in Perl and one in Inline::C.
Re: Bottom-Up Data Mining with Perl
by jdporter (Paladin) on Mar 06, 2003 at 06:12 UTC
I agree with allolex. My wife did some amazing data mining on job boards when I was searching for a job. (Of course, this was all done before she learned how to cut'n'paste, but she's a smart one! *grins*)
Re: Bottom-Up Data Mining with Perl
by rje (Deacon) on Mar 13, 2003 at 06:14 UTC