in reply to Finding Minimum Value

Your script gets much, much simpler if you put your data in a database and then use standard SQL to retrieve it. When working with data and manipulating the data in various ways, always think "database"!

For example:

use Modern::Perl '2014';
use DBI;

my $dbfile = 'c:/data/school.sqlite';
my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 } );

my $sth = $dbh->prepare('INSERT INTO students (id, name) VALUES (?, ?)');
while (<DATA>) {
    chomp;
    last if /grades.txt/;
    next unless $_;
    next if /^\*\*\*/;
    say $_;
    my ( $id, $name ) = split /:/;
    $sth->execute( $id, $name );
}

$sth = $dbh->prepare('INSERT INTO results (id, course, score) VALUES (?, ?, ?)');
while (<DATA>) {
    chomp;
    next unless $_;
    next if /^\*\*\*/;
    my ( $id, $course, $score ) = split /\s+/;
    $sth->execute( $id, $course, $score );
}

my $ary_ref = $dbh->selectall_arrayref(
    'SELECT name, sum(score), avg(score), min(score), max(score)
     FROM results JOIN students WHERE students.id = results.id
     GROUP BY results.id ORDER BY name'
);
printf "%20s %10s %8s %8s %8s\n", qw/Name Total Average Min Max/;
for my $line (@$ary_ref) {
    printf "%20s %10d %8d %8d %8d\n", @$line;
}

$ary_ref = $dbh->selectall_arrayref(
    'SELECT course, avg(score) FROM results GROUP BY course ORDER BY course');
printf "\n%10s %8s\n", qw/Course Average/;
for my $line (@$ary_ref) {
    printf "%10d %8d\n", @$line;
}

__DATA__
***students.txt***
122334:James,Lebron:
222223:Duncan,Tim:
244668:Bryant,Kobe:
355779:Durant,Kevin:
******************
***grades.txt***
122334 1 98
222223 1 86
...(snip)...
244668 5 95
355779 5 94
****************
Output:
                Name      Total  Average      Min      Max
         Bryant,Kobe        466       93       89       95
          Duncan,Tim        462       92       86       96
        Durant,Kevin        468       93       90       96
        James,Lebron        485       97       96       98

    Course  Average
         1       90
         2       93
         3       95
         4       95
         5       95
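Note that the script assumes school.sqlite and its two tables already exist; the CREATE statements are not part of the example. A minimal setup sketch, with column types inferred from the INSERTs and the report queries (an assumption on my part), could look like this:

use Modern::Perl '2014';
use DBI;

# Hypothetical one-off setup: create the two tables the example expects.
# Column types are assumptions inferred from the INSERTs and the queries.
my $dbfile = 'c:/data/school.sqlite';
my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS students (id INTEGER PRIMARY KEY, name TEXT)');
$dbh->do('CREATE TABLE IF NOT EXISTS results  (id INTEGER, course INTEGER, score INTEGER)');

$dbh->disconnect;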

Retrieving the data takes just one line of Perl code, and since the data is saved in the database you can use it again and again without incurring the cost of parsing the input file(s).

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Re^2: Finding Minimum Value
by Laurent_R (Canon) on Sep 22, 2014 at 21:46 UTC
    When working with data and manipulating the data in various ways, always think "database"!

    I am sorry but I have to object very strongly to this. Especially with the word "always".

    Databases are very good for two things: data persistence and the ability to manage data volumes too big for the computer's memory. There is also the fact that, if your database is SQL-based, the SQL language is a very high-level and practical language that hides many implementation details.

    But databases also have a lot of limitations. First, they are horribly slow compared to hashes in memory. The languages used to manipulate them, such as PL/SQL, are often also horribly slow. Of course, this probably does not matter if you have just tens of thousands of records, but when you get to millions or tens of millions of records, the difference is huge.
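    To make that claim concrete, here is a small sketch of how one might measure the difference with Benchmark.pm; the table, row counts and keys are invented for illustration, and an in-memory SQLite database is used so that disk I/O does not dominate:

    use Modern::Perl '2014';
    use Benchmark qw(cmpthese);
    use DBI;

    # Invented data set: 100_000 id => score pairs, kept both in a hash
    # and in an in-memory SQLite table.
    my %scores = map { $_ => $_ % 100 } 1 .. 100_000;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 } );
    $dbh->do('CREATE TABLE scores (id INTEGER PRIMARY KEY, score INTEGER)');
    my $ins = $dbh->prepare('INSERT INTO scores (id, score) VALUES (?, ?)');
    $dbh->begin_work;
    $ins->execute( $_, $scores{$_} ) for keys %scores;
    $dbh->commit;

    my $sel  = $dbh->prepare('SELECT score FROM scores WHERE id = ?');
    my @keys = map { 1 + int rand 100_000 } 1 .. 10_000;

    # Sum the scores of 10_000 random ids, once via the hash, once via the database.
    cmpthese( 20, {
        hash   => sub { my $t = 0; $t += $scores{$_} for @keys; },
        sqlite => sub {
            my $t = 0;
            for my $k (@keys) {
                $sel->execute($k);
                $t += ( $sel->fetchrow_array )[0];
            }
        },
    } );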

    So, if you don't need to store data in a persistent fashion, you probably shouldn't use a database, or at least think twice before you do.

    About a year and a half ago, I was asked to try to improve the performance of a very complicated extraction process on a database. An initial timing test projected an execution time of 160 days. After some profiling and benchmarking work, I was able to reduce it to about 60 days, 59.5 of which were spent in a very complicated trans-codification process. Not too bad, but still obviously a no-go. I moved to an extract of raw data files and a reprocessing of the flat files in pure Perl. The overall extraction time fell to about 12 or 13 hours, and the trans-codification part, using half a dozen Perl hashes, fell from 59.5 days to about an hour, i.e. an improvement by a factor of about 1,400.
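    For illustration, a trans-codification step of that kind can be sketched in a few lines of Perl; the file names, field positions and separator below are invented, not the actual process:

    use Modern::Perl '2014';

    # Invented mapping file: one "old_code;new_code" pair per line.
    my %map;
    open my $map_fh, '<', 'codes.map' or die "codes.map: $!";
    while (<$map_fh>) {
        chomp;
        my ( $old, $new ) = split /;/;
        $map{$old} = $new;
    }
    close $map_fh;

    # Stream the extract and rewrite the (invented) third field; one hash
    # lookup per record replaces a lookup-table join in the database.
    open my $in, '<', 'extract.csv' or die "extract.csv: $!";
    while (<$in>) {
        chomp;
        my @fields = split /;/;
        $fields[2] = $map{ $fields[2] } // $fields[2];    # keep unknown codes as-is
        say join ';', @fields;
    }
    close $in;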

    No, it is a bit more complicated than that. Databases are very useful, there is no doubt about it, but they are certainly not the solution to everything, far from it, especially when performance is important.

      I said "think database", not that you always have to use one or that it is always the best option.

      However, in actual real-world cases there is usually a need to persist the data: there is more data than you can keep in the __DATA__ section of your script, or the data changes from time to time. So it pays to "think database" in many, if not most, day-to-day jobs.

      I am not surprised that Perl outruns any database in doing a complicated trans-codification process. That is one of Perl's main strengths!

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        I might have over-reacted, but, true, you said "think database", not "use a database". Sorry if I misinterpreted what you said.

        My work is to guarantee data consistency between several hundred applications of the Information System of a Tier One telecommunications operator (cell phones, landlines, Internet, TV, VOIP, VOD, etc.). The various applications run on MVS, AS/400, AIX, HP-UX, Solaris, VMS, at least half a dozen distributions of Linux, etc. The database systems used are perhaps even more varied.

        Now, how do you compare data from a DB2 database running on MVS with an Oracle database running on AIX or an RMS database running on VMS? As you might imagine, this is not straightforward. In most cases, we just extract flat CSV data files from the various applications and compare those files. There are a few cases where we load the data from two applications into the same database (in different tables), say Oracle, and compare the tables using SQL queries or some other means. But that's usually when we also need to run manual SQL queries to investigate further.

        In most cases, though, this is too inefficient (we have about 35 million customers and the extraction files commonly run to tens of gigabytes, sometimes several hundred GB). Most of the time it is much, much faster to simply sort the data from both applications according to the same sorting key (using, for example, the Unix sort utility) and to compare them line by line with Perl to detect duplicates, orphans (that's what we call a piece of data present in one application and not in the other) and discrepancies in the details. We've done a lot of benchmarking, and in most cases the complete comparison method just described takes less time than merely loading the data into a database and constructing the indexes, before any comparison has even started.
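        To give an idea of what that looks like, here is a stripped-down sketch of the compare step; the file names, the single-character separator and the one-field key are invented, and both extracts are assumed to have been pre-sorted on the key (for example with the Unix sort utility):

        use Modern::Perl '2014';

        # Walk two pre-sorted "key;details" files in parallel and report
        # orphans (keys present on one side only) and discrepancies.
        open my $a_fh, '<', 'app_a.sorted.csv' or die "app_a.sorted.csv: $!";
        open my $b_fh, '<', 'app_b.sorted.csv' or die "app_b.sorted.csv: $!";

        my $a = <$a_fh>;
        my $b = <$b_fh>;
        while ( defined $a or defined $b ) {
            my ( $ka, $va ) = defined $a ? split( /;/, $a, 2 ) : ();
            my ( $kb, $vb ) = defined $b ? split( /;/, $b, 2 ) : ();

            if ( !defined $b or ( defined $a and $ka lt $kb ) ) {
                say "orphan in A: $ka";     # key only present in application A
                $a = <$a_fh>;
            }
            elsif ( !defined $a or $kb lt $ka ) {
                say "orphan in B: $kb";     # key only present in application B
                $b = <$b_fh>;
            }
            else {                          # same key on both sides: compare details
                chomp( $va, $vb );
                say "discrepancy for key $ka" if $va ne $vb;
                $a = <$a_fh>;
                $b = <$b_fh>;
            }
        }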

        I explained all this just because I wanted to let you know why I sometimes react strongly when someone suggests using a database for such one-off processes. Sure, it is often easier when the data is small or medium-sized, but for really large data it simply scales much worse than the sort-and-compare strategy.

        Now, of course, I admit that our activity might not be typical, especially because it does not involve persistent data. We compare different databases at a given point in time once a week, or once a month, or on some other schedule, but we don't need to keep data from one run to the next (well, that is something of an oversimplification, because when we do a comparison we usually compare its results with those of the previous identical comparison, in order to check that the corrections we submitted previously were duly applied or, if they were not, to investigate why).

        Again, sorry if I overreacted; I hope my description of my job explains why.