toughhou has asked for the wisdom of the Perl Monks concerning the following question:

I'm doing data warehouse related work. I have lots of files (about 1 GB each) that are extracted from Oracle and need to be processed. The host environment lets me use either Java or Perl, but I'm not sure which language would perform better for processing such big text files. Can anyone give me some suggestions? Thanks a lot.

Replies are listed 'Best First'.
Re: to process big text file, use java or perl?
by AppleFritter (Vicar) on Jul 11, 2014 at 17:37 UTC

    Both Java and Perl can perform well. That said, a lot depends on what sort of processing you'd need to do; in particular, whether your workload's CPU-bound or IO-bound.

    Personally I'd recommend Perl; it's flexible, it'll allow for a rapid development cycle, and text processing is one of its major strengths. (That said, I'm hardly neutral, and this is PerlMonks, after all.)

    If performance is so critical, it might also pay to measure rather than guess: give both languages a try, and see which performs better.

    Also, the search box on this site is your friend! Give it a try, you'll find nodes like Perl and Java comparison and others that may be helpful. :)

Re: to process big text file, use java or perl?
by Your Mother (Archbishop) on Jul 11, 2014 at 21:02 UTC

    Out of waffles. For such a loose specification: use Perl. IO bound stuff is unlikely to see much difference in performance unless you write the code badly in one or the other. Perl’s regexes are easier, deeper, and probably faster (depends on which perl 5.#.# you’ve got) if you write them well.

    If you want a more serious answer you should ask the question with sample data, code, and more information about the actual (pre|post) processing problems involved.

Re: to process big text file, use java or perl?
by locked_user sundialsvc4 (Abbot) on Jul 11, 2014 at 17:44 UTC

    Well, first of all, “1 gigabyte,” by today’s standards, is not particularly “big.”   My laptop can slurp 16 times that amount of data into its RAM.

    But, secondly, it really all depends on how you go about it.   For instance, you probably don’t want to, and certainly don’t need to, pull all of that data into memory at once.   You probably want to process it a line at a time.   And even if the structure of a particular file is not such that you can break it sensibly into “lines,” it most certainly possesses some kind of internal structure that will enable you to read selected portions of it into memory.   Therefore, files of arbitrary size can be processed in roughly constant memory, and for that purpose it does not matter in the slightest which programming language you use.
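
    As a rough illustration of the line-at-a-time approach described above, here is a minimal Java sketch. The class name LineStreamer, the method countMatching, the "count non-empty lines" rule, and the UTF-8 encoding are all assumptions made for the example; none of them come from the original question, and the real per-line work would replace the predicate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Predicate;

// Hypothetical helper: counts the lines that satisfy a caller-supplied test
// while holding only one line in memory at a time.
class LineStreamer {

    static long countMatching(BufferedReader reader, Predicate<String> keep) {
        // lines() is lazy: each line is read, tested, and then discarded.
        return reader.lines().filter(keep).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LineStreamer <file>");
            return;
        }
        // try-with-resources closes the file even if reading fails.
        // UTF-8 is an assumption; use the encoding of the actual export.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            // Placeholder rule for illustration only: count non-empty lines.
            long n = countMatching(reader, line -> !line.isEmpty());
            System.out.println(n + " non-empty lines");
        }
    }
}
```

    Memory use here depends on the longest line rather than on the file size, which is why a 1 GB input needs no special handling. The equivalent Perl idiom is a while loop reading one line per iteration from a filehandle.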

    And yet, these days, and with both languages, that’s somewhat beside-the-point.   You want to find a way to do as little original work as possible ... by standing on the shoulders of giants.   “Do not do a thing already done, whatever it is.”

    Both Java and Perl have a rich complement of third-party modules ... of “stuff that you didn’t(!) have to write and debug” ... to help you along with whatever-it-is that you happen to be doing.   Therefore, this is where you should focus a lot of your attention.   Perl’s library is called CPAN, and Java actually has several libraries in common use.   The size of the file really does not matter.   Isn’t there an existing library that will help you with this?   Might there be one which exactly matches the description of this (as it turns out, not so unique ...) task that you have been assigned?   You need to determine this.

    “So, which one?”   Well, if you are already more-familiar with one, you probably want to go with that one.   And if not, look at the preceding paragraph.

    If you are not yet sure how to proceed ... step back and do a little research.   Investigate how you might most-efficiently get your job done given either of the two scenarios, without immediately committing to either one.   Then, make the choice that is most-appropriate for you.

Re: to process big text file, use java or perl?
by bulrush (Scribe) on Jul 12, 2014 at 16:04 UTC
    First, what do you know best, and will that work?

    I like Perl because it has regex. I'm not sure of the native regex support for Java. And Perl is SUPER fast on large text files.

    I've read a 500,000-line file from Excel (saved as tab-delimited) and looked for keys in it, line by line. It took about 3-5 seconds. Your 1 GB files should not be a problem unless you're on a really old server or a PC with little memory.
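
    For comparison, the same kind of key lookup over a tab-delimited file could look roughly like this in Java. This is only a sketch under stated assumptions: the class KeyLookup, the method findRows, the choice of column 0 as the key column, and the UTF-8 encoding are all invented for illustration and are not taken from the post above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: returns the rows whose key column holds a wanted key.
class KeyLookup {

    static List<String> findRows(BufferedReader reader, int keyColumn, Set<String> wanted) {
        List<String> hits = new ArrayList<>();
        reader.lines().forEach(line -> {
            // Limit -1 keeps trailing empty fields instead of dropping them.
            String[] fields = line.split("\t", -1);
            // Rows too short to have the key column are skipped.
            if (keyColumn < fields.length && wanted.contains(fields[keyColumn])) {
                hits.add(line);
            }
        });
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java KeyLookup <file> <key> [<key> ...]");
            return;
        }
        // Every argument after the file name is treated as a key to look for.
        Set<String> wanted = new HashSet<>(Arrays.asList(args).subList(1, args.length));
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            // Assumption: the key lives in the first column (index 0).
            for (String row : findRows(reader, 0, wanted)) {
                System.out.println(row);
            }
        }
    }
}
```

    The hash set makes each key check roughly constant time, so the run time is dominated by reading the file. Only the matching rows are kept in memory; if most rows match, printing them as they are found would be the safer choice.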

    Perl 5.8.8 on Redhat Linux RHEL 5.5.56 (64-bit)