Hello Monks,

The details of my query follow:

I have a very large TSV (tab-separated values) file (where large > 30 GB).

I want to extract the lines from this file that don't end with an empty last field. Since this is a TSV file, those are the lines that don't end with \t\n; testing for that is trivial and not the subject of this question. It will remove some 75% of the lines right off the bat.

Then I want to extract a small subset of fields from the remaining lines. The fields are not contiguous, but they are few in number (say, fourteen out of thirty-odd). For example, fields 2, 3, 12-18, 25-28, and 31.

The lines I am extracting from are very long, most of them around 1,000 characters, because they contain a large number of tab-delimited fields.

One obvious option is the following simple code, which I've tried to format nicely and comment to show my reasoning:

use warnings;
use strict;
# I am using the latest stable version of Perl for this exercise
use 5.30.0;

while (<>)
{
  # Skip lines ending with an empty field
  next if substr($_,-2) eq "\t\n";
  
  # Remove "\n"
  chomp;
 
  # Split matching lines into fields on "\t", creating @fields
  my @fields=split(/\t/,$_);

  # Copy only the desired fields from @fields to create a new
  # line in TSV format (note: the slice indexes are zero-based)
  # This can be done in one simple step in Perl, using
  # array slices and the join() function
  my $new_line=join("\t",@fields[2,3,12..18,25..28,31]);

  # ...
}

But using split, although easy, does extra parsing (beyond the last field I need) and produces a complete array of fields, which I also don't need. I think it would be more efficient not to create the array, but instead to parse each line looking for tabs, counting field indexes as I go, building the output line on the way, and stopping at the last field I need.
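To make the idea concrete, here is a minimal sketch of the tab-counting approach I have in mind. The name extract_fields is just something I made up for illustration, and it assumes the line has already been chomped and that the wanted indexes are zero-based and in ascending order. I have not benchmarked it, and since split does its work in C while this loop runs at the Perl level, it may well turn out to be slower despite doing less parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): return the wanted fields
# of a chomped TSV line, joined by tabs, without scanning past the
# last wanted field.
# $wanted is an array ref of zero-based indexes in ascending order.
sub extract_fields {
    my ($line, $wanted) = @_;
    my @out;
    my ($pos, $field, $w) = (0, 0, 0);

    while ($w < @$wanted) {
        # Find the end of the current field
        my $tab = index($line, "\t", $pos);
        my $end = $tab < 0 ? length($line) : $tab;

        # Keep the field if it is the next one we want
        if ($field == $wanted->[$w]) {
            push @out, substr($line, $pos, $end - $pos);
            $w++;
        }

        last if $tab < 0;    # no more fields on this line
        $pos = $tab + 1;
        $field++;
    }
    return join("\t", @out);
}
```

Unlike the slice version, a line with fewer fields than expected silently yields a shorter result here, rather than undef elements and warnings.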

Am I correct in my assessment, or is just doing a simple split, followed by joined slices of the fields of interest, the best way to go here from a performance perspective?
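A possible middle ground, sketched below, is to keep split but pass its third argument (LIMIT), so that everything after the last field I need stays as one unsplit chunk. The name slice_with_limit is made up for illustration, and the sketch again assumes zero-based indexes in ascending order. Whether this actually saves a measurable amount of time on 1,000-character lines is something I would still have to measure.

```perl
use strict;
use warnings;

# Hypothetical helper: split only as far as the last wanted field.
# $wanted is an array ref of zero-based indexes in ascending order.
sub slice_with_limit {
    my ($line, $wanted) = @_;

    # To separate fields 0..N we need N+1 fields plus one more
    # element that holds the unsplit remainder of the line.
    my $limit = $wanted->[-1] + 2;

    my @fields = split(/\t/, $line, $limit);
    return join("\t", @fields[@$wanted]);
}

# Example usage with the field list from my question:
# my @wanted = (2, 3, 12..18, 25..28, 31);
# my $new_line = slice_with_limit($_, \@wanted);
```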

In reply to What is the most efficient way to split a long string (see body for details/constraints)? by mikegold10
