lee_crites has asked for the wisdom of the Perl Monks concerning the following question:

Nth Field Extraction

I am hoping for some ideas from the wonderful collective on how to do something better than I am currently doing it. Here is the issue: I need a function (it will eventually replace the one in a package I have) that is passed a delimiter string, a count, and a source string, and returns the string in the count'th field.

I cannot use split() because the delimiter might be multiple characters, and it comes in via a variable. I currently use a while loop with index, building an array with the starting positions of each field, and then use substr to grab out what I am looking for. Here is an example:

my $major_div = '!!';
my $user_div = ';';
my $var_div = ',';
my $full_string = 'abcd-efgh-ijkl-mnop;key1=data1,key2=data2;key1=data3,key2=data4!!qwer-asdf-zxcv-tyui;key1=data3;key3=data6!!trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl;key2=data5;key3=data6';

my $major_field = &field_split($major_div, 3, $full_string);
my $user_key = &field_split($user_div, 1, $major_field);
my $user_vars1 = &field_split($user_div, 2, $major_field);
my $user_vars2 = &field_split($user_div, 3, $major_field);
my $key_data = &field_split($var_div, 1, $user_vars1);

The example is rather ludicrous, I admit, but it shows what we are doing. Sometimes what we are doing is iterating over the string (the major fields), and processing them all, one at a time.

I have something that works, but the code is a dozen years old, and I have wanted to update it, but just haven't done it -- if it ain't broke, don't fix it, right? But this new task will be using the code a lot, so I'm hoping one of you Perl masters has already tuned a function that does this, and would be willing to share it.

Thanks muchly!

David Lee Crites
lee@critesclan.com


Replies are listed 'Best First'.
Re: nth field extraction by Corion (Patriarch) on Jul 27, 2018 at 07:32 UTC
I'm not sure why you can't use plain split for that?

    my $major_div = '!!';
    my $user_div = ';';
    my $var_div = ',';
    my $full_string = 'abcd-efgh-ijkl-mnop;key1=data1,key2=data2;key1=data3,key2=data4!!qwer-asdf-zxcv-tyui;key1=data3;key3=data6!!trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl;key2=data5;key3=data6';

    sub field_split {
        my ($sep, $field, $source) = @_;
        return (split /\Q$sep/, $source)[ $field-1 ]
    };

    my $major_field = &field_split($major_div, 3, $full_string);
    my $user_key = &field_split($user_div, 1, $major_field);
    my $user_vars1 = &field_split($user_div, 2, $major_field);
    my $user_vars2 = &field_split($user_div, 3, $major_field);
    my $key_data = &field_split($var_div, 1, $user_vars1);

But really, I would look at using Text::CSV_XS to read in the incoming (major) data and split it up into an array, and then split up the minor fields from that.
Re^2: nth field extraction by lee_crites (Scribe) on Jul 27, 2018 at 14:39 UTC
Thanks for the (probably obvious) pointer into using a variable in the split command. If I was writing this today, I'd probably have checked to see if that was a possibility. I have a vague memory from back when I wrote the function (15+/- years ago), and couldn't get that construct working.

I was just thinking about that, and remembered that the reason I wrote this was to deal with migrating data from a PICK system to a *nix system, back in the 90's -- so it is a tad older than 15 years... :O

Thanks for the help!

Lee Crites
lee@critesclan.com
Re^3: nth field extraction by AnomalousMonk (Archbishop) on Jul 29, 2018 at 18:43 UTC
Further to Marshall's post: For more info on `\Q \E` (and their `\L \l \U \u` pals), see also perlop, perlre and quotemeta.

Give a man a fish: `<%-{-{-{-<`
Re^3: nth field extraction by Marshall (Canon) on Jul 29, 2018 at 18:18 UTC
"Thanks for the (probably obvious) pointer into using a variable in the split command. If I was writing this today, I'd probably have checked to see if that was a possibility. I have a vague memory from back when I wrote the function (15+/- years ago), and couldn't get that construct working."

You might not have recognized the significance of \Q in `return (split /\Q$sep/, $source)[ $field-1 ];`

The \Q says to ignore any characters in $sep that would otherwise mean something to the regex engine. I often use a \Q...\E pair for this just to highlight this situation. Anyway, without \Q, if $sep contains something that matters to the regex engine, you will get confusing results.
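To make the point about \Q concrete, here is a minimal sketch. The delimiter `|` and the sample string are made up for illustration; they are not from the original post. Without \Q the `|` is treated as regex alternation rather than a literal character, so the split does not produce the three fields one might expect.

```perl
use strict;
use warnings;

# Illustrative values only (not from the thread).
my $sep    = '|';
my $source = 'alpha|beta|gamma';

# Without \Q, '|' means "empty pattern OR empty pattern", which matches
# between every pair of characters instead of at the literal '|'.
my @unquoted = split /$sep/, $source;

# With \Q...\E, the contents of $sep are matched literally.
my @quoted = split /\Q$sep\E/, $source;

print scalar(@unquoted), " fields without \\Q\n";
print scalar(@quoted), " fields with \\Q: @quoted\n";
```

The same applies to any delimiter containing characters such as `.`, `*`, `+`, `(` or `[`.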
Re: nth field extraction by anonymized user 468275 (Curate) on Jul 27, 2018 at 09:57 UTC
As I understand it, you have several layers of delimiters. One idea for reusable code would be something that converts this into a multi-dimensional array -- one dimension per delimiter. E.g.:

    my $aref = &fieldParse($fullString, '!!', ';', ',');

    sub fieldParse {
        my $source = shift;
        my $ret = [];
        my $delim = shift;
        defined($delim) or return $source;
        for (split $delim, $source) {
            push @$ret, &fieldParse($_, @_);
        }
        return $ret;
    }

which produces:

    $VAR1 = [
      [
        [ 'abcd-efgh-ijkl-mnop' ],
        [ 'key1=data1', 'key2=data2' ],
        [ 'key1=data3', 'key2=data4' ]
      ],
      [
        [ 'qwer-asdf-zxcv-tyui' ],
        [ 'key1=data3' ],
        [ 'key3=data6' ]
      ],
      [
        [ 'trew-hgfd-ytre-bvcx' ],
        [ 'key1=data7', 'key2=data8' ],
        [ 'key1=data9', 'key2=data10' ]
      ],
      [
        [ 'erty-dfgh-cvbn-hjkl' ],
        [ 'key2=data5' ],
        [ 'key3=data6' ]
      ]
    ];

Updated (handle case of false-value delimiter as someone suggested)

One world, one people
Re^2: nth field extraction by AnomalousMonk (Archbishop) on Jul 27, 2018 at 20:51 UTC
`my $delim = shift or return $source;`

This statement in `fieldParse()` makes me uneasy. The parse will fail if any `$*_div` is `'0'`. Perhaps unlikely, but still... A safer alternative IMHO would be:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $major_div = '!!';
     my $user_div = '0';
     my $var_div = ',';
     my $full_string = join $major_div,
       'abcd-efgh-ijkl-mnop0key1=data1,key2=data20key1=data3,key2=data4',
       'qwer-asdf-zxcv-tyui0key1=data30key3=data6',
       'trew-hgfd-ytre-bvcx0key1=data7,key2=data80key1=data9,key2=data10',
       'erty-dfgh-cvbn-hjkl0key2=data50key3=data6',
       ;
     print qq{full_string: <<$full_string>> \n};
     ;;
     my $aref = fieldParse($full_string, $major_div, $user_div, $var_div);
     dd $aref;
     ;;
     sub fieldParse {
       my $source = shift;
       return $source unless @_;
       ;;
       my $delim = shift;
       return [ map fieldParse($_, @_), split $delim, $source ];
       }
    "
    full_string: <<abcd-efgh-ijkl-mnop0key1=data1,key2=data20key1=data3,key2=data4!!qwer-asdf-zxcv-tyui0key1=data30key3=data6!!trew-hgfd-ytre-bvcx0key1=data7,key2=data80key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl0key2=data50key3=data6>>

    [
      [
        ["abcd-efgh-ijkl-mnop"],
        ["key1=data1", "key2=data2"],
        ["key1=data3", "key2=data4"],
      ],
      [["qwer-asdf-zxcv-tyui"], ["key1=data3"], ["key3=data6"]],
      [
        ["trew-hgfd-ytre-bvcx"],
        ["key1=data7", "key2=data8"],
        ["key1=data9", "key2=data1"],
      ],
      [["erty-dfgh-cvbn-hjkl"], ["key2=data5"], ["key3=data6"]],
    ]

(This version of the function still has some vulnerabilities, but I'm a bit more comfortable with it. :)

Give a man a fish: `<%-{-{-{-<`
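The post does not say which vulnerabilities remain. Two plausible candidates, offered here only as an assumption, are that `split $delim` treats the delimiter as a regex, and that split silently drops trailing empty fields. A hedged sketch of a variant that addresses both follows; the name `field_parse_literal` and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant of fieldParse(): each delimiter is matched
# literally via \Q...\E, and a limit of -1 keeps trailing empty fields.
sub field_parse_literal {
    my ($source, @delims) = @_;
    return $source unless @delims;    # no delimiters left: leaf value

    my $delim = shift @delims;
    return [ map { field_parse_literal($_, @delims) }
             split /\Q$delim\E/, $source, -1 ];
}

# Made-up input: '|' would be regex alternation without \Q.
my $tree = field_parse_literal('a|b,c|', '|', ',');
print scalar(@$tree), " top-level fields\n";
```

Note that splitting an empty string always yields an empty list, so an empty field at a non-leaf level comes back as an empty array reference.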
Re^2: nth field extraction by lee_crites (Scribe) on Jul 27, 2018 at 14:27 UTC
This is exactly the direction I was thinking of going! Thanks!!! I will be digesting this.

The problem I'm having is that I (re)process the string multiple times. That worked okay when I was doing it a few times -- perhaps several hundred or thousand times, total, in a run. But my best guess is that it will be run something between 1,500k and 3,000k times per run. Hence my hope for ideas on a better way.

Just for giggles and grins, I extracted the function I had into a standalone test script. It was probably at the top of my coding about 15+ years ago. Here it is:

    #!/usr/bin/env perl

    my $which = 3;
    my $div = '!!!';
    my $str = 'asdf' . $div . 'qwer' . $div . 'zxcv' . $div . 'hjkl' . $div . 'yuio' . $div . 'vbmn';
    my @stuff;
    my $spot = 0;
    my $result = index($str, $div, $spot);

    print "str: [$str]\n";

    # Collect every field that is followed by a delimiter.
    # Note: the field after the last delimiter is never pushed.
    while ($result != -1) {
        print "Found '$div' at $result\n";

        my $start_spot = ($spot ? $spot + length($div) - 1 : 0);
        my $field_length = ($spot ? $result - $spot - length($div) + 1 : $result - $spot);
        push @stuff, substr($str, $start_spot, $field_length);

        $spot = $result + 1;
        $result = index($str, $div, $spot);
    }

    print @stuff . "\n";    # number of fields collected
    print '-- #' . $which . '=' . $stuff[$which-1] . "\n";

I am continually amazed and pleased at the quality of the responses I get/see here on perlmonks! Thanks, y'all!!!

Lee Crites
lee@critesclan.com
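Since only one field is needed per call, one option is to keep the index/substr approach but stop scanning as soon as the requested field is found, instead of collecting every field first. The sketch below is an illustration under that assumption, not a tuned or benchmarked replacement; `nth_field` is a hypothetical name, and whether it beats the split-based version for a given data set would need measuring.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th (1-based) field of $source,
# or undef if there are fewer than $n fields.
sub nth_field {
    my ($sep, $n, $source) = @_;
    return if $n < 1 || !length $sep;

    my $start = 0;
    for (2 .. $n) {                        # skip the first $n-1 delimiters
        my $pos = index($source, $sep, $start);
        return if $pos < 0;                # ran out of delimiters
        $start = $pos + length $sep;
    }

    # The field runs up to the next delimiter, or to the end of the string.
    my $end = index($source, $sep, $start);
    return $end < 0 ? substr($source, $start)
                    : substr($source, $start, $end - $start);
}

my $third = nth_field('!!!', 3, 'asdf!!!qwer!!!zxcv!!!hjkl');
print "third field: $third\n";
```

Unlike the loop above, this also returns the last field, which has no delimiter after it.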
Re^3: nth field extraction by anonymized user 468275 (Curate) on Jul 30, 2018 at 14:37 UTC
In that case there could be a slight performance benefit in storing results in a hash, e.g.

    my %res;
    ...
    ...
    for my $fullString (however they are obtained) {
        $res{$fullString} ||= fieldParse( $fullString, etc. );
        etc...
    }

One world, one people
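A runnable sketch of that caching idea follows. The parser `parse_fields`, the counter `$parses` and the three input strings are hypothetical stand-ins chosen for illustration; the cache only helps if the same full string really does recur during a run.

```perl
use strict;
use warnings;

my %res;           # cache: source string => parsed array reference
my $parses = 0;    # counts real parses, for illustration only

# Hypothetical stand-in for fieldParse(); any parser could go here.
sub parse_fields {
    my ($source, $delim) = @_;
    $parses++;
    return [ split /\Q$delim\E/, $source ];
}

# Made-up inputs; the first and third strings are identical.
for my $full_string ('a!!b', 'c!!d', 'a!!b') {
    # ||= is safe here because an array reference is always true.
    my $parsed = $res{$full_string} ||= parse_fields($full_string, '!!');
    print "$full_string -> @$parsed\n";
}
print "parsed $parses times for 3 inputs\n";
```

The trade-off is memory: every distinct string and its parsed structure stay in `%res` for the life of the run.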
Re: nth field extraction by AnomalousMonk (Archbishop) on Jul 27, 2018 at 14:57 UTC
Some thoughts on the OP:

- You don't show the output you expect from the given input;
- You don't show the code of the `field_split()` function in its current state (this has been subsequently supplied here);
- You don't mention that the indices you're passing to the `field_split()` function are 1-based and not 0-based.

All these pieces of info would have been useful as the foundation of a helpful answer. Furthermore, a Test::More testing framework based on the current implementation would have been an enticement to a quick answer as well as a convenient way to present some of the items of information mentioned above; please see How to ask better questions using Test::More and sample data. See also the Short, Self Contained, Correct (Compilable), Example.

In short, please help us to help you.

Give a man a fish: `<%-{-{-{-<`
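As an illustration of what such a Test::More framework might look like, here is a small sketch built on the split-based `field_split()` posted earlier in the thread and the OP's sample string. The expected values are an assumption: the OP never stated them, so they are simply what that split-based implementation returns for 1-based indices.

```perl
use strict;
use warnings;
use Test::More;

# Split-based implementation from earlier in the thread.
sub field_split {
    my ($sep, $field, $source) = @_;
    return (split /\Q$sep/, $source)[ $field - 1 ];
}

my $full_string = 'abcd-efgh-ijkl-mnop;key1=data1,key2=data2;key1=data3,key2=data4!!qwer-asdf-zxcv-tyui;key1=data3;key3=data6!!trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl;key2=data5;key3=data6';

# Expected values below are assumed, not taken from the original post.
my $major_field = field_split('!!', 3, $full_string);
is($major_field,
   'trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10',
   'third major field');

my $user_key   = field_split(';', 1, $major_field);
my $user_vars1 = field_split(';', 2, $major_field);
my $user_vars2 = field_split(';', 3, $major_field);
my $key_data   = field_split(',', 1, $user_vars1);

is($user_key,   'trew-hgfd-ytre-bvcx',    'user key');
is($user_vars1, 'key1=data7,key2=data8',  'first user vars');
is($user_vars2, 'key1=data9,key2=data10', 'second user vars');
is($key_data,   'key1=data7',             'first key/data pair');

done_testing();
```

A script like this states the expected output, shows the 1-based indexing, and gives anyone proposing a replacement function something to run it against.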