lee_crites has asked for the wisdom of the Perl Monks concerning the following question:

Nth Field Extraction

I am hoping for some ideas from the wonderful collective on how to do something better than I am currently doing it. Here is the issue: I need a function (it will eventually replace the one in a package I have) that is passed a delimiter string, a count, and a source string, and returns the contents of the count'th field.

I cannot use split() because the delimiter might be multiple characters, and it comes in via a variable. I currently use a while loop with index, building an array with the starting positions of each field, and then use substr to grab out what I am looking for. Here is an example:

my $major_div = '!!';
my $user_div = ';';
my $var_div = ',';
my $full_string = 'abcd-efgh-ijkl-mnop;key1=data1,key2=data2;key1=data3,key2=data4!!qwer-asdf-zxcv-tyui;key1=data3;key3=data6!!trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl;key2=data5;key3=data6';

my $major_field = &field_split($major_div, 3, $full_string);
my $user_key = &field_split($user_div, 1, $major_field);
my $user_vars1 = &field_split($user_div, 2, $major_field);
my $user_vars2 = &field_split($user_div, 3, $major_field);
my $key_data = &field_split($var_div, 1, $user_vars1);

The example is rather ludicrous, I admit, but it shows what we are doing. Sometimes what we are doing is iterating over the string (the major fields), and processing them all, one at a time.

I have something that works, but the code is a dozen years old, and I have wanted to update it, but just haven't done it -- if it ain't broke, don't fix it, right? But this new task will be using the code a lot, so I'm hoping one of you Perl masters has already tuned a function that does this, and would be willing to share it.

Thanks muchly!

David Lee Crites
lee@critesclan.com

Replies are listed 'Best First'.
Re: nth field extraction
by Corion (Patriarch) on Jul 27, 2018 at 07:32 UTC

    I'm not sure why you can't use plain split for that?

    my $major_div = '!!';
    my $user_div = ';';
    my $var_div = ',';
    my $full_string = 'abcd-efgh-ijkl-mnop;key1=data1,key2=data2;key1=data3,key2=data4!!qwer-asdf-zxcv-tyui;key1=data3;key3=data6!!trew-hgfd-ytre-bvcx;key1=data7,key2=data8;key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl;key2=data5;key3=data6';

    sub field_split {
        my ($sep, $field, $source) = @_;
        return (split /\Q$sep/, $source)[ $field-1 ]
    };

    my $major_field = &field_split($major_div, 3, $full_string);
    my $user_key = &field_split($user_div, 1, $major_field);
    my $user_vars1 = &field_split($user_div, 2, $major_field);
    my $user_vars2 = &field_split($user_div, 3, $major_field);
    my $key_data = &field_split($var_div, 1, $user_vars1);

    But really, I would look at using Text::CSV_XS to read in the incoming (major) data and split it up into an array, and then split up the minor fields from that.

      Thanks for the (probably obvious) pointer on using a variable in the split command. If I were writing this today, I'd probably have checked to see if that was a possibility. I have a vague memory from back when I wrote the function (15+/- years ago) of not being able to get that construct working.

      I was just thinking about that, and remembered that the reason I wrote this was to deal with migrating data from a PICK system to a *nix system, back in the 90's -- so it is a tad older than 15 years... :O

      Thanks for the help!

      Lee Crites
      lee@critesclan.com
        Thanks for the (probably obvious) pointer on using a variable in the split command. If I were writing this today, I'd probably have checked to see if that was a possibility. I have a vague memory from back when I wrote the function (15+/- years ago) of not being able to get that construct working.

        You might not have recognized the significance of \Q in return (split /\Q$sep/, $source)[ $field-1 ]; The \Q tells the regex engine to treat every character in $sep literally, including characters that would otherwise be regex metacharacters. I often use a \Q...\E pair for this just to highlight this situation. Without \Q, if $sep contains something that matters to the regex engine, you will get confusing results.
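To make the \Q point concrete, here is a minimal sketch. The separator '|' and the sample string are made-up illustration data, not taken from the thread; '|' is chosen only because it is also the regex alternation operator.

```perl
use strict;
use warnings;

# Hypothetical data: fields separated by a literal '|', which is also
# the regex alternation operator.
my $sep    = '|';
my $source = 'alpha|beta|gamma';

# Without \Q the pattern is an empty alternation that can match the
# empty string, so split does not break on the '|' characters as intended.
my @unquoted = split /$sep/, $source;

# With \Q...\E the separator is matched literally.
my @quoted = split /\Q$sep\E/, $source;

print scalar(@unquoted), " fields without \\Q\n";
print scalar(@quoted), " fields with \\Q: @quoted\n";
```

The same applies to other common delimiters such as '.', '+', or '*'.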

Re: nth field extraction
by anonymized user 468275 (Curate) on Jul 27, 2018 at 09:57 UTC
    As I understand it you have several layers of delimiters. One idea for reusable code would be something that converts this into a multi-dimensional array -- one dimension per delimiter. E.g:
    my $aref = &fieldParse($fullString, '!!', ';', ',');

    sub fieldParse {
        my $source = shift;
        my $ret = [];
        my $delim = shift;
        defined($delim) or return $source;
        for (split $delim, $source) {
            push @$ret, &fieldParse($_, @_);
        }
        return $ret;
    }
    which produces:-
    $VAR1 = [
      [
        [ 'abcd-efgh-ijkl-mnop' ],
        [ 'key1=data1', 'key2=data2' ],
        [ 'key1=data +3', 'key2=data4' ]
      ],
      [
        [ 'qwer-asdf-zxcv-tyui' ],
        [ 'key1=data3' ],
        [ 'key3=data6' ]
      ],
      [
        [ 'trew-hgfd-yt +re-bvcx' ],
        [ 'key1=data7', 'key2=data8' ],
        [ 'key1=data9', 'key2=data10' ]
      ],
      [
        [ 'erty-dfgh-cvbn- +hjkl' ],
        [ 'key2=data5' ],
        [ 'key3=data6' ]
      ]
    ];
    Updated (handle case of false-value delimiter as someone suggested)

    One world, one people

      my $delim = shift or return $source;

      This statement in  fieldParse() makes me uneasy. The parse will fail if any  $*_div is '0'. Perhaps unlikely, but still... A safer alternative IMHO would be:

      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
      "my $major_div = '!!';
       my $user_div = '0';
       my $var_div = ',';
       my $full_string = join $major_div,
         'abcd-efgh-ijkl-mnop0key1=data1,key2=data20key1=data3,key2=data4',
         'qwer-asdf-zxcv-tyui0key1=data30key3=data6',
         'trew-hgfd-ytre-bvcx0key1=data7,key2=data80key1=data9,key2=data10',
         'erty-dfgh-cvbn-hjkl0key2=data50key3=data6',
         ;
       print qq{full_string: <<$full_string>> \n};
       ;;
       my $aref = fieldParse($full_string, $major_div, $user_div, $var_div);
       dd $aref;
       ;;
       sub fieldParse {
         my $source = shift;
         return $source unless @_;
         ;;
         my $delim = shift;
         return [ map fieldParse($_, @_), split $delim, $source ];
         }
      "
      full_string: <<abcd-efgh-ijkl-mnop0key1=data1,key2=data20key1=data3,key2=data4!!qwer-asdf-zxcv-tyui0key1=data30key3=data6!!trew-hgfd-ytre-bvcx0key1=data7,key2=data80key1=data9,key2=data10!!erty-dfgh-cvbn-hjkl0key2=data50key3=data6>>

      [
        [
          ["abcd-efgh-ijkl-mnop"],
          ["key1=data1", "key2=data2"],
          ["key1=data3", "key2=data4"],
        ],
        [["qwer-asdf-zxcv-tyui"], ["key1=data3"], ["key3=data6"]],
        [
          ["trew-hgfd-ytre-bvcx"],
          ["key1=data7", "key2=data8"],
          ["key1=data9", "key2=data1"],
        ],
        [["erty-dfgh-cvbn-hjkl"], ["key2=data5"], ["key3=data6"]],
      ]
      (This version of the function still has some vulnerabilities, but I'm a bit more comfortable with it. :)
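The post does not say which vulnerabilities remain, so the following is only a guess at two of them: delimiters containing regex metacharacters, and trailing empty fields that split drops by default. A sketch of a variant that addresses those two, under a hypothetical name field_parse_literal() and with made-up sample data:

```perl
use strict;
use warnings;

# Hypothetical variant of fieldParse(): each delimiter is quoted with
# \Q...\E so it is matched literally, and a limit of -1 tells split to
# keep trailing empty fields instead of discarding them.
sub field_parse_literal {
    my $source = shift;
    return $source unless @_;    # no delimiters left: return the leaf string

    my $delim = shift;
    return [ map { field_parse_literal($_, @_) }
             split /\Q$delim\E/, $source, -1 ];
}

# Made-up input: ',' separates records, '.' separates values.
my $parsed = field_parse_literal('a.b,c.d,', ',', '.');
print scalar(@$parsed), " top-level fields\n";
```

This does nothing for the case where the delimiter also occurs inside the data (as with '0' in the example above); that ambiguity is in the input format, not the parser.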


      Give a man a fish:  <%-{-{-{-<

      This is exactly the direction I was thinking of going! Thanks!!! I will be digesting this.

      The problem I'm having is that I (re)process the string multiple times. That worked okay when I was doing it a few times -- perhaps several hundred or thousand times, total, in a run. But my best guess is that it will be run something between 1,500k and 3,000k times per run. Hence my hope for ideas on a better way.

      Just for giggles and grins, I extracted the function I had into a standalone test script. It was probably at the top of my coding about 15+ years ago. Here it is:

      #!/usr/bin/env perl

      my $which = 3;
      my $div = '!!!';
      my $str = 'asdf' . $div . 'qwer' . $div . 'zxcv' . $div . 'hjkl' . $div . 'yuio' . $div . 'vbmn';
      my @stuff;
      my $spot = 0;
      my $result = index($str, $div, $spot);

      print "str: [$str]\n";
      while ($result != -1) {
          print "Found '$div' at $result\n";
          my $start_spot = ($spot ? $spot + length($div) - 1 : 0);
          my $field_length = ($spot ? $result - $spot - length($div) + 1 : $result - $spot);
          push @stuff, substr($str, $start_spot, $field_length);
          $spot = $result + 1;
          $result = index($str, $div, $spot);
      }

      print @stuff . "\n";
      print '-- #' . $which . '=' . @stuff[$which-1] . "\n";
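For comparison, a sketch of the same index/substr idea that stops scanning once the requested field is found, rather than collecting every field first. The name nth_field() and its return-undef-on-miss behaviour are assumptions for illustration, and whether it actually beats split for a given workload would need benchmarking.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based $n-th field of $source, or
# undef if there are fewer than $n fields. The delimiter is treated as a
# literal string, and scanning stops at the requested field.
sub nth_field {
    my ($sep, $n, $source) = @_;
    return if $n < 1 || !length($sep);

    # Skip over the first $n - 1 delimiters.
    my $start = 0;
    for (2 .. $n) {
        my $hit = index($source, $sep, $start);
        return if $hit < 0;              # ran out of fields
        $start = $hit + length($sep);
    }

    # The field runs to the next delimiter, or to the end of the string.
    my $end = index($source, $sep, $start);
    return $end < 0 ? substr($source, $start)
                    : substr($source, $start, $end - $start);
}

my $str = join '!!!', qw(asdf qwer zxcv hjkl);
print '-- #3=', nth_field('!!!', 3, $str), "\n";
```

Unlike the loop above, this version also returns the final field, which has no delimiter after it.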

      I am continually amazed and pleased at the quality of the responses I get/see here on perlmonks! Thanks, y'all!!!

      Lee Crites
      lee@critesclan.com
        In that case there could be a slight performance benefit in storing results in a hash, e.g.
        my %res;
        ...
        ...
        for my $fullString (however they are obtained) {
            $res{$fullString} ||= fieldParse( $fullString, etc. );
            etc...
        }
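A runnable sketch of that caching idea, assuming the same input strings recur within a run. parse_record() is a stand-in parser invented for the example (it just splits on ';' and counts its calls), and the three input strings are made up.

```perl
use strict;
use warnings;

my %cache;              # parsed structure per distinct input string
my $parse_calls = 0;    # how often the parser actually ran

# Stand-in parser (hypothetical): splits on ';' and counts invocations.
sub parse_record {
    my ($source) = @_;
    $parse_calls++;
    return [ split /;/, $source ];
}

for my $record ('a;b', 'c;d', 'a;b') {
    # ||= is safe here because an array reference is always true.
    my $fields = $cache{$record} ||= parse_record($record);
    print "$record -> $fields->[1]\n";
}
print "parser ran $parse_calls times\n";
```

The cache only helps when strings repeat; if every string is distinct it costs memory without saving any parsing.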

        One world, one people

Re: nth field extraction
by AnomalousMonk (Archbishop) on Jul 27, 2018 at 14:57 UTC

    Some thoughts on the OP:

    • You don't show the output you expect from the given input;
    • You don't show the code of the  field_split() function in its current state (this has been subsequently supplied here);
    • You don't mention that the indices you're passing to the  field_split() function are 1-based and not 0-based.
    All these pieces of info would have been useful as the foundation of a helpful answer. Furthermore, a Test::More testing framework based on the current implementation would have been an enticement to a quick answer as well as a convenient way to present some of the items of information mentioned above; please see How to ask better questions using Test::More and sample data. See also the Short, Self Contained, Correct (Compilable), Example.
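As an illustration of what such a test file might look like, here is a small sketch built around the field_split() from Corion's reply. The sample record and the expectation that an out-of-range index yields undef are assumptions made for the example, not requirements stated by the OP.

```perl
use strict;
use warnings;
use Test::More;

# field_split() as given in Corion's reply; indices are 1-based.
sub field_split {
    my ($sep, $field, $source) = @_;
    return (split /\Q$sep/, $source)[ $field - 1 ];
}

# Made-up sample record for the tests.
my $record = 'qwer-asdf;key1=data3;key3=data6';

is(field_split(';', 1, $record), 'qwer-asdf',  'first field (1-based)');
is(field_split(';', 3, $record), 'key3=data6', 'last field');
is(field_split('!!', 2, 'a!!b!!c'), 'b',       'multi-character delimiter');

# Assign in scalar context so a missing field is a plain undef.
my $missing = field_split(';', 9, $record);
is($missing, undef, 'out-of-range index gives undef');

done_testing;
```

A file like this states the input, the expected output, and the indexing convention in one place.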

    In short, please help us to help you.


    Give a man a fish:  <%-{-{-{-<