UPDATED EXAMPLE: Below is a program that splits a string into an array. It then scans array element 2 for a '?' followed by the scanword 'ekey='. If both are found, it blanks out the text after 'ekey=' in that element and appends that text as a new array element. Finally, it joins the array back into a string.

Problem: I have millions of input records of web data, each with 550 columns. Fields are variable-length, often null, and sometimes very long. I need to scan 10 columns in each record for 4 different 'scanwords'. When a scanword is found, I always remove the data from the original field and add the text after the scanword as a new element. Done this way, it is very slow.
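To make the requirement concrete, here is a minimal sketch of the per-record work described above. It is only an illustration under assumptions: the column numbers in @scan_cols and the second scanword 'skey=' are made up, process_record is a hypothetical name, and unlike my program below it requires the scanword to come after the '?' and appends nothing when no scanword is found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumptions for illustration: these column numbers and the scanword
# 'skey=' are invented; only 'ekey=' comes from the real data.
my @scan_cols = (2, 6);
my @scanwords = ('ekey=', 'skey=');

# Build the alternation once, outside the per-record loop.
my $alt = join '|', map { quotemeta($_) } @scanwords;
# $1 = '?' up to and including the scanword, $2 = value up to the next '&'.
my $re = qr/(\?.*?(?:$alt))([^&]*)/s;

sub process_record {
    my ($inrec) = @_;
    my @elements = split /;/, $inrec, -1;    # -1 keeps trailing empty fields
    my @found;
    # Only visit columns that exist; the loop variable aliases the
    # array element, so the substitution edits @elements directly.
    for my $field (@elements[ grep { $_ <= $#elements } @scan_cols ]) {
        # Overwrite the value with blanks of the same width and remember it.
        if ($field =~ s/$re/$1 . ' ' x length($2)/e) {
            push @found, $2;
        }
    }
    return join ';', @elements, @found;
}

print process_record('0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4'), "\n";
```

Whether this is actually faster than the index/substr approach is something I would have to benchmark on real records.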

I have seen something like this on the web: $a = (split(/\s+/, $line))[3], which I assume grabs just one column. Could I grab all 10 columns with similar syntax? Would it be faster? How would I update the fields in the original record?
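For reference, this is my understanding of how that slice syntax behaves, as a small sketch. The sample record and the column numbers in @wanted are made up for illustration. As far as I know, the list slice only returns copies of the chosen fields and split still processes the whole line, so to write changes back I would have to keep the full array.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample record; the real ones have about 550 fields.
my $line = 'a;b?x=1;c;d;e';

# List slice: copies of fields 1 and 3 only. Changing $f1 or $f3
# does not change $line.
my ($f1, $f3) = (split /;/, $line, -1)[1, 3];

# To update the record, keep the whole array and edit elements in place.
my @fields = split /;/, $line, -1;
my @wanted = (1, 3);                 # hypothetical column numbers to scan
for my $field (@fields[@wanted]) {
    # $field aliases the array element, so this modifies @fields.
    $field =~ s/\?.*//s;             # example edit: drop everything from '?'
}
my $outrec = join ';', @fields;

print "$f1 $f3\n";    # prints: b?x=1 d
print "$outrec\n";    # prints: a;b;c;d;e
```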

HELP!

#!/usr/bin/perl
use strict;
use warnings;

my $scanword;
my $wk1;
my $wk2;
my $wk3;
my $wk4;
my $wk5;
my $wk6;
my @elements;
my $i;
my $n1;
my $inrec1='0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9';
my $inrec2='0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9';
my $outrec;

#####
sub special() {
    $wk1=''; $wk2=''; $wk3=''; $wk4=''; $wk5=''; $wk6='';   #reset vars
    $wk1=$elements[$i];                                     #field number to scan
    $a=index($wk1,"?");                                     #look for ?
    if ( $a != -1 ) {                                       #? was found
        #Look for $scanword
        $wk2=index($wk1,"$scanword");                       #find start of scanword. I.E. "ekey="
        $wk5=index($wk1,"=",$wk2);                          #find start of "="
        $wk3=index($wk1,"&",$wk2);                          #find start of next &
        if ( $wk2 != -1 ) {                                 #found scanword?
            if ( $wk3 == -1) { $wk3 = length ($wk1); }      #default to length of string if ampersand not found
            $wk4 = substr($wk1,$wk2,$wk3-$wk2);             #wk1 is the field, wk2 is start of scanword, wk3 is the end position
            $wk6 = substr($wk1,$wk5+1,$wk3-$wk5-1);         #wk1 is the field, wk5+1 is byte after = in the scanword, wk3 is the end position
            print STDOUT "array element = $elements[$i] \n";
            print STDOUT "Found ? in the field at offset $a \n";
            print STDOUT "Found $scanword in the field at offset $wk2 \n";
            print STDOUT "Found end in the field at offset $wk3 \n";
            print STDOUT "Field wk4 = $wk4 \n";
            print STDOUT "Field wk6 = $wk6 \n";
            $b=$wk3-$wk5-1;                                 #length to blank
            substr($elements[$i],$wk5+1,$wk3-$wk5-1) = ' ' x $b;   #move blanks to array element
        }
    }
}
#####

print STDOUT "inrec1=$inrec1 \n";        #print inrec
@elements = split(';', $inrec1, -1);     #split by semicolon, -1 means to keep trailing fields if empty
$i=2;                                    #element offset
$scanword="ekey=";                       #text to scan for
special();                               #call routine
$n1=$wk6;                                #what was stripped out
push (@elements, $n1);                   #add to end of array
$outrec = join(";",@elements);           #convert to output record
print STDOUT "outrec=$outrec \n";        #print output record

print STDOUT "inrec2=$inrec2 \n";        #print inrec
@elements = split(';', $inrec2, -1);     #split by semicolon, -1 means to keep trailing fields if empty
$i=2;                                    #element offset
$scanword="ekey=";                       #text to scan for
special();                               #call routine
$n1=$wk6;                                #what was stripped out
push (@elements, $n1);                   #add to end of array
$outrec = join(";",@elements);           #convert to output record
print STDOUT "outrec=$outrec \n";        #print output record
exit;

OUTPUT EXAMPLES:

ekey is not in elements[2] so do nothing

inrec1=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9
outrec=0;1;2;3;4;5;6?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;7;8;9;

ekey is in elements[2] so blank it in the input record and move the following text BOZOTHECLOWN to a new element in the array

inrec2=0;1;2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210;3;4;5;6;7;8;9
array element = 2?link=misc/redirect/ekey=BOZOTHECLOWN&dmsdate=20150210
Found ? in the field at offset 1
Found ekey= in the field at offset 21
Found end in the field at offset 38
Field wk4 = ekey=BOZOTHECLOWN
Field wk6 = BOZOTHECLOWN
outrec=0;1;2?link=misc/redirect/ekey=            &dmsdate=20150210;3;4;5;6;7;8;9;BOZOTHECLOWN

In reply to split, manipulate, join by jc.smith3
