Extract data from CSV field.

JobC has asked for the wisdom of the Perl Monks concerning the following question:

I am new at using perl. I have a CSV that I am importing with Text::CSV_XS. Currently, I am able to clean up much of the poorly formed data generated by the venerable database output tool (thanks perl people). I would like to extract a piece of data from one field that has multiple fields in it. The original field is a long description that usually contains a #F123456, #123456, #123-F123456, #123-123456, or #12AB-123456 in it. This data floats around from left to right and there should be whitespace before the #. Also, the end of the data is either whitespace, or the end of the field. I could extract the data before or after I clean up the field. Here is a couple of examples of the original and modified data.

#Before:
,"TRAY,HINGED,PLSTC,20 CAV 
#F32473",

#After:
,"TRAY HINGED PLSTC 20 CAV #F32473",

#Before:
,"BOX HSC,35-3/4X17-1/4 X 50-1/2
SIMULATOR TALL BOX",

#After:
,"BOX HSC 35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX",

#Before:
,"PAD, FOAM, 24 X 24 X 1/4                #16193                      
+             
112 SHEETS PER ROLL, ORDER IN FULL ROLLS",

#After:
,"PAD FOAM 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL ORDER IN FULL ROLL
+S",

#Before:
,"PKG LIST,ASST ARM,RAD,300
#F37784",

#After:
,"PKG LIST ASST ARM RAD 300 #F37784",

#Before:
,"PAD, TOP CAP RE17-30048          #F30121                            
+          
CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4",

#After:
,"PAD TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY 22-7/8 X 21-1/8 X
+ 4-3/4","11.72000","281.28",
[download]

Here is some of the code I am using to process the file. I don't know if I can share all of it without quite a bit of obfuscation. This field would be "$row->5"

 while (my $row = $aCSV->getline ($ORIG_CSV)) {
    #Test row and determine if there is data. If the first element is 
+empty then that is the last line.  
    #This should prevent errors, and drop the extraneous data at the e
+nd of the file.
    if ($row->[0] =~ /\S/ ) {
         # Remove extra newline from field 5, 6 and 11 unless preceded
+ by a carriage return, fields start counting at Zero. 
         # Field 6 and 11 replaces with space for readabililty of the 
+data.
         $row->[4]  =~ s/ (?<!\x0D) \x0A //gx;
         $row->[5]  =~ s/ (?<!\x0D) \x0A / /gx;
         $row->[10] =~ s/ (?<!\x0D) \x0A / /gx;
         # cleanup or truncate fields. We only want data that is usabl
+e.
         # these are replacements so if the string within \Q \E doens'
+t exist it isn't replaced
         # They are specific for each field. REMEMBER Fields start cou
+nting at ZERO
         $row->[5]  =~ s/ \Q,\E / /gx; # replace comma in field with s
+pace. This makes CSV easier to import. 
         $row->[9]  =~ s/ \QRevision : \E //gx;
         $row->[10] =~ s/ \Q,\E / /gx; # replace comma in field with s
+pace. This makes CSV easier to import.
         $row->[12] =~ s/ \QProduction Item: \E//gx;
         $row->[13] =~ s/ \x23 //gx;    # hexidecimal 0x23 is "#" and 
+this messes with the regex in one step, but two works.
         $row->[13] =~ s/ \QWO  : \E //gx; # remove the remainder junk
+ "WO  : "
         $row->[14] =~ s/ \QCutlist color: \E //gx;
         # Some fields have extra whitespace. Lets clean that up becau
+se we want to print some of this data. 
         $row->[5]  =~ s/\h+/ /g;
         $row->[10] =~ s/\h+/ /g;
         
         # Something to test on $row->[5] /#[^ ]+/
         
         # Testing print some fields. REMEMBER 1 aka Zero
         print "$row->[5]\t$row->[9]\t$row->[12]\t$row->[13]\n";
         push @rows, $row;
         } else { 
              last;
              }
    }
[download]

Since I am new at this please tell me if I should do something different for clarity. Also, if there is a more efficient way to do this then clarity probably wins out, for later troubleshooting. Thanks!

Comment on Extract data from CSV field. Select or Download Code

Replies are listed 'Best First'.
Re: Extract data from CSV field. by ww (Archbishop) on Dec 08, 2015 at 22:32 UTC
I'm not sure this is responsive, but I'm trying to ignore the fields other than those for which you give an example, in the surmise that that's your problem area: #!/usr/bin/perl -w use 5.018; use strict; #1149716 =head I would like to extract a piece of data from one field that has multip +le fields in it. The original field is a long description that usuall +y contains a #F123456, #123456, #123-F123456, #123-123456, or #12AB-1 +23456 in it. This data floats around from left to right and there sho +uld be whitespace before the #. Also, the end of the data is either w +hitespace, or the end of the field. =cut my @data = ("TRAY HINGED PLSTC 20 CAV #F32473", "BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX", "PAD, FOAM, 24 X 24 X 1/4 #16193 + 112 SHEETS PER ROLL, ORDER IN FULL ROLLS", "PKG LIST,ASST ARM,RAD,300 #F37784", "PAD, TOP CAP RE17-30048 #F30121 + CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4", "foo bar #379460 best F11", "F1234 SIMULATION", ); for my $data (@data) { # say "\t\|$data\|\n\n"; chomp $data; if ( $data =~ /\n/ ) { $data =~ s/\n//g; } if ( $data =~ /(^.* #[A-Z]\d+.$)/m ) { say "\n\$data matches regex\n"; $data =~ s/ +/ /g; # clean up excess spaces say "$data \n"; } else { say "\n\t The data, $data, does NOT MATCH\n"; } } [download] The regular expression may be obscure: here's an explanation: C:perl -MYAPE::Regex::Explain -e " print YAPE::Regex::Explain->new(qr/ +(^.* #[A-Z]\d+.$)/)->explain();" The regular expression: (?-imsx:(^.* #[A-Z]\d+.$)) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ( group and capture to \1: # NB: I did NOT need the parens as there's no use of the capture # My bad, but harmless except for shoving bits &amp bytes around # when they didn't need to be disturbed. ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- # ' #' ---------------------------------------------------------------------- [A-Z]* any character of: 'A' to 'Z' (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \d+ digits (0-9) (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- $ before an optional \n, and the end of the string ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- ) end of grouping ---------------------------------------------------------------------- <p>And the output is thus:</p> <c>C:1149716.pl $data matches regex TRAY HINGED PLSTC 20 CAV #F32473 The data, BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX, + does NOT MATCH $data matches regex PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL ROL +LS $data matches regex PKG LIST,ASST ARM,RAD,300 #F37784 $data matches regex PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X + 4-3/4 $data matches regex foo bar #379460 best F11 The data, F1234 SIMULATION, does NOT MATCH [download] and here's the output of my code: `$data matches regex TRAY HINGED PLSTC 20 CAV #F32473 The data, BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX, + does NOT MATCH $data matches regex PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL ROL +LS $data matches regex PKG LIST,ASST ARM,RAD,300 #F37784 $data matches regex PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X + 4-3/4 $data matches regex foo bar #379460 best F11 The data, F1234 SIMULATION, does NOT MATCH` [download] HTH. Sometimes you'll get better answers if you trim your code to the mere few (<20) lines that demonstrate only the problem you want to address. I see you want more than what's here in terms of advice on the code you supplied but don't have time to try to create jumbled CSV that would give a shot at assessing the efficiency and/or clarity. Spirit of the Monastery	[reply] [d/l] [select]
Re: Extract data from CSV field. by poj (Abbot) on Dec 08, 2015 at 22:04 UTC
I'm guessing you want to extract just the #number part from a string. `#!perl use strict; my @data = ( "TRAY HINGED PLSTC 20 CAV #F32473", "BOX HSC 35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX", "PAD FOAM 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL ORDER IN FULL ROLLS +", "PKG LIST ASST ARM RAD 300 #F37784", "PAD TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY 22-7/8 X 21-1/8 X +4-3/4"); for my $str (@data){ if ($str =~ /(#[^ ]+)/){ print "'$1'\n"; } else { print "No match\n"; } }` [download] poj	[reply] [d/l]
Re^2: Extract data from CSV field. by JobC (Acolyte) on Dec 08, 2015 at 23:49 UTC
ww and stevieb I think poj has what I am looking to do. I was being thrown on how to handle the "#" since it is a comment operator. I thought I had to escape it. I'll test this and see if it solves my problem. Thanks!	[reply]
Re^3: Extract data from CSV field. by JobC (Acolyte) on Dec 09, 2015 at 17:31 UTC
This snippet correctly identifies the section of the CSV I want to extract. But I am doing perl wrong. I keep getting a "true" or "1" in my test output. `my $var = $row->[5] =~ /(#[^ ]+)/;` [download] I can do a replace with this code and overwrite the data I want to keep, I would bet I could use the "not" operator and replace only the other data, but I haven't tried it, and that may not be desirable for reporting purposes later on. How do I assign this "found" value to a new field in the modified dataset? I think I am missing something in my understanding of the syntax.	[reply] [d/l]
Re^4: Extract data from CSV field. by MidLifeXis (Monsignor) on Dec 09, 2015 at 18:15 UTC
Re^4: Extract data from CSV field. by poj (Abbot) on Dec 09, 2015 at 17:51 UTC
Re^5: Extract data from CSV field. by JobC (Acolyte) on Dec 09, 2015 at 18:15 UTC
Some notes below your chosen depth have not been shown here
Re: Extract data from CSV field. by stevieb (Canon) on Dec 08, 2015 at 21:21 UTC
Hi JobC, welcome to the Monastery! After writing my answer, it appears as though the `#after` lines are what you have already cleaned up (NOT your desired output). If that's correct, the below response is useless. Please clarify, and provide a few examples of what the final output should look like... I may be under-thinking this, but could it be as simple as the following? Let us know if it does what you want. I simulate each one of your rows with an element in the `@strings` array: `use warnings; use strict; my @strings = ( ',"PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL ROLLS",', ',"PKG LIST,ASST ARM,RAD,300 #F37784",', ',"PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4",' ); for (@strings){ s/\s+/ /g; print "$_\n\n"; }` [download] Output: `,"PAD, FOAM, 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL, ORDER IN FULL R +OLLS", ,"PKG LIST,ASST ARM,RAD,300 #F37784", ,"PAD, TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 + X 4-3/4",` [download]	[reply] [d/l] [select]
Re: Extract data from CSV field. by kcott (Archbishop) on Dec 09, 2015 at 07:27 UTC
G'day JobC, Welcome to the Monastery. I believe you can perform all the modifications with this transliteration: `y/, \n/ /s` [download] Here's my test: `#!/usr/bin/env perl -l use strict; use warnings; my @data = ( q{"TRAY,HINGED,PLSTC,20 CAV #F32473"}, q{"BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX"}, q{"PAD, FOAM, 24 X 24 X 1/4 #16193 + 112 SHEETS PER ROLL, ORDER IN FULL ROLLS"}, q{"PKG LIST,ASST ARM,RAD,300 #F37784"}, q{"PAD, TOP CAP RE17-30048 #F30121 + CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4"} ); print 'Before:'; print for @data; print 'After:'; for (@data) { y/, \n/ /s; print; }` [download] Output: Before: "TRAY,HINGED,PLSTC,20 CAV #F32473" "BOX HSC,35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX" "PAD, FOAM, 24 X 24 X 1/4 #16193 + 112 SHEETS PER ROLL, ORDER IN FULL ROLLS" "PKG LIST,ASST ARM,RAD,300 #F37784" "PAD, TOP CAP RE17-30048 #F30121 + CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4" After: "TRAY HINGED PLSTC 20 CAV #F32473" "BOX HSC 35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX" "PAD FOAM 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL ORDER IN FULL ROLLS +" "PKG LIST ASST ARM RAD 300 #F37784" "PAD TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY 22-7/8 X 21-1/8 X +4-3/4" [download] I've made two assumptions: The commas surrounding each `"..."` are field separators; so, given `,"...",` the actual field is `"..."` The `"11.72000","281.28",` at the end of your last "#After:" shouldn't be there: I'm guessing a tad too much copying prior to pasting. See "perlop: Quote-Like Operators" for details of `y///`. Also note that `y///` and `tr///` are synonymous. You'll probably find that `y///` is faster than `s///` (see "perlperf - Perl Performance and Optimization Techniques"). — Ken	[reply] [d/l] [select]
Re^2: Extract data from CSV field. by JobC (Acolyte) on Dec 09, 2015 at 17:36 UTC
kcott, thanks for that answer. I'll read that document. As this script will probably need to handle much more than my test data I would like it to be efficient. Oh, and your assumptions were correct.	[reply]
Re: Extract data from CSV field. by JobC (Acolyte) on Dec 08, 2015 at 20:46 UTC
I was thinking about this some more after reading "perldoc -q string". It seems like I could create a while loop that walks through the field one character at a time when it reaches the "#" it writes to a new string until it reaches whitespace or the end of the field. However, it also seems like this should be a function of perl or its modules. The perldoc was helpful but didn't directly answer my questions. Perhaps something like a word function?	[reply]