comment on

I am new at using perl. I have a CSV that I am importing with Text::CSV_XS. Currently, I am able to clean up much of the poorly formed data generated by the venerable database output tool (thanks perl people). I would like to extract a piece of data from one field that has multiple fields in it. The original field is a long description that usually contains a #F123456, #123456, #123-F123456, #123-123456, or #12AB-123456 in it. This data floats around from left to right and there should be whitespace before the #. Also, the end of the data is either whitespace, or the end of the field. I could extract the data before or after I clean up the field. Here is a couple of examples of the original and modified data.

#Before:
,"TRAY,HINGED,PLSTC,20 CAV 
#F32473",

#After:
,"TRAY HINGED PLSTC 20 CAV #F32473",

#Before:
,"BOX HSC,35-3/4X17-1/4 X 50-1/2
SIMULATOR TALL BOX",

#After:
,"BOX HSC 35-3/4X17-1/4 X 50-1/2 SIMULATOR TALL BOX",

#Before:
,"PAD, FOAM, 24 X 24 X 1/4                #16193                      
+             
112 SHEETS PER ROLL, ORDER IN FULL ROLLS",

#After:
,"PAD FOAM 24 X 24 X 1/4 #16193 112 SHEETS PER ROLL ORDER IN FULL ROLL
+S",

#Before:
,"PKG LIST,ASST ARM,RAD,300
#F37784",

#After:
,"PKG LIST ASST ARM RAD 300 #F37784",

#Before:
,"PAD, TOP CAP RE17-30048          #F30121                            
+          
CORRUGATED ASSEMBLY, 22-7/8 X 21-1/8 X 4-3/4",

#After:
,"PAD TOP CAP RE17-30048 #F30121 CORRUGATED ASSEMBLY 22-7/8 X 21-1/8 X
+ 4-3/4","11.72000","281.28",
[download]

Here is some of the code I am using to process the file. I don't know if I can share all of it without quite a bit of obfuscation. This field would be "$row->5"

 while (my $row = $aCSV->getline ($ORIG_CSV)) {
    #Test row and determine if there is data. If the first element is 
+empty then that is the last line.  
    #This should prevent errors, and drop the extraneous data at the e
+nd of the file.
    if ($row->[0] =~ /\S/ ) {
         # Remove extra newline from field 5, 6 and 11 unless preceded
+ by a carriage return, fields start counting at Zero. 
         # Field 6 and 11 replaces with space for readabililty of the 
+data.
         $row->[4]  =~ s/ (?<!\x0D) \x0A //gx;
         $row->[5]  =~ s/ (?<!\x0D) \x0A / /gx;
         $row->[10] =~ s/ (?<!\x0D) \x0A / /gx;
         # cleanup or truncate fields. We only want data that is usabl
+e.
         # these are replacements so if the string within \Q \E doens'
+t exist it isn't replaced
         # They are specific for each field. REMEMBER Fields start cou
+nting at ZERO
         $row->[5]  =~ s/ \Q,\E / /gx; # replace comma in field with s
+pace. This makes CSV easier to import. 
         $row->[9]  =~ s/ \QRevision : \E //gx;
         $row->[10] =~ s/ \Q,\E / /gx; # replace comma in field with s
+pace. This makes CSV easier to import.
         $row->[12] =~ s/ \QProduction Item: \E//gx;
         $row->[13] =~ s/ \x23 //gx;    # hexidecimal 0x23 is "#" and 
+this messes with the regex in one step, but two works.
         $row->[13] =~ s/ \QWO  : \E //gx; # remove the remainder junk
+ "WO  : "
         $row->[14] =~ s/ \QCutlist color: \E //gx;
         # Some fields have extra whitespace. Lets clean that up becau
+se we want to print some of this data. 
         $row->[5]  =~ s/\h+/ /g;
         $row->[10] =~ s/\h+/ /g;
         
         # Something to test on $row->[5] /#[^ ]+/
         
         # Testing print some fields. REMEMBER 1 aka Zero
         print "$row->[5]\t$row->[9]\t$row->[12]\t$row->[13]\n";
         push @rows, $row;
         } else { 
              last;
              }
    }
[download]

Since I am new at this please tell me if I should do something different for clarity. Also, if there is a more efficient way to do this then clarity probably wins out, for later troubleshooting. Thanks!

In reply to Extract data from CSV field. by JobC

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.