La12 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

Scenario:

I have been given a comma delimited text file that I need to "clean-up" and convert into tab delimited. Easy enough if I use substitute to replace all commas with tabs, BUT in this file, there are company names that contain commas(nothin' is ever easy, is it?). For example the incoming text looks like:

client,ABC Company, jwilkens,jdoe,mdeart #this one is fine
client,Cornell,Thayil,Cameron & Sheppard, LLC, klivingston,amarison,pserton #this one is not
client,Doe,Smith & Randall, Inc, jwaters,pfloyd,jjoplin #this one is not

Now the one thing that is consistent (my example doesn't show this) is that the company names will always start at the 15th character from the left and end at the 25th character from the left. If a company name is shorter than 10 characters then spaces will be added to make it 10 characters.

My game plan is to

$foo = entire string (say 100+ characters); if {$foo contains a comma within the 15th thru 25th character from the + left then replace it with a "+"; #I can use any nonalpha character here }; replace all commas in $foo with "\t"; #replace all commas with tabs replace all "+" with commas; # replace all plus-signs with commas

My question is... How do I write the IF statement?
Also, someone wrote that this gets rid of empty spaces? If so, what does it substitute it with?

$foo =~ s/\s/\\g;
Thanks for your time.

Replies are listed 'Best First'.
Re: Help with substitution - - 15 Characters from the left
by blakem (Monsignor) on Aug 13, 2001 at 22:55 UTC
    If there is only one data field that can contain commas, I would recommend a slightly different approach.

    Split the line up on every comma.
    Grab the last M fields from the back.
    Grab the first N fields from the front.
    Join whatever is left using commas into a single field
    Join the three pieces together with tabs.

    Here is a quick example to get you up to speed.

    #!/usr/bin/perl use strict; my $line = 'john,doe,manager,Pepsi Cola, inc.,Detroit,Michigan'; my @fields = split(/,/,$line); # split on commas my @last2 = splice(@fields,-2); # Grab two fields off the back my @first3 = splice(@fields, 0,3); # Grab three fields from the front my @middle = @fields; # Whatever is left is the difficul +t "company" data my $middlefield = join(',',@middle); # join the middle portion back t +ogether print "First: $_\n" for (@first3); print "Middle: $_\n" for (@middle); print "Last: $_\n" for (@last2); print "\n"; my $newline = join("\t",@first3,$middlefield,@last2); # generate new +line print "Old: $line\n"; print "New: $newline\n";
    I didn't quite understand your regex, but if you want to collapse multiple spaces down into a single one you might try s/\s+/ /g.

    -Blake

Re: Help with substitution - - 15 Characters from the left
by dvergin (Monsignor) on Aug 13, 2001 at 23:10 UTC
    Here's an (almost) one-line solution.

    It took a few minutes, but this looks like a good prototype for your needs. You will likely need to adjust the tests on the position variable $i to fit your need more precisely...

    use strict; my $str = 'a,b,c,d,e,f,g,h,i,j,k,l,m,n'; print "$str\n"; my $i = 0; $str =~ s/(.)/$i++;($1 ne ',' or $i>15 && $i<25) ? $1 : '#'/ge; print "$str\n";
    This prints:
    a,b,c,d,e,f,g,h,i,j,k,l,m,n a#b#c#d#e#f#g#h,i,j,k,l,m#n
    I used '#' instead of "\t" for demo purposes so things would line up nicely in the print test.

    Note: along the way I tried using pos instead of $i, but apparently it is undefined while still in the right side of the s///. Update: Hofmator is correct to say I am wrong to imply that pos would work outside the s///.   pos is not set at all for s///.

    Note 2: I offer this as a solution to the problem as you described it but it seems very odd that you would have a list of companies with names that are exactly 10 characters long.

      re: note 2

      The actual is number is 23 characters. I gave 10 in my example 'cause it's a nice round number. True, my sample data does not show this, that's why I said...

      "Now the one thing that is consistent (my example doesn't show this) is that the company names will always start at the 15th character from the left and end at the 25th character from the left..."

      If a company name has more than 23 characters, the name would get cut off (I have no control over the data that is given to me).

      Thank you for your example. I think that it is exactly what I needed. I will also look at pos() as I am unfamiliar with it.

      Note: along the way I tried using pos() instead of $i, but apparently it is undefined while still in the right side of the s///
      well, this is right, but it is also undefined outside the s/// construct!! That's because (from perldoc)
      pos returns the offset of where the last "m//g" search left off
      and that's it. Pos is not set with s///.

      -- Hofmator

Re: Help with substitution - - 15 Characters from the left
by THRAK (Monk) on Aug 13, 2001 at 22:49 UTC
    There's probably better/more efficient ways to do this, but this might do what you are asking. I'm not sure because your question isn't real clear. This snippet will tab delimit the data after replacing the comma's and truncating the end spaces in the second (pos 15-25) field. In the future it is in your best interest to provide a clearer explaination and an exact representation of the data you want to parse. People will help you here, but you need to do your part also.
    #!/usr/local/bin/perl -w use strict; my @data = ('client-up-to-,Doe,Smith , Inc, jwaters,pfloyd,jjoplin'); foreach my $line (@data) { chomp $line; my $beg_line = substr($line, 0, 14); my $name_field = substr($line, 14, 10); my $end_line = substr($line, 25); print "B: ~$name_field~\n"; $name_field =~ tr/,/ /; $name_field =~ s/\s*$//; print "A: ~$name_field~\n"; $line = "${beg_line}${name_field}${end_line}"; $line =~ tr/,/\t/; print "~$line~\n"; } __END__ B: ~Doe,Smith ~ #note space is still in at this point A: ~Doe Smith~ ~client-up-to- Doe Smith Inc jwaters pfloyd jjoplin~


    -THRAK
    www.polarlava.com
Re: Help with substitution - - 15 Characters from the left
by IraTarball (Monk) on Aug 13, 2001 at 22:51 UTC
    I have another tack on this problem, it might not be better though. If the number of fields is consistent and the only field that contains commas is the company name you could rebuild the company name after splitting. The following seems to work.

    use strict; use warnings; my $max = 4; #Largest expected index. while (<DATA>) { s/#.*$//; #Clear out the comments chomp; my @foo = split /,/; my $company = $foo[1]; #prime a variable with a company name if ($#foo > $max) { #If there are more elements than the ma +x... my $range = (scalar @foo) - $max; #Get the top of the range $company = join '+', @foo[1..$range]; #The company name is in +a slice } print "$company\n"; #do whatever with the data. } __END__ client,ABC Company, jwilkens,jdoe,mdeart #this one is fine client,Cornell,Thayil,Cameron & Sheppard, LLC, klivingston,amarison,ps +erton #thi s one is not client,Doe,Smith & Randall, Inc, jwaters,pfloyd,jjoplin #this one is n +ot
    I didn't bench this against substr and index, that might be more efficient, and I now this code could be compressed a bit but I thought this was a fairly clear way to write a sample.

    Ira,

    "So... What do all these little arrows mean?"
    ~unknown

Re: Help with substitution - - 15 Characters from the left
by kjherron (Pilgrim) on Aug 14, 2001 at 00:04 UTC
    To directly answer your question, something like this would work:
    substr($foo, 14, 10) =~ tr/,/+/; $foo =~ tr/,/\t/; substr($foo, 14, 10) =~ tr/+/,/;
    You may need to adjust the substr() arguments to make sure the correct range of characters is being selected. There's no point in testing the company name for commas before converting them to plusses; it's probably just as much work for perl to look for the commas as it would be to go ahead and convert them.

    One problem with this is that you may eventually have a company name with plusses in it. Besides that, it's three steps. I'm guessing company names won't have tabs in them, so the following works just as well and is only two steps:

    $foo =~ tr/,/\t/; substr($foo, 14, 10) =~ tr/\t/,/;
Re: Help with substitution - - 15 Characters from the left
by Hofmator (Curate) on Aug 13, 2001 at 22:32 UTC

    For your if statement take a look at substr and index.

    $foo =~ s/\s/\\g;
    The code above is not working, what would work is: $foo =~ s/\s//g; # not efficient which substitutes every whitespace (\s, i.e. space, tab, newline, ...) with nothing. Thus every whitespace is deleted. But I don't see how that could help you for your problem.

    For your replacements you should look at the tr/// operator.

    -- Hofmator

      The last question about the spaces was not related to the main problem. I just saw someone post it and thought that you could not substitute something with nothing. I will check out tr///.