angela2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am trying to split a file on whitespace and thankfully I tried my code with various examples of columns as I found out it only works for column 1 :(
    #!/usr/bin/perl
    use warnings;
    use strict;

    my %splitting;
    open my $FILE, '<', 'input_file' or die $!;
    while (<$FILE>) {
        chomp;
        my @columns = split ' ';
        $splitting{$columns[0]} = [@columns[1 .. $#columns]];
        print "@columns[3] \n";
    }
    close $FILE;

So if I print column 1 then it works fine, but it seems that some of my columns aren't whitespace separated but the numbers are stuck together like so: let's say first column is 15,567 and then second column is -25,324 and then third column is -45,234, this in some cases is written as:

15,567 -25,324-45,234

(no white space between columns 2 and 3)

This creates a problem and of course returns errors. How can I work around this? I hadn't realised that the columns aren't always space delimited as I'm working with files with over 100,000 lines so couldn't check them all. I tried reformatting the file with an awk line I use quite a lot (awk '{printf "%-5s %25s %-5s \n", $1, $2, $3 }' - used "25s" just as an extreme number to easily see what's happening with my columns) but it doesn't help because again the file has to be space delimited.

Do you have any ideas about how I could reformat my columns so that they have a consistent style? I mean, I know how to use printf but the problem now is that I don't know how to tell perl where each column finishes and the next one begins.

Replies are listed 'Best First'.
Re: problem with splitting file on whitespace: how to circumvent inconsistent formatting through file
by BrowserUk (Patriarch) on Jul 04, 2016 at 12:10 UTC

    This should work:

    print join "\t", split ' *(?=[^0-9,])', '15,567 -25,324-45,234';

    Output:

    15,567 -25,324 -45,234

    As a one-liner:

    perl -nle"print join qq[\t], split ' *(?=[^0-9,])'" < bad.file > good.file

    Don't forget to swap 's and "s if you're on *nix.
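
For use inside a script rather than on the command line, the same split can be wrapped in a small function. This is only a minimal sketch: the helper name split_columns is hypothetical, and the trimming step is an addition that is not part of the one-liner above. It is there because a field may keep a leading space when the space is followed by a digit rather than a minus sign.

```perl
use strict;
use warnings;

# split_columns() is a hypothetical helper name, not part of the original reply.
sub split_columns {
    my ($line) = @_;

    # Split before every character that is not a digit or a comma,
    # swallowing any spaces that directly precede the split point.
    my @fields = split / *(?=[^0-9,])/, $line;

    # The lookahead also accepts a space, so a field may keep a leading
    # space when a positive number follows it. Trim to be safe.
    s/^\s+// for @fields;

    return @fields;
}

for my $line ('15,567 -25,324-45,234', '-13,345 53,562 13,452') {
    print join("\t", split_columns($line)), "\n";
}
```

Calling split_columns on each line read from the file should give one list of columns per line, whether or not the numbers were separated by spaces.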


      I don't know how I managed to get it wrong but after an hour of trying I made them both work :) Thank you very much!
Re: problem with splitting file on whitespace: how to circumvent inconsistent formatting through file
by Anonymous Monk on Jul 04, 2016 at 12:25 UTC

    Just a guess, are the columns fixed width? If so, unpack can help.

    while (<DATA>) {
        chomp;
        my @fields = unpack('a7 a7 a7');
        print join(" ", map {"\"$_\""} @fields), "\n";
    }
    __DATA__
     15,567-25,324-45,234
    -13,345 53,562 13,452
     -7,521-22,454-54,671

    Outputs:

    " 15,567" "-25,324" "-45,234"
    "-13,345" " 53,562" " 13,452"
    " -7,521" "-22,454" "-54,671"

      Hi and thanks for your answer, it's the one I think I understand the most, but it's very kind that you all took the time and tried to help. I'll also try to understand the other two answers hopefully.

      I wasn't successful implementing this, but I wonder whether that's because I haven't given you a clear picture of what my data looks like. This is probably a pretty silly way to show you, but here is a link to a screenshot of my data: https://www.dropbox.com/s/68xjbspn47jzmx9/Untitled.png?dl=0 . Sorry, I know this is really random, but I have no clue how to format the numbers properly so that you can see them.

      As you can see I have 4 columns, the last one is always 1 digit and the rest always have 3 decimal digits. Can your solution (with some editing) still be implemented on my data?

        Can't you just copy and paste some of your data here inside <code> tags? Those will preserve formatting.

        Anyway, from this small sample it does look like your columns are fixed width. The first one looks to have either 7 or 8 characters, the second and third 8 characters - try an unpack pattern of "a8 a8 a8" and see what you get. Adjust and extend this pattern as is appropriate for your input data, for example to add another column "a8 a8 a8 a6" and so on.

        For a tutorial see "Packing Text" in perlpacktut.
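
The fixed-width suggestion above can be sketched roughly as follows. The widths in the template (8, 8, 8, 6), the sample values, and the helper name parse_fixed are all assumptions for illustration. The real widths have to be read off the actual file. The sample rows are built with sprintf so that each column is guaranteed to have the assumed width.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): three 8-character numeric columns
# followed by one 6-character column holding a single digit.
my $template = 'a8 a8 a8 a6';

# parse_fixed() is a hypothetical helper name.
sub parse_fixed {
    my ($line) = @_;
    my @fields = unpack $template, $line;   # cut the line at fixed offsets
    s/^\s+|\s+$//g for @fields;             # strip the padding around each value
    return @fields;
}

# Made-up sample rows, formatted to the assumed widths.
my @sample = (
    sprintf('%8.3f%8.3f%8.3f%6d',  15.567,  -25.324, -45.234, 1),
    sprintf('%8.3f%8.3f%8.3f%6d', -13.345, -153.562,  13.452, 0),
);

print join("\t", parse_fixed($_)), "\n" for @sample;
```

Because unpack cuts by position, this works even when a negative number fills its column completely and touches the previous one, which is the case that breaks a whitespace split.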

Re: problem with splitting file on whitespace: how to circumvent inconsistent formatting through file
by GotToBTru (Prior) on Jul 04, 2016 at 12:19 UTC

    Try:

    while (<$FILE>) {
        chomp(my $line = $_);
        # my @columns = $line =~ m/\s*(-?\d+)/g;    # first version: \d alone breaks the numbers at the commas
        my @columns = $line =~ m/\s*(-?[\d,]+)/g;
    }

    Update: thanks to Lotus1 for the correction. I had it right on my computer, but failed to update the post.


      Hi GotToBTru, \d does not match commas so your program splits up the numbers. If you add the comma in a character class it works.

      use warnings;
      use strict;

      while (<DATA>) {
          chomp(my $line = $_);
          my @columns = $line =~ m/\s*(-?\d+)/g;
          print "@columns\n";
      }
      __DATA__
      15,567 -25,324-45,234
      15,567-25,324-45,234
      -13,345 53,562 13,452
      -7,521-22,454-54,671

      Output:

      15 567 -25 324 -45 234
      15 567 -25 324 -45 234
      -13 345 53 562 13 452
      -7 521 -22 454 -54 671

      Here it is with the comma added.

      use warnings;
      use strict;

      while (my $line = <DATA>) {
          my @columns = $line =~ m/\s*(-?[,\d]+)/g;
          print "@columns\n";
      }
      __DATA__
      15,567 -25,324-45,234
      15,567-25,324-45,234
      -13,345 53,562 13,452
      -7,521-22,454-54,671

      Output:

      15,567 -25,324 -45,234
      15,567 -25,324 -45,234
      -13,345 53,562 13,452
      -7,521 -22,454 -54,671
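
Since the file is too long to check by eye, the corrected pattern could also be combined with the hash from the original question and a column-count check. The following is a sketch under assumptions: the expected column count of 3, the sample lines, and the helper name extract_columns are all made up for illustration.

```perl
use strict;
use warnings;

# extract_columns() is a hypothetical helper name. It pulls out every
# token made of an optional minus sign followed by digits and commas,
# whether or not whitespace separates the tokens.
sub extract_columns {
    my ($line) = @_;
    return $line =~ m/\s*(-?[,\d]+)/g;
}

my $expected = 3;    # assumed number of columns per line
my %splitting;

# Made-up sample lines; the last one is deliberately short.
my @lines = (
    '15,567 -25,324-45,234',
    '-13,345 53,562 13,452',
    '-7,521-22,454',
);

for my $line (@lines) {
    my @columns = extract_columns($line);
    if (@columns != $expected) {
        # Report suspicious lines instead of silently storing them.
        warn "Unexpected column count (" . scalar(@columns) . "): $line\n";
        next;
    }
    # First column is the key, remaining columns are the value.
    $splitting{ $columns[0] } = [ @columns[ 1 .. $#columns ] ];
}

print "$_ => @{ $splitting{$_} }\n" for sort keys %splitting;
```

The warning on a wrong column count gives a quick way to find lines whose formatting differs from the rest without reading all 100,000 of them.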
        Yes, exactly, this works correctly. Thank you both though, I really appreciate the help and explanations, this is an amazing website.
Re: problem with splitting file on whitespace: how to circumvent inconsistent formatting through file
by shadowsong (Pilgrim) on Jul 05, 2016 at 09:54 UTC

    Hi angela2

    I can see that you've already had some great responses, some more easily understood than others, so for posterity here are a couple of one-liners similar to what BrowserUk suggested in BrowserUk's reply above. They also take into account tab-delimited columns and numbers with decimal points.

    one-liner to print all columns

    perl -F"\s*(?=[^\d,\.])" -wanle "print qq[@F]" badfile.txt

    one-liner to print column 1

    perl -F"\s*(?=[^\d,\.])" -wanle "print $F[0]" badfile.txt

    ...use $F[n-1] to access column n, e.g. print $F[1] will print the value for column 2.

    quick explanation of command-line flags

    • -a and -F split each line into the @F array; -a turns the splitting on and -F specifies the pattern to split on (the default is whitespace)
    • -w sets the warning flag similar to the use warnings pragma
    • -n along with -p are the more commonly used command line switches both concerned with reading <ARGV>. -n provides us with the implicit construct
      while (<>) { "...your code here ..." }
    • -l chomps each input line (when used with -n or -p) and sets the output record separator $\ to the input record separator $/, which is \n by default. I often use it to preclude the need to add \n in my print statements
    • -e the most commonly used switch - it tells perl not to load and run a program file but to run the text following the -e as a program
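
To make the flags above more concrete, here is an approximate script-level equivalent of the "print all columns" one-liner. It is a sketch rather than the exact code perl generates: the function name process_lines is hypothetical, and an in-memory string stands in for badfile.txt so that the example is self-contained.

```perl
use strict;
use warnings;

# process_lines() is a hypothetical name. It mimics, approximately, what
#   perl -F'\s*(?=[^\d,.])' -wanle 'print "@F"' badfile.txt
# does (that command is quoted here for a *nix shell).
sub process_lines {
    my ($fh) = @_;
    my @out;
    while (my $line = <$fh>) {                    # -n: loop over every input line
        chomp $line;                              # -l: remove the trailing newline
        my @F = split /\s*(?=[^\d,.])/, $line;    # -a with -F: split the line into @F
        push @out, "@F";                          # -e code: "@F" joins fields with a space
    }
    return @out;
}

# In-memory sample data standing in for badfile.txt.
my $sample = "15,567 -25,324-45,234\n15,567-25,324-45,234\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for process_lines($fh);              # -l would append the "\n" automatically
close $fh;
```

Writing the loop out like this makes it easier to add further processing per line than squeezing everything into the -e string.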


    See perlrun for more info, as I have only scratched the surface...

    Best Wishes,
    shadowsong

      This is good stuff, thanks!