gorkemsarikaya has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody
I have a problem with the split function.
My input is:
1234 2321 0 45 1st
2122 sdsa 0 0 34
2313 dsad 43 2nd
1232 ffff 0 0 1st
3213 sadf 0 34
2133 dada 0 2nd

My goal is to split this data into columns. I wrote this script:

$in_file = "input.txt";
open (IN, "<$in_file") or die "Can't open $in_file: $!\n";
$out_file = "output.txt";
open (OUT, ">$out_file") or die "Can't open $out_file: $!\n";
while ( $line = <IN> ) {
    @fields = split(/\s+/,$line);
    print "$fields[2]\n";
}
close IN;
close OUT;

But when I print the 3rd column, I get this output:
0
0
43
0
0
0
But I want to see this, according to my input:
0
0

0

0

These spaces are driving me crazy. Somewhere in the input file there is more than one space between columns, and some columns also contain space characters. Please help me!

Replies are listed 'Best First'.
Re: To split with spaces
by Cristoforo (Curate) on Aug 04, 2013 at 20:13 UTC
    If it looks like:
    1234 2321 0 45 1st
    2122 sdsa 0 0 34
    2313 dsad 43 2nd
    1232 ffff 0 0 1st
    3213 sadf 0 34
    2133 dada 0 2nd
    with fixed width columns, you could use unpack or substr to parse it. If the columns are tab separated, you could split on tab.

    Chris

      Firstly, thank you for your kind and smart answer.

      You are right, the data is not clear; sorry for this, I am not familiar with HTML codes. My input data is exactly as you predicted:

      1234 2321 0 45 1st
      2122 sdsa 0 0 34
      2313 dsad 43 2nd
      1232 ffff 0 0 1st
      3213 sadf 0 34
      2133 dada 0 2nd

      As you can see, there is a different number of spaces between columns. So neither /\s/ nor /\s+/ works, because some columns contain whitespace characters. The substr function does not work either, for the same reason: substr does not see the whitespace characters and passes on to the next column. I hope I have explained my problem clearly :)

        To answer your question, it is necessary to know more about your data. There are 6 columns and in your sample the third and fourth columns are 0 or blank. Is it possible that they could be a 2 or 3 digit number, like 33 or 123? Is the fifth column always a 2 digit number like 45, 34, 43... or could it be 1 digit (or 3 digits)? It's hard to tell with this data whether fields fill in from the left or right.

        And I'm guessing that the leading whitespace before the 1st column isn't really there, but is just the way you pasted it in.

Re: To split with spaces
by Laurent_R (Canon) on Aug 04, 2013 at 21:07 UTC

    This is not really a Perl problem. Your problem is to define exactly what your input really looks like, in order to figure out whether the third column exists or is missing. In other words, the problem is to define the input format. Once we know that, writing the Perl program that can do what you need is probably very easy.

    As Cristoforo said, perhaps you have fixed-length fields, in which case unpack or substr are the likely candidates for the functions you want to use. If you have tab-separated fields, split is more likely to solve your problem. Or maybe the solution is a regular expression match. It could also be that splitting on a single space (rather than on multiple spaces with /\s+/), as suggested by 0day, is simply the solution. But we can't figure out exactly what your input file really looks like from your post, because it has probably been reformatted there. At the very least, please supply your input within code tags, so that we are more likely to understand its format.

    It would be even better to have a link to a sample of your input file, because if you copy and paste a section of the file, it is quite possible that tabs get copied as groups of spaces, which might make it difficult to understand the real format of the original file.

      Firstly, thank you for your kind answer.

      You are right, the data is not clear; sorry for this, I am not familiar with HTML codes. My input data is exactly:

      1234 2321 0 45 1st
      2122 sdsa 0 0 34
      2313 dsad 43 2nd
      1232 ffff 0 0 1st
      3213 sadf 0 34
      2133 dada 0 2nd

      As you can see, there is a different number of spaces between columns. So neither /\s/ nor /\s+/ works, because some columns contain whitespace characters. The substr function does not work either, for the same reason: substr does not see the whitespace characters and passes on to the next column. I hope I have explained my problem clearly :)

        Hi there, if that's the case then to parse all the fields something like:

        printf "|%4s|%4s|%2s|%2s|%2s|%3s|\n", map {s/\s+//g;$_} unpack "A11A5A3A3A3A*" for <DATA>;
        __DATA__
        1234 2321 0 45 1st
        2122 sdsa 0 0 34
        2313 dsad 43 2nd
        1232 ffff 0 0 1st
        3213 sadf 0 34
        2133 dada 0 2nd
        Would print:
        |1234|2321| 0|  |45|1st|
        |2122|sdsa| 0| 0|34|   |
        |2313|dsad|  |  |43|2nd|
        |1232|ffff| 0| 0|  |1st|
        |3213|sadf|  | 0|34|   |
        |2133|dada| 0|  |  |2nd|

        Now that we have a format that makes sense, i.e. a fixed-column format, this definitely looks like a job for the substr or unpack function. The problem is to find the right parameters (offset and length) to retrieve your fields. I can't run a test right now, but will come back to you when I can.

        UPDATE: actually, I had not seen that when I posted the above 3 minutes ago, but Davido and others have already given a solution. Probably no point to come back and give the same.

Re: To split with spaces
by ww (Archbishop) on Aug 04, 2013 at 21:13 UTC
    As posted, your data fields are separated by one or more spaces, and Line 3 has "43" as its third field (i.e. $fields[2]), so the result from your code is what you should expect. The same applies to Line 5. And I'm not absolutely clear about what you're trying to tell us in the last line of your post.
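
A minimal sketch of why the posted code prints 43 for the third row, assuming a row whose third and fourth columns are blank. The sample strings and the helper name third_token are illustrative only, not taken from the real file.

```perl
use strict;
use warnings;

# Return the third whitespace-separated token of a line,
# exactly as the posted loop does with $fields[2].
sub third_token {
    my ($line) = @_;
    my @fields = split /\s+/, $line;
    return $fields[2];
}

# Assumed sample rows: the second one has blank third and fourth columns,
# so a run of spaces separates "dsad" from "43".
my $full  = "1234 2321 0" . (" " x 5) . "45 1st";
my $blank = "2313 dsad" . (" " x 7) . "43 2nd";

print third_token($full),  "\n";   # prints 0
print third_token($blank), "\n";   # prints 43, not an empty string
```

Because /\s+/ treats any run of whitespace as one separator, a blank column simply disappears and the later fields shift left.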

    Your failure to use code tags (viz, the formatting instructions at the text entry box where you created your node) makes it difficult to tell exactly how you intended the data to be structured -- you used multiple non-breaking space entities, but did you do so to match the actual spaces (0x20) in your data or to make the rendered appearance like that of a table with tabs?

    In short, more information from you and closer attention to the local formatting directions will make it easier for us to help you.

    If I've misconstrued your question or the logic needed to answer it, I offer my apologies to all those electrons which were inconvenienced by the creation of this post.
Re: To split with spaces
by 0day (Sexton) on Aug 04, 2013 at 19:13 UTC
    Try: @fields = split(/\s/,$line);
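
A small sketch of what splitting on a single whitespace character does. The helper name split_single and the sample string are made up for illustration. Every extra space produces one empty field, so the index of a given column depends on how many spaces precede it.

```perl
use strict;
use warnings;

# Split on every single whitespace character.
# The limit of -1 keeps trailing empty fields as well.
sub split_single {
    my ($line) = @_;
    return split /\s/, $line, -1;
}

# "dsad", three spaces, "43": the three separators yield two empty fields.
my @f = split_single( "dsad" . (" " x 3) . "43" );

print scalar(@f), "\n";                      # prints 4
print join('|', map { "[$_]" } @f), "\n";    # prints [dsad]|[]|[]|[43]
```

This only helps when the number of spaces between columns is the same on every line, which is not the case for fixed-width data.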
      Firstly, thank you for your quick answer.

      My input data is exactly:

      1234 2321 0 45 1st
      2122 sdsa 0 0 34
      2313 dsad 43 2nd
      1232 ffff 0 0 1st
      3213 sadf 0 34
      2133 dada 0 2nd

      As you can see, there is a different number of spaces between columns. So neither /\s/ nor /\s+/ works, because some columns contain whitespace characters. The substr function does not work either, for the same reason: substr does not see the whitespace characters and passes on to the next column. I hope I have explained my problem clearly :)

        Your data is in fixed-width fields. The third column always starts at the same character position one line after another. substr would work just fine for this. Given the example data you posted, you just need to start at the 15th position, and read two characters. In other words, my $third_col = substr $line, 15, 2;

        In fact, it's possible that you could just start at the 16th position and read a single character, but I would need to know more about the input data before I could be sure.

        Anyway, for the data you posted, this works fine:

        use v5.14;
        say unpack( 'x15A2' ) =~ s/^\s+|\s+$//gr while <DATA>;
        __DATA__
        1234 2321 0 45 1st
        2122 sdsa 0 0 34
        2313 dsad 43 2nd
        1232 ffff 0 0 1st
        3213 sadf 0 34
        2133 dada 0 2nd

        I used unpack instead of substr, but either one would work fine.
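
For comparison, a sketch of the substr variant described above. The layout built by make_row is an assumption: six leading blanks, two 4-character columns, then 3-character slots. It is chosen to agree with the offsets quoted in this thread and is not taken from the real file. The names make_row and third_col are hypothetical.

```perl
use strict;
use warnings;

# Build one row in the ASSUMED fixed-width layout (not the real file spec).
sub make_row {
    return sprintf '%10s %-4s %-3s%-3s%-3s%s', @_;
}

# Third column: two characters starting at offset 15, blanks trimmed.
sub third_col {
    my ($line) = @_;
    return '' if length($line) <= 15;   # short line: nothing at that offset
    my $col = substr $line, 15, 2;
    $col =~ s/^\s+|\s+$//g;
    return $col;
}

my @rows = (
    make_row( qw(1234 2321 0), '', 45, '1st' ),
    make_row( qw(2313 dsad), '', '', 43, '2nd' ),
);

# Prints [0] for the first row and [] for the second.
print '[', third_col($_), "]\n" for @rows;
```

Unlike split, substr reads by position, so a blank column comes back as an empty string instead of vanishing.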


        Dave

Re: To split with spaces
by ricDeez (Scribe) on Aug 04, 2013 at 22:15 UTC

    Another option is to use pipe delimited text as it allows you to visually inspect the data in any text editor. You could then do something like this:

    use v5.12;
    use warnings;
    use Data::Dump qw(ddx);
    my @fields = map { ( split /\|/ )[2] } map { chomp; $_ } <DATA>;
    ddx @fields; # test.pl:5: (0, 0, "", 0, "", 0)
    __DATA__
    1234|2321|0|45|1st
    2122|sdsa|0|0|34
    2313|dsad||43|2nd
    1232|ffff|0|0|1st
    3213|sadf||0|34
    2133|dada|0||2nd

      While that is one way to do it, the OP doesn't have pipe-delimited data. They have the format shown, and that seems to be what they must work with.
      It is possible to convert it into pipe-delimited, but then we'd be back where we are now. ;-)

      ~Thomas~ 
      "Excuse me for butting in, but I'm interrupt-driven..."
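
If pipe-delimited output is wanted anyway, the conversion could look roughly like this. It is only a sketch: it reuses the unpack template quoted earlier in the thread and an assumed row layout built by the hypothetical helper make_row, neither of which has been confirmed against the real file.

```perl
use strict;
use warnings;

# Build one row in the ASSUMED fixed-width layout (not the real file spec).
sub make_row {
    return sprintf '%10s %-4s %-3s%-3s%-3s%s', @_;
}

# Unpack one fixed-width row, strip blanks from each field,
# and join the fields with '|'.
sub to_pipes {
    my ($line) = @_;
    my @fields = unpack 'A11 A5 A3 A3 A3 A*', $line;
    s/\s+//g for @fields;
    return join '|', @fields;
}

print to_pipes( make_row( qw(2313 dsad), '', '', 43, '2nd' ) ), "\n";
# prints 2313|dsad|||43|2nd
```

The empty fields between the pipes are what make the blank columns visible again.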
Re: To split with spaces
by locked_user sundialsvc4 (Abbot) on Aug 05, 2013 at 12:31 UTC

    Here are another couple of useful tips:

    • Use hexdump or a similar tool to examine the contents of the file byte-by-byte. Don’t assume anything: a “blank space” could be tabs, spaces, or even characters that are unprintable according to the internationalization (I18N) settings of whatever tool you may happen to be using. When you are showing excerpts of such files to us, enclose them in <code> tags. You can write a program to split according to any sort of bright-line rule.
    • Once you think you have a bright-line rule, write a script to prove it. Take every assumption that you think holds true for the entire catalog of such files that you have, then write scripts that will survive only if those assumptions are correct; otherwise they die in a meaningful way. Run those scripts against a broad cross-section of the files. Run them automatically against new files that come in. (Sometimes you find that you are debugging not only the programs that you wrote to consume the files, but also the programs that other people wrote to produce them.)
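
Both tips can be sketched in a few lines of Perl. The rule checked here (no tabs, at least 16 characters per line) is only a placeholder assumption, and the names hex_bytes and check_line are hypothetical.

```perl
use strict;
use warnings;

# Show each character of a line as two hex digits, so that
# tabs (09) and spaces (20) can be told apart.
sub hex_bytes {
    my ($line) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $line);
}

# Die with a meaningful message unless the line satisfies the ASSUMED rule.
sub check_line {
    my ($line, $lineno) = @_;
    die "line $lineno: contains a tab\n" if $line =~ /\t/;
    die "line $lineno: shorter than 16 characters\n" if length($line) < 16;
    return 1;
}

print hex_bytes("0\t45"), "\n";   # prints 30 09 34 35
```

Calling check_line on every line of every incoming file turns a silent misparse into an immediate, explained failure.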

Re: To split with spaces
by Laurent_R (Canon) on Aug 05, 2013 at 22:01 UTC

    Thank you, but I tried and tested the substr and unpack functions. They are not working, because our input data is not in a fixed-column format. Some of the columns contain whitespace characters, and substr and unpack ignore these whitespace characters and pick up the next columns...

    The unpack and substr functions don't ignore white space. But given that they work with positions within the string, they may have trouble with mixtures of spaces and tabs (because a tab takes only one position in a string, but usually several on the printed line). This is at least my hypothesis #1, by far the most likely in my eyes. But you could also have some other nasty invisible characters (backspace and whatnot), which we cannot guess from the copy and paste that you have provided so far.

    We really need to know exactly and in detail what your raw (unformatted) file looks like. Either make the file available by some means so that we can download it and look at it, or possibly supply a hex dump of it (although this is less practical).

    Meanwhile, you could also try to split your records on single tabs, rather than spaces, and see what you get, by changing your original code to something like this:

    @fields = split /\t/, $line;

    It might just be the solution.
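
If the file does turn out to mix tabs and spaces, another option is to expand the tabs first so that string positions match the printed columns. This is a sketch using the core Text::Tabs module with its default tab stop of 8; the sample string and the helper name visual_line are assumptions.

```perl
use strict;
use warnings;
use Text::Tabs;   # core module; exports expand()

# Replace tabs with the spaces they stand for on the printed line.
sub visual_line {
    my ($line) = @_;
    return scalar expand($line);
}

my $raw = "2313\tdsad";

print length($raw), "\n";                        # prints 9
print length( visual_line($raw) ), "\n";         # prints 12
print substr( visual_line($raw), 8, 4 ), "\n";   # prints dsad
```

After expansion, fixed offsets for substr or unpack refer to what you see in the editor rather than to the raw tab characters.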

Re: To split with spaces
by Anonymous Monk on Aug 06, 2013 at 03:50 UTC

    Hi,

    The first thing to do is to go back to your boss and ask him for the file spec.

    Any of those blanks might, on different lines, have a number or letter in it. Back in the distant past, when disk was expensive, to save space we would put 8 1 bit flags into a 1 byte column in a fixed-width file.

    Once you have the file spec, you will know the format of the file and things will start to fall into place.

    J.C.