Janwhatever has asked for the wisdom of the Perl Monks concerning the following question:

Hi there! I'm new to perl and new to perlmonks and hoping to find answers here:) I have a large file which i need to split up into multiple files. There is a condition, the file has to be split up according to country ($elements1). The file is a CSV file and is separated by semicolon like so:
Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0; Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok +;0; Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o +k;0; Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020 +0;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126 +236;;ok;0; Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0; Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27 +1400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7, +29970;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7 +,29970;;ok;0 Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56 +1440;7,26840;;ok;0 Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok +;0;
And i want it to arrange it like this:
OUTFILE: Albanië.csv Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; OUTFILE: Andorra.csv Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; OUTFILE: België.csv België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0; OUTFILE: Engeland.csv Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok +;0; Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o +k;0; Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020 +0;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126 +236;;ok;0; OUTFILE: Finland.csv Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0; Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27 +1400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0 OUTFILE: Frankrijk.csv Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7, +29970;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7 +,29970;;ok;0 Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56 +1440;7,26840;;ok;0 Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok +;0;
These are just a few countries in the file, there are many more but this is the general idea. I split the file with @elements = split(/;/,$infile[$i]); but from there i don't know how to procede exactly. I found these topics where somebody is trying a similar thing but with digits. Where as i'm trying it with letters (the countries). Can someone point me in the right direction? with AWK? These look similar as to what i want to accomplish:

http://www.unix.com/shell-programming-scripting/167923-how-split-data-file-into-separate-files-file-names-depending-upon-columns-value.html

http://www.unix.com/shell-programming-scripting/60351-split-file-into-multiple-files-depending-upon-first-4-digits.html

Replies are listed 'Best First'.
Re: AWK? Split one file in seperate files based on country
by BrowserUk (Patriarch) on May 30, 2012 at 22:20 UTC

    A one liner:

    perl -F; -anle"$F[0] ne $last and open F, '>', $F[0].'.csv'; print F; +$last = $F[0]" junk.dat C:\test>dir *.csv 30/05/2012 23:17 204 Albani‰.csv 30/05/2012 23:17 194 Andorra.csv 30/05/2012 23:17 325 Belgi‰.csv 30/05/2012 23:17 394 Engeland.csv 30/05/2012 23:17 295 Finland.csv 30/05/2012 23:17 332 Frankrijk.csv C:\test>type Albani‰.csv Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; C:\test>type Andorra.csv Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; C:\test>type Belgi‰.csv België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      perl -F';' -ane'open F, ">>$F[0].csv" and print F' junk.dat

        Nice golfing :)

        You'd have to remember to delete the output files before a re-run; and if the input is big, reopening the output files would likely be a significant factor.

        I kept thinking about the one-arg open, but I couldn't make it work with an array element.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: AWK? Split one file in seperate files based on country
by choroba (Cardinal) on May 30, 2012 at 22:16 UTC
    Sorry, no awk. Here is a Perl solution.
    It reads the whole file into a hash, each line is stored at its country. Then, output files are created and their lines are saved to them.
    #!/usr/bin/perl use strict; use warnings; my %countries; open my $IN, '<:encoding(utf-8)', '1.csv' or die $!; while (<$IN>) { my @columns = split /;/; push @{ $countries{$columns[0]} }, $_; } for my $country (keys %countries) { open my $OUT, '>:encoding(utf-8)', "$country.csv" or die $!; for (@{ $countries{$country} }) { print {$OUT} $_; } close $OUT or die $!; }
    I used UTF-8 encoding for the accented characters. You might need to change the encoding if your input file uses a different one.
      Hi, i'm trying to understand what you created and piece by piece i'm putting it together. But when i run it i get the following error.

      utf8 "\xEB" does not map to Unicode at C:\bla\bla\bla line 9, <$IN> line XXXX.

      So i'm guessing this has something to do with the encoding. I'm not sure what kind of encoding i use. Is there a way to look this up in my file? Or is it possible to remove the encoding part? , '<:encoding(utf-8)' so it will read it as a normal file perhaps? Because i didn't need it before.

      @ BrowserUk / jwkrahn: i dont have enough experience to work with your answers i'm afraid

      edit: When i remove the encoding part i get this: No such file or directory at Z:\Data-Content\Data\test\jan\ALL_DATA\ori.pl line 13, <$IN> line 50001. (line 50001 being the end of the input file). What am i doing wrong?
        Have you removed the : as well? The open should look like
        open my $IN, '<', '1.csv' or die $!;
Re: AWK? Split one file in seperate files based on country (awkmonks)
by runrig (Abbot) on May 30, 2012 at 23:44 UTC
    Sometimes, awk is easiest:
    awk -F";" '{print >$1 ".csv"}' file.csv
Re: AWK? Split one file in seperate files based on country
by GrandFather (Saint) on May 31, 2012 at 10:02 UTC

    Depending somewhat on how the file was created it may not be sufficient to simplistically parse each line by splitting on semi-colons. If it is a well formed csv file you should use one of the CSV modules such as Text::CSV to manage parsing the file. Consider:

    use strict; use warnings; use Text::CSV qw{}; # Create a test file my $fname = 'delme.csv'; open my $fOut, '>', $fname or die "Can't create $fname: $!\n"; print $fOut <<'TESTDATA'; country_name;region;city;type;area;lat;long;etc Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0; Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok +;0; Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o +k;0; Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020 +0;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126 +236;;ok;0; Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0; Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27 +1400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7, +29970;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7 +,29970;;ok;0 Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56 +1440;7,26840;;ok;0 Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok +;0; TESTDATA close $fOut; # Now for the sample interesting code open my $fIn, '<', $fname or die "Can't open $fname: $!\n"; my $csv = Text::CSV->new({binary => 1, sep_char => ';'}); my %byCountry; my @columns = @{$csv->getline($fIn)}; push @{$byCountry{$_->[0]}}, $_ while $_ = $csv->getline($fIn); close $fIn; for my $country (sort keys %byCountry) { print "--- $country.csv---\n"; $csv->print (*STDOUT, \@columns); print "\n"; for my $row (@{$byCountry{$country}}) { $csv->print (*STDOUT, $row); print "\n"; } }

    Prints (in part):

    --- Albanië.csv--- country_name;region;city;type;area;lat;long;etc Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;"Castle Park";45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; --- Andorra.csv--- country_name;region;city;type;area;lat;long;etc Andorra;Andorra;"El Serrat";Hotel;Subi;43,658607;5,568623;;ok;0; ... Finland;Finland;Ypäjä;Bezienswaardigheid;"Loimijoki Golf";0,805300;23, +271400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;"Hetan Majatalo";6,34700;23,633800;;ok;0 --- Frankrijk.csv--- country_name;region;city;type;area;lat;long;etc Frankrijk;"Alsace (Elzas)";Rouffach;Hotel;"Au Relais D'Alsace";7,95968 +0;7,29970;;ok;0 Frankrijk;"Alsace (Elzas)";Rouffach;Hotel;"Château d Isenbourg";7,9596 +80;7,29970;;ok;0 Frankrijk;"Alsace (Elzas)";Ruederbach;Vakantiehuis;"Au Cheval Blanc";4 +7,561440;7,26840;;ok;0 Frankrijk;"Alsace (Elzas)";Sausheim;Camping;"Le fary";4,787300;7,36390 +;;ok;0;
    True laziness is hard work
Re: AWK? Split one file in seperate files based on country
by Corion (Patriarch) on May 31, 2012 at 08:37 UTC

    Also see part - split up files according to column value, which could handle the heavy lifting. It has no provisions for handling UTF-8 data or other encodings, so you might want to modify it to open the input file with the appropriate encoding:

    binmode ARGV, ':utf8'; # if your input is encoded as UTF-8

    For generating the output filenames, I prefer ASCII filenames, so I'd use Text::CleanFragment or the underlying Text::Unidecode to generate filesystem-clean filenames from values.

Re: AWK? Split one file in seperate files based on country
by Janwhatever (Novice) on May 30, 2012 at 22:51 UTC
    Wow, thanks for the quick reply to all! I'll get right into it.. Hope i can figure it out! thx!