AWK? Split one file in seperate files based on country

Janwhatever has asked for the wisdom of the Perl Monks concerning the following question:

Hi there! I'm new to perl and new to perlmonks and hoping to find answers here:) I have a large file which i need to split up into multiple files. There is a condition, the file has to be split up according to country ($elements1). The file is a CSV file and is separated by semicolon like so:


Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0;
Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0;
Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0;
Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0;
Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0;
Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0;
België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren);
+51,2000;4,399000;;ok;0;
België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0;
België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000
+;;ok;0;
België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000
+;;ok;0;
Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok
+;0;
Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o
+k;0;
Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0;
Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020
+0;-0,126236;;ok;0;
Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126
+236;;ok;0;
Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0;
Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27
+1400;;ok;0
Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok;
+0
Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0
Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7,
+29970;;ok;0
Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7
+,29970;;ok;0
Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56
+1440;7,26840;;ok;0
Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok
+;0;
[download]

And i want it to arrange it like this:

 
OUTFILE: Albanië.csv
Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0;
Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0;
Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0;

OUTFILE: Andorra.csv
Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0;
Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0;
Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0;

OUTFILE: België.csv
België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren);
+51,2000;4,399000;;ok;0;
België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0;
België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000
+;;ok;0;
België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000
+;;ok;0;

OUTFILE: Engeland.csv
Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok
+;0;
Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o
+k;0;
Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0;
Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020
+0;-0,126236;;ok;0;
Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126
+236;;ok;0;

OUTFILE: Finland.csv
Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0;
Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27
+1400;;ok;0
Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok;
+0
Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0

OUTFILE: Frankrijk.csv
Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7,
+29970;;ok;0
Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7
+,29970;;ok;0
Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56
+1440;7,26840;;ok;0
Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok
+;0;
[download]

These are just a few countries in the file, there are many more but this is the general idea. I split the file with @elements = split(/;/,$infile[$i]); but from there i don't know how to procede exactly. I found these topics where somebody is trying a similar thing but with digits. Where as i'm trying it with letters (the countries). Can someone point me in the right direction? with AWK? These look similar as to what i want to accomplish:

http://www.unix.com/shell-programming-scripting/167923-how-split-data-file-into-separate-files-file-names-depending-upon-columns-value.html

http://www.unix.com/shell-programming-scripting/60351-split-file-into-multiple-files-depending-upon-first-4-digits.html

Comment on AWK? Split one file in seperate files based on country Select or Download Code

Replies are listed 'Best First'.
Re: AWK? Split one file in seperate files based on country by BrowserUk (Patriarch) on May 30, 2012 at 22:20 UTC
A one liner: perl -F; -anle"$F[0] ne $last and open F, '>', $F[0].'.csv'; print F; +$last = $F[0]" junk.dat C:\test>dir *.csv 30/05/2012 23:17 204 Albani‰.csv 30/05/2012 23:17 194 Andorra.csv 30/05/2012 23:17 325 Belgi‰.csv 30/05/2012 23:17 394 Engeland.csv 30/05/2012 23:17 295 Finland.csv 30/05/2012 23:17 332 Frankrijk.csv C:\test>type Albani‰.csv Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; C:\test>type Andorra.csv Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; C:\test>type Belgi‰.csv België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0; [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l]
Re^2: AWK? Split one file in seperate files based on country by jwkrahn (Abbot) on May 31, 2012 at 01:58 UTC
`perl -F';' -ane'open F, ">>$F[0].csv" and print F' junk.dat` [download]	[reply] [d/l]
Re^3: AWK? Split one file in seperate files based on country by BrowserUk (Patriarch) on May 31, 2012 at 09:46 UTC
Nice golfing :) You'd have to remember to delete the output files before a re-run; and if the input is big, reopening the output files would likely be a significant factor. I kept thinking about the one-arg open, but I couldn't make it work with an array element. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply]
Re^4: AWK? Split one file in seperate files based on country by jwkrahn (Abbot) on May 31, 2012 at 10:53 UTC
Re: AWK? Split one file in seperate files based on country by choroba (Cardinal) on May 30, 2012 at 22:16 UTC
Sorry, no awk. Here is a Perl solution. It reads the whole file into a hash, each line is stored at its country. Then, output files are created and their lines are saved to them. `#!/usr/bin/perl use strict; use warnings; my %countries; open my $IN, '<:encoding(utf-8)', '1.csv' or die $!; while (<$IN>) { my @columns = split /;/; push @{ $countries{$columns[0]} }, $_; } for my $country (keys %countries) { open my $OUT, '>:encoding(utf-8)', "$country.csv" or die $!; for (@{ $countries{$country} }) { print {$OUT} $_; } close $OUT or die $!; }` [download] I used UTF-8 encoding for the accented characters. You might need to change the encoding if your input file uses a different one.	[reply] [d/l]
Re^2: AWK? Split one file in seperate files based on country by Janwhatever (Novice) on May 31, 2012 at 08:28 UTC
Hi, i'm trying to understand what you created and piece by piece i'm putting it together. But when i run it i get the following error. `utf8 "\xEB" does not map to Unicode at C:\bla\bla\bla line 9, <$IN> line XXXX.` So i'm guessing this has something to do with the encoding. I'm not sure what kind of encoding i use. Is there a way to look this up in my file? Or is it possible to remove the encoding part? `, '<:encoding(utf-8)'` so it will read it as a normal file perhaps? Because i didn't need it before. @ BrowserUk / jwkrahn: i dont have enough experience to work with your answers i'm afraid edit: When i remove the encoding part i get this: No such file or directory at Z:\Data-Content\Data\test\jan\ALL_DATA\ori.pl line 13, <$IN> line 50001. (line 50001 being the end of the input file). What am i doing wrong?	[reply] [d/l] [select]
Re^3: AWK? Split one file in seperate files based on country by choroba (Cardinal) on May 31, 2012 at 08:41 UTC
Have you removed the : as well? The open should look like `open my $IN, '<', '1.csv' or die $!;` [download]	[reply] [d/l]
Re^4: AWK? Split one file in seperate files based on country by Janwhatever (Novice) on May 31, 2012 at 09:02 UTC
Re^5: AWK? Split one file in seperate files based on country by choroba (Cardinal) on May 31, 2012 at 09:11 UTC
Some notes below your chosen depth have not been shown here
Re: AWK? Split one file in seperate files based on country (awkmonks) by runrig (Abbot) on May 30, 2012 at 23:44 UTC
Sometimes, awk is easiest: `awk -F";" '{print >$1 ".csv"}' file.csv` [download]	[reply] [d/l]
Re: AWK? Split one file in seperate files based on country by GrandFather (Saint) on May 31, 2012 at 10:02 UTC
Depending somewhat on how the file was created it may not be sufficient to simplistically parse each line by splitting on semi-colons. If it is a well formed csv file you should use one of the CSV modules such as Text::CSV to manage parsing the file. Consider: use strict; use warnings; use Text::CSV qw{}; # Create a test file my $fname = 'delme.csv'; open my $fOut, '>', $fname or die "Can't create $fname: $!\n"; print $fOut <<'TESTDATA'; country_name;region;city;type;area;lat;long;etc Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Castle Park;45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; Andorra;Andorra;El Serrat;Hotel;Subi;43,658607;5,568623;;ok;0; Andorra;Andorra;El Serrat;Hotel;Tristan;42,618507;1,538923;;ok;0; Andorra;Andorra;El Tarter;Hotel;Del;42,580340;1,648919;;ok;0; België;Antwerpen;Antwerpen;Bezienswaardigheid;Boerentoren (Kbc Toren); +51,2000;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Borze;51,2200;4,399000;;ok;0; België;Antwerpen;Antwerpen;Bed&Breakfast;Boulevard ;51,220000;4,399000 +;;ok;0; België;Antwerpen;Antwerpen;Appartement;Britselei 37;51,220000;4,399000 +;;ok;0; Engeland;Groot Londen;Londen;Museum;Tate Britain;51,5200;-0,126236;;ok +;0; Engeland;Groot Londen;Londen;Museum;Tate Modern;51,500200;-0,126236;;o +k;0; Engeland;Groot Londen;Londen;Hotel;Testing;51,500200;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Bezienswaardigheid;Thames (rivier);51,020 +0;-0,126236;;ok;0; Engeland;Groot Londen;Londen;Hotel;The Archery London;51,500200;-0,126 +236;;ok;0; Finland;Finland;Ylläsjärvi;Hotel;Yllasrinne;67,526750;24,275630;;ok;0; Finland;Finland;Ypäjä;Bezienswaardigheid;Loimijoki Golf;0,805300;23,27 +1400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;Hetan Majatalo;6,34700;23,633800;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Au Relais D'Alsace;7,959680;7, +29970;;ok;0 Frankrijk;Alsace (Elzas);Rouffach;Hotel;Château d Isenbourg;7,959680;7 +,29970;;ok;0 Frankrijk;Alsace (Elzas);Ruederbach;Vakantiehuis;Au Cheval Blanc;47,56 +1440;7,26840;;ok;0 Frankrijk;Alsace (Elzas);Sausheim;Camping;Le fary;4,787300;7,36390;;ok +;0; TESTDATA close $fOut; # Now for the sample interesting code open my $fIn, '<', $fname or die "Can't open $fname: $!\n"; my $csv = Text::CSV->new({binary => 1, sep_char => ';'}); my %byCountry; my @columns = @{$csv->getline($fIn)}; push @{$byCountry{$_->[0]}}, $_ while $_ = $csv->getline($fIn); close $fIn; for my $country (sort keys %byCountry) { print "--- $country.csv---\n"; $csv->print (STDOUT, \@columns); print "\n"; for my $row (@{$byCountry{$country}}) { $csv->print (STDOUT, $row); print "\n"; } } [download] Prints (in part): --- Albanië.csv--- country_name;region;city;type;area;lat;long;etc Albanië;Albanië;Berat;Hotel;Berati;40,9999;19,99999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;"Castle Park";45,769969;19,9999;Meer;ok;0; Albanië;Albanië;Berat;Hotel;Mangalemi;40,709999;19,959999;Meer;ok;0; --- Andorra.csv--- country_name;region;city;type;area;lat;long;etc Andorra;Andorra;"El Serrat";Hotel;Subi;43,658607;5,568623;;ok;0; ... Finland;Finland;Ypäjä;Bezienswaardigheid;"Loimijoki Golf";0,805300;23, +271400;;ok;0 Finland;Finland;Ytteresse;Vakantiepark;Solhaga;0,630700;22,962500;;ok; +0 Finland;Lapland;Hetta;Hotel;"Hetan Majatalo";6,34700;23,633800;;ok;0 --- Frankrijk.csv--- country_name;region;city;type;area;lat;long;etc Frankrijk;"Alsace (Elzas)";Rouffach;Hotel;"Au Relais D'Alsace";7,95968 +0;7,29970;;ok;0 Frankrijk;"Alsace (Elzas)";Rouffach;Hotel;"Château d Isenbourg";7,9596 +80;7,29970;;ok;0 Frankrijk;"Alsace (Elzas)";Ruederbach;Vakantiehuis;"Au Cheval Blanc";4 +7,561440;7,26840;;ok;0 Frankrijk;"Alsace (Elzas)";Sausheim;Camping;"Le fary";4,787300;7,36390 +;;ok;0; [download] True laziness is hard work	[reply] [d/l] [select]
Re: AWK? Split one file in seperate files based on country by Corion (Patriarch) on May 31, 2012 at 08:37 UTC
Also see part - split up files according to column value, which could handle the heavy lifting. It has no provisions for handling UTF-8 data or other encodings, so you might want to modify it to open the input file with the appropriate encoding: `binmode ARGV, ':utf8'; # if your input is encoded as UTF-8` [download] For generating the output filenames, I prefer ASCII filenames, so I'd use Text::CleanFragment or the underlying Text::Unidecode to generate filesystem-clean filenames from values.	[reply] [d/l]
Re: AWK? Split one file in seperate files based on country by Janwhatever (Novice) on May 30, 2012 at 22:51 UTC
Wow, thanks for the quick reply to all! I'll get right into it.. Hope i can figure it out! thx!	[reply]