Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Help Sorting a CSV File

by NorthernFox (Initiate)
on Oct 04, 2019 at 14:33 UTC ( [id://11107044]=perlquestion: print w/replies, xml ) Need Help??

NorthernFox has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm hoping someone can help me, I've never used Perl before and I am more used to SQL and MS Access, I've just inherited a system that can incorporate Perl scripts to enable sorting of data etc. I have a CSV file that I have created from another format of file and it looks something like the following.

CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE
CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24

The 0001P and 0002P are price bands and the ABC123... etc are product codes.

I want to sort the so that all of the price band lines are grouped together.

So what I want to end up with is:

CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE
CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24

I have the following basic script that can re-sort based on the column number in the CSV file which is:

=======================================================================================================

use strict;
use warnings;

my $fSource;
my $fDest;

if (! open($fSource, "<", $ARGV[0])) {
printf("Unable to open file %s to read", $ARGV[0]);
exit(1);
}
if (! open($fDest, ">", $ARGV1)) {
printf("Unable to open file %s to write\n", $ARGV1);
exit(1);
}

# Sort the file

print("Sorting data by price band\n");
print $fDest sort { (split ',', $a)1 cmp (split ',', $b)1 } <$fSource>;
print("Sort complete\n");

close($fSource);
close($fDest);
=======================================================================================================

This almost works but it includes my first record so what I end up with is:

CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36
CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24
CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24
CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE

I need the header record to always be the first, so essentially I need my script to ignore/skip the first record.

Any help would be appreciated - thank you.

Replies are listed 'Best First'.
Re: Help Sorting a CSV File
by hippo (Bishop) on Oct 04, 2019 at 14:50 UTC
    This almost works but it includes my first record

    Notwithstanding 1nickt's excellent advice (which you should heed) there's a trivial adjustment to your script which should do what you want. Just read and print the first line before you sort the rest.

    print $fDest <$fSource>;

    Untested* but simple enough that hopefully the gist is obvious. BTW, it will help your cause for future posts if you wrap your code in <code> ... </code> tags as I have done here.

    *Update: As AnomalousMonk has pointed out below (++) you'll need to enforce scalar context here.

      print $fDest <$fSource>;

      Reading from  <> (see readline) in the list context imposed by the argument list of print will read all lines from the filehandle. For reading just one line, use something like:
          print $fDest scalar <$fSource>;
      to impose scalar context (see scalar).


      Give a man a fish:  <%-{-{-{-<

Re: Help Sorting a CSV File
by clueless newbie (Curate) on Oct 04, 2019 at 15:09 UTC
    If you're doing a lot of CSV stuff you probably want to look at DBD::CSV.
    use DBI; use 5.014; use warnings; # See "Creating database handle" below my $dbh = DBI->connect ("dbi:CSV:", undef, undef, { f_ext => ".csv/r", RaiseError => 1, }) or die "Cannot connect: $DBI::errstr"; my $sth=$dbh->prepare("select * from data order by SUPPLIER_CODE;"); $sth->execute(); local $"=q{,}; while (my $values_aref=$sth->fetchrow_arrayref ) { #warn Data::Dumper->Dump([\$values_aref],[qw(*values)]),' '; print "@$values_aref\n"; };

      THIS. If you have data, treat it as data. It's a bit more work to set up the first time, but after that it's so much easier - switching from csv to sqlite to mysql to Db2 to ... doesn't matter, it's so much easier once you see your data as actual data rather than a file format to fight with.

      Life has taken me far afield from Perl (mostly C# and Javascript nowadays), but there's so much I've learned in perl that is relatively straight forward in other languages. Except this one: treating a csv file as a source for SQL without having to import it into an actual database system. There's lots I miss about perl, and many things in other languages that are the same (with a different, usually longer, syntax), but this one tops my list. You're using perl, take advantage of it.

        I agree.

        However, there is a another step in-between text and database, in the form of a "Foreign Data Wrapper", to let the database serve up the underlying (csv-)textfile:

        psql -qX << SQL_TXT create server if not exists temp_fs foreign data wrapper file_fdw; drop foreign table if exists ft_pm2 cascade; create foreign table ft_pm2 ( "CAT_HEADER" text , "SUPPLIER_CODE" text , "CUSTOMER_CODE" text , "F4" text , "F5" text , "F6" text , "F7" text , "F8" numeric ) server temp_fs options ( format 'csv' , header 'true' , filename '/tmp/data.txt' ); copy(select * from ft_pm2 order by 2, 3) to stdout with (format csv, h +eader true); SQL_TXT

        Results are (again):

        CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE,F4,F5,F6,F7,F8 CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.14 CAT_LINE,0001P,ABC34567,20190924,,1,Z,12.23 CAT_LINE,0002P,ABC12345,20190924,,1,Z,4.26 CAT_LINE,0002P,ABC23456,20190924,,1,Z,1.21 CAT_LINE,0002P,ABC34567,20190924,,1,Z,22.24

        Such SQL using a 'Foreign Table' reads the underlying csv/text file(s). The annoying part is setting up the table , for instance in the case of tables with hundreds of columns you need a separate little application to do that, and this is where perl comes in handy (reading the header line and turning it into the CREATE TABLE column-list).

        Weaknesses and strengths:

        Advantages: SQL access to csv data. DBD::CSV delivers SQL as in SQL::Statement::Syntax which is nice but basic. The Data Wrapper's Foreign Table gives you access to the data via the full postgres SQL engine: Window Functions, Grouping Sets/Cube/Rollup, Generated Columns, Partitioning , etc., etc.)

        Disadvantages: Needs Postgres, and with extension file_fdw installed. Table remains read-only. No indexing possible, so that huge csv-files can make it slow (alhough 'copying' onto a materialized view [3] on the foreign table makes of course indexing possible again). Filesystem access for the postgres superuser is necessary.

        It all depends what you want to do.

        [1] file_fdw - this foreign data wrapper reads text files

        [2] Create Server

        [3] Create Foreign Table

        [4] Create Materialized View

Re: Help Sorting a CSV File
by 1nickt (Canon) on Oct 04, 2019 at 14:40 UTC

    Hi welcome to Perl, the One True Religion.

    Use the right tools.

    Sure, you can do it with split. You can hang yourself if you have enough rope, which Perl will happily give you. Other forebears have made a simple, safe and powerful tool for most standard tasks. Do not buy the lie that it's too much work or too difficult or not permitted or not needed or not performant enough to go to the CPAN and use the right tool for the job.

    Use the right tool, and dedicate your time to solving your own program/logic flow.

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Help Sorting a CSV File
by AnomalousMonk (Archbishop) on Oct 04, 2019 at 17:09 UTC

    There are utilities available outside of Perl that can do what the OP is asking about: *nix sort (which can do multi-field/multi-key sorting and may be able to handle very large files), head, tail and cat. These are also available as "Unix utilities for Windows", which is what I'm using, but you'll have to Google for an appropriate package of these binaries. Here's a Windows command-line session using your OPed example .csv file (BTW: Code tags should be used for input/output and data as well as code):

    I'm sure someone with more shell-fu than I can pipe this all together into a one-liner, and you can always put it into a shell script.


    Give a man a fish:  <%-{-{-{-<

Re: Help Sorting a CSV File
by dbuckhal (Chaplain) on Oct 05, 2019 at 05:16 UTC
    maybe?
    #!/usr/bin/perl use strict; use warnings; my $file = shift or die; # read csv file... my $hash; open my $fh, "<", $file; my $result = <$fh>; # grab header... while ( my $line = <$fh> ) { # then read rest of file. push @{$hash->{ [ split /,/, $line ]->[1] }}, $line; } $result .= join '', @{$hash->{$_}} for sort keys $hash; print $result; __output__ CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24
Re: Help Sorting a CSV File
by erix (Prior) on Oct 04, 2019 at 20:28 UTC

    Admittedly just for my own fun I did it via a database (postgres 12.0). Don't do this at home --- although it isn't actually too bad, and only /just/ a bit more than a one-liner... I added the missing column names as I assume they are normally there in the real data.

    #!/bin/bash echo "CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE,F4,F5,F6,F7,F8 CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24" > data.txt head -n 1 data.txt | perl -ne 'chomp; print " drop table if exists pm11107044 ; create table pm11107044 (" . join(",", map {"\"$_\" text"} split(/,/, +$_)) . ");"; ' | psql -qX && < data.txt psql -qXc "copy pm11107044 from stdin with +(format csv, header true);" echo "-- unordered data.txt:" cat data.txt echo echo "-- ordered data:" echo "select * from pm11107044 order by 2, 3" | psql -qX --csv # | m +d5sum echo # older psql doesn't have --csv (introduced in postgres 13); in that + case use: # echo "copy(select * from pm11107044 order by 2, 3) to stdout with (f +ormat csv, delimiter ',', header true)" | psql -qX

    Output:

    ./pm.pl -- unordered data.txt: CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE,F4,F5,F6,F7,F8 CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24 -- ordered data: CAT_HEADER,SUPPLIER_CODE,CUSTOMER_CODE,F4,F5,F6,F7,F8 CAT_LINE,0001P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0001P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0001P,ABC34567,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC12345,20190924,,1,Z,3.36 CAT_LINE,0002P,ABC23456,20190924,,1,Z,2.24 CAT_LINE,0002P,ABC34567,20190924,,1,Z,2.24

    ( Another (actually more appropriate) database way would be to read the data file directly as text via file_fdw. Maybe I'll have a go at that tomorrow. )

    Edit: I just realised that --csv is in postgresql 13 -- the alternative for earlier versions is mentioned ( COPY (select ...) ... )

Re: Help Sorting a CSV File
by dasgar (Priest) on Oct 09, 2019 at 14:47 UTC

    What came to my mind was to use Data::Table. It will import from a CSV, can handle a header row, and it can sort by one or more columns. If I understood your post correctly, then it looks like this module will do everything that you're wanting to do.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11107044]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (6)
As of 2024-03-29 13:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found