Re^6: Hash of Hashes from file

Replies are listed 'Best First'.
Re^7: Hash of Hashes from file by Cristoforo (Curate) on Apr 06, 2012 at 02:17 UTC
I got the following output:

    C:\Old_Data\perlp>perl t33.pl
    david
            Website: www.facebook.com, Category: Social Networking
    john
            Website: www.yahoo.com, Category: Entertainment
            Website: www.yahoo.com, Category: Entertainment
            Website: www.yahoo.com, Category: Entertainment
            Website: www.facebook.com, Category: Social Networking
    mike
            Website: www.google.com, Category: Search Engines


    Name: john
            Website Count
            www.yahoo.com       3
            www.facebook.com    1

            Type Count
            Entertainment       3
            Social Networking   1


    Name: mike
            Website Count
            www.google.com      1

            Type Count
            Search Engines      1


    Name: david
            Website Count
            www.facebook.com    1

            Type Count
            Social Networking   1

From this data:

    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="david" website="www.facebook.com" type="Social Networking"
    user="john" website="www.facebook.com" type="Social Networking"
    user="mike" website="www.google.com" type="Search Engines"

Notice that there are quotes surrounding every field. The regular expression that captures these fields from the file would need to be changed if that's not the case.

In my program I use 2 hashes: `%count`, to count the number of sites visited by each user, and `%data`, to count each address and category (by user). It seems to work OK for this small data set.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%data, %count);

    while (<DATA>) {
        my ($user, $site, $cat) = /"([^"]+)"/g;
        $data{$user}{ qq{$site$;$cat} }++;   # compound key: site and category
        $count{$user}++;                     # total visits per user
    }

    for my $user (sort keys %data) {
        my $href = $data{$user};
        print $user, "\n";
        for my $key (keys %$href) {
            my $str = sprintf "\tWebsite: %s, Category: %s\n", split /$;/, $key;
            print $str x $href->{$key};      # one line per recorded visit
        }
    }

    my @ordered = sort {$count{$b} <=> $count{$a}} keys %count;
    print "\n\n";

    for my $user (@ordered) {
        my $href = $data{$user};
        print "Name: $user\n\tWebsite Count\n";
        for my $key (sort {$href->{$b} <=> $href->{$a}} keys %$href) {
            printf "\t%-20s%d\n", (split /$;/, $key)[0], $href->{$key};
        }
        print "\n";
        print "\tType Count\n";
        for my $key (sort {$href->{$b} <=> $href->{$a}} keys %$href) {
            printf "\t%-20s%d\n", (split /$;/, $key)[1], $href->{$key};
        }
        print "\n\n";
    }

The line `$data{$user}{ qq{$site$;$cat} }++;` uses a 'compound' key ($site and $cat joined by $;).

Here is a dump of `%data`.

    $VAR1 = {
              'john' => {
                          'www.yahoo.com∟Entertainment' => 3,
                          'www.facebook.com∟Social Networking' => 1
                        },
              'mike' => {
                          'www.google.com∟Search Engines' => 1
                        },
              'david' => {
                           'www.facebook.com∟Social Networking' => 1
                         }
            };

Update: Whoops, that doesn't count the categories correctly :-( If another site had the same category, its visits wouldn't be added to that category's total, because each site/category pair is counted under its own key.
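To make the problem in the update concrete, here is a minimal sketch (not the code from the post) that keeps one counter per site and a separate counter per category for each user. The sub name `tally` and the second sample site, www.imdb.com, are made up for illustration; the input format is assumed to be the same quoted key="value" lines used in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tally sites and categories separately per user,
# so two different sites in the same category add to one category total.
sub tally {
    my @lines = @_;
    my %data;
    for my $line (@lines) {
        my ($user, $site, $cat) = $line =~ /"([^"]+)"/g;
        next unless defined $cat;    # skip lines without three quoted fields
        $data{$user}{site}{$site}++;
        $data{$user}{type}{$cat}++;
    }
    return \%data;
}

# Made-up sample: two different sites share the "Entertainment" category.
my $data = tally(
    'user="john" website="www.yahoo.com" type="Entertainment"',
    'user="john" website="www.imdb.com" type="Entertainment"',
);

for my $user (sort keys %$data) {
    for my $field (qw(site type)) {
        my $cnt = $data->{$user}{$field};
        # highest count first, names alphabetical on ties
        printf "%s %s %-20s%d\n", $user, $field, $_, $cnt->{$_}
            for sort { $cnt->{$b} <=> $cnt->{$a} || $a cmp $b } keys %$cnt;
    }
}
```

With the compound key, the two lines above would produce two separate entries of 1 each; with separate counters the category total for Entertainment comes out as 2 while each site still counts 1.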
Re^8: Hash of Hashes from file by Cristoforo (Curate) on Apr 06, 2012 at 15:08 UTC
Think I got it this time! I used the data structure that scorpio17 used (I printed out the dump here also).

Output:

    C:\Old_Data\perlp>perl t33.pl
    $VAR1 = {
              'john' => {
                          'site' => [
                                      'www.yahoo.com',
                                      'www.yahoo.com',
                                      'www.yahoo.com',
                                      'www.facebook.com'
                                    ],
                          'type' => [
                                      'Entertainment',
                                      'Entertainment',
                                      'Entertainment',
                                      'Social Networking'
                                    ]
                        },
              'mike' => {
                          'site' => [
                                      'www.google.com'
                                    ],
                          'type' => [
                                      'Search Engines'
                                    ]
                        },
              'david' => {
                           'site' => [
                                       'www.facebook.com'
                                     ],
                           'type' => [
                                       'Social Networking'
                                     ]
                         }
            };
    david
            Website: www.facebook.com, Category: Social Networking
    john
            Website: www.yahoo.com, Category: Entertainment
            Website: www.yahoo.com, Category: Entertainment
            Website: www.yahoo.com, Category: Entertainment
            Website: www.facebook.com, Category: Social Networking
    mike
            Website: www.google.com, Category: Search Engines


    Name: john
            Website Count
            www.yahoo.com       3
            www.facebook.com    1

            Type Count
            Entertainment       3
            Social Networking   1


    Name: mike
            Website Count
            www.google.com      1

            Type Count
            Search Engines      1


    Name: david
            Website Count
            www.facebook.com    1

            Type Count
            Social Networking   1



    C:\Old_Data\perlp>

And here is the code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;

    while (<DATA>) {
        my ($user, $site, $cat) = /"([^"]+)"/g;
        push @{ $data{$user}{site} }, $site;
        push @{ $data{$user}{type} }, $cat;
    }

    for my $user (sort keys %data) {
        my $site_ary = $data{$user}{site};
        my $type_ary = $data{$user}{type};
        print $user, "\n";
        for my $i (0 .. $#$site_ary) {
            printf "\tWebsite: %s, Category: %s\n", $site_ary->[$i], $type_ary->[$i];
        }
    }
    print "\n\n";

    for my $user (sort by_count_desc keys %data) {
        my $site_ary = $data{$user}{site};
        my $type_ary = $data{$user}{type};
        my (%site_cnt, %type_cnt);
        $site_cnt{$_}++ for @$site_ary;
        $type_cnt{$_}++ for @$type_ary;

        print "Name: $user\n\tWebsite Count\n";
        for my $site (sort {$site_cnt{$b} <=> $site_cnt{$a}} keys %site_cnt) {
            printf "\t%-20s%d\n", $site, $site_cnt{$site};
        }
        print "\n";
        print "\tType Count\n";
        for my $type (sort {$type_cnt{$b} <=> $type_cnt{$a}} keys %type_cnt) {
            printf "\t%-20s%d\n", $type, $type_cnt{$type};
        }
        print "\n\n";
    }

    sub by_count_desc {
        @{$data{$b}{site}} <=> @{$data{$a}{site}};
    }

    __DATA__
    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="david" website="www.facebook.com" type="Social Networking"
    user="john" website="www.facebook.com" type="Social Networking"
    user="mike" website="www.google.com" type="Search Engines"

Hope this helps,
Chris

Update: Added sub by_count_desc.
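The programs above pick out the three quoted values by position, so they assume every line lists user, website and type in that order. As a hedged alternative (a sketch, not something from the thread), the fields could be captured by name instead; `parse_record` is a hypothetical helper, and the reordered sample line is made up to show the point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a hashref of field name => value for one line,
# or nothing if any of the three expected fields is missing or empty.
sub parse_record {
    my ($line) = @_;
    # In list context, /g returns every (name, value) pair found on the line.
    my %rec = $line =~ /(\w+)="([^"]*)"/g;
    for my $field (qw(user website type)) {
        return unless defined $rec{$field} && length $rec{$field};
    }
    return \%rec;
}

# Made-up line with the fields in a different order than the sample data.
my $rec = parse_record('type="Search Engines" user="mike" website="www.google.com"');
print "$rec->{user} visited $rec->{website} ($rec->{type})\n" if $rec;
```

This still requires quotes around every value; only the dependence on field order goes away.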
Re^9: Hash of Hashes from file by Cristoforo (Curate) on Apr 14, 2012 at 14:54 UTC
Hi, I thought I'd post a database solution. It's not really necessary, as the Perl code solution works. The advantage of loading the data into a database shows up when your file is too large to fit in memory. Also, if you wanted to see different views of the data, it would probably be easier to write an SQL query than to write another program.

I can't vouch for the SQL here (I don't use it often), but it did produce results similar to the Perl program above.

You could run it if you had the DBI and DBD::SQLite modules on your system.

The first program creates the database and the second program runs the queries.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=users.lite","","",
        {PrintError => 1, AutoCommit => 0}) or die "Can't connect";

    # IF EXISTS avoids an error on the first run, when there is no table yet
    $dbh->do('DROP TABLE IF EXISTS users');
    $dbh->do(qq{ CREATE TABLE users
                (user TEXT,
                 site TEXT,
                 type TEXT)
                });

    my $sql_fmt = "INSERT INTO users VALUES(?,?,?)";
    while(<DATA>) {
        $dbh->do($sql_fmt, {}, /"([^"]+)"/g);
        $dbh->commit if $. % 1_000_000 == 0; # commit every 1,000,000
    }

    $dbh->commit;
    $dbh->disconnect;

    __DATA__
    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="john" website="www.yahoo.com" type="Entertainment"
    user="david" website="www.facebook.com" type="Social Networking"
    user="john" website="www.facebook.com" type="Social Networking"
    user="mike" website="www.google.com" type="Search Engines"

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=users.lite","","",
        {PrintError => 1, AutoCommit => 0}) or die "Can't connect";

    # Prepare and print list of all websites to every user
    my $sth = $dbh->prepare(<<SQL);
    SELECT *
    FROM users
    ORDER BY user, site
    SQL

    $sth->execute;
    while(my @row = $sth->fetchrow_array) {
        printf "%-15s%-20s%s\n", @row;
    }
    print "\n";

    # Create list of users from most visits to least for @users array
    $sth = $dbh->prepare(<<SQL);
    SELECT user, COUNT(user) Count
    FROM users
    GROUP BY user
    ORDER BY Count DESC, user
    SQL

    $sth->execute;
    my @users;
    while(my @row = $sth->fetchrow_array) {
        push @users, $row[0];
    }

    # Counts for each website and counts of categories visited by user.
    # The ? placeholder keeps a user name containing a quote from breaking the SQL.
    for my $user (@users) {
        $sth = $dbh->prepare(qq{SELECT site, COUNT(site) Count
                                FROM users
                                WHERE user = ?
                                GROUP BY site
                                ORDER BY Count DESC
                        });
        $sth->execute($user);
        printf "Name: %s\n\t%-20s%s\n", $user, qw/ Website Count /;
        while(my @row = $sth->fetchrow_array) {
            printf "\t%-20s%s\n", @row;
        }
        print "\n";

        printf "\t%-20s%s\n", qw/ Category Count /;
        $sth = $dbh->prepare(qq{SELECT type, COUNT(type) Count
                                FROM users
                                WHERE user = ?
                                GROUP BY type
                                ORDER BY Count DESC
                        });
        $sth->execute($user);
        while(my @row = $sth->fetchrow_array) {
            printf "\t%-20s%s\n", @row;
        }
        print "\n";
    }

    $dbh->disconnect;

Chris

Update: Re-wrote the query in loop of '@users'.
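The query script above runs two queries for every user. As an illustration of the "different views of the data" point, here is a rough sketch that gets every user's per-site counts from one grouped query. It assumes the same `users` table layout and the DBI and DBD::SQLite modules; the helper name `site_counts`, the in-memory database and the shortened sample rows are choices made for this sketch, not part of the original programs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical helper: per-user, per-site visit counts in a single query.
# Returns an arrayref of [user, site, visits] rows.
sub site_counts {
    my ($dbh) = @_;
    return $dbh->selectall_arrayref(q{
        SELECT user, site, COUNT(*) AS visits
        FROM users
        GROUP BY user, site
        ORDER BY user, visits DESC, site
    });
}

# In-memory database so the sketch leaves no file behind (an assumption;
# the programs above use users.lite).
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE users (user TEXT, site TEXT, type TEXT)');

# A few rows taken from the thread's sample data.
my $ins = $dbh->prepare('INSERT INTO users VALUES (?,?,?)');
$ins->execute(@$_) for
    [ 'john',  'www.yahoo.com',    'Entertainment'     ],
    [ 'john',  'www.yahoo.com',    'Entertainment'     ],
    [ 'john',  'www.facebook.com', 'Social Networking' ],
    [ 'david', 'www.facebook.com', 'Social Networking' ];

printf "%-10s%-20s%d\n", @$_ for @{ site_counts($dbh) };
$dbh->disconnect;
```

Unlike the loop above, this orders users by name rather than by total visits; ordering by total would need a further join or subquery.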
Re^9: Hash of Hashes from file by cipher (Acolyte) on Apr 09, 2012 at 12:04 UTC
Chris, this was really helpful, especially for someone like me who is new to Perl hashes. Thanks a lot; your help is really appreciated.



