Re: Managing a graph with large number of nodes
by bart (Canon) on Nov 01, 2009 at 22:12 UTC
That sounds like a typical application for a tied hash to me: the replacement for perl4's dbmopen and related functions.
As for the backend, you can choose among the various backends supported by AnyDBM_File; BerkeleyDB, in particular BerkeleyDB::Hash, might be a good choice.
The advantage is that no rewriting of your code is necessary: it's just a hash. You will likely lose raw speed, though.
If you don't mind rewriting, you might want to look into DBD::SQLite as a backend for DBI, which gives you an SQL interface to your data.
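A minimal sketch of the tied-hash idea, using the core AnyDBM_File module mentioned above (the file name and record layout are illustrative; AnyDBM_File simply picks whichever DBM backend is installed):

```perl
use strict;
use warnings;
use Fcntl;
use AnyDBM_File;   # picks the first available DBM backend (SDBM, GDBM, BerkeleyDB, ...)

# 'graph_nodes' is an illustrative file name; the backend may add extensions.
my %node;
tie %node, 'AnyDBM_File', 'graph_nodes', O_RDWR|O_CREAT, 0644
    or die "Cannot tie: $!";

# Drop-in hash usage; each store goes to disk instead of RAM.
$node{42} = join ',', 1, 3.2, 'a short string';
my ($id, $weight, $label) = split /,/, $node{42}, 3;
print "$label\n";   # a short string

untie %node;
```

The rest of the code never knows the hash lives on disk, which is the "no rewriting necessary" advantage.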
Re: Managing a graph with large number of nodes
by BrowserUk (Patriarch) on Nov 01, 2009 at 23:26 UTC
What does a node look like?
If performance is important, before you consider paying the penalties of constructing, updating and traversing a disk-based graph, consider using a more memory efficient data structure.
If you store 10 million of the following simple nodes as:
- Array of hashes:
>perl -E"$a[$_] = {a=>1,b=>3.2,c=>'a short string'} for 1..10e6; <>"
It requires ~3.9 GB.
- Array of arrays:
>perl -E"$a[$_] = [1,3.2,'a short string'] for 1..10e6; <>"
~2.8 GB.
- Array of simple (or packed) strings:
>perl -E" $a[$_] = qq[1,3.2,'a short string'] for 1..10e6; <>"
>perl -E" $a[$_] = pack 'VdA8', 1,3.2,'a short string' for 1..10e6; <>"
Only ~1.1 GB.
By saving 75% of the memory requirement, you may be able to fit all your data into RAM comfortably.
It means that you need to split or unpack and join or repack each time you access or modify a node, but that will be far faster than doing disk IO. You can easily hide the details of the splitting and joining behind a tied array or an OO class. And you can even add a simple caching mechanism to avoid re-splitting in a read-update sequence.
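A minimal sketch of hiding the repacking behind an OO class, as suggested above (the class and method names are made up here; the 'VdA8' layout matches the pack one-liner):

```perl
use strict;
use warnings;

# Illustrative wrapper: callers see plain fields, storage stays packed.
package PackedNodes;

sub new { bless { nodes => [] }, shift }

sub set {
    my ($self, $i, $id, $weight, $label) = @_;
    # V = uint32 id, d = double weight, A8 = space-padded 8-char label
    $self->{nodes}[$i] = pack 'VdA8', $id, $weight, $label;
}

sub get {
    my ($self, $i) = @_;
    # 'A' strips the trailing space padding on unpack
    return unpack 'VdA8', $self->{nodes}[$i];
}

package main;

my $store = PackedNodes->new;
$store->set(0, 1, 3.2, 'short');
my ($id, $weight, $label) = $store->get(0);
print "$id $weight $label\n";   # 1 3.2 short
```

A read-update cache would slot into get/set without touching the callers.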
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Managing a graph with large number of nodes
by jethro (Monsignor) on Nov 01, 2009 at 21:41 UTC
You might use a disk-based hash such as DBM::Deep. That seems to me your best option without knowing any more details.
Or an SQL database (there are lots of database interfaces around, for example DBIx::Simple or DBIx::Class), but you would also need to install a database engine.
Files for individual nodes would be a bad idea, as every access to a node would entail opening a file, which I would assume is heavy overhead.
Depending on the nature of the nodes and how they're going to be accessed/manipulated, an object store (such as Pixie, Tangram, or KiokuDB) seems likely to work well for something like this, while avoiding the need to worry about SQL schemas (even if the objects are ultimately stored into an SQL database).
CREATE TABLE dbo.Users (
    Id int not null identity(1,1) primary key,
    Code varchar(...),
    FirstAttribute ...,
    SecondAttribute ...,
    ...
)
go
CREATE INDEX idx_Users_Code ON dbo.Users(Code)
go
CREATE TABLE dbo.Links (
    UserId1 int not null,
    UserId2 int not null,
    Weight ...,
    CONSTRAINT pk_Links PRIMARY KEY CLUSTERED (UserId1, UserId2)
)
go
and that's about it.
An object store is most likely an overcomplication in this case.
Jenda
Enoch was right!
Enjoy the last years of Rome.
Re: Managing a graph with large number of nodes
by Ea (Chaplain) on Nov 02, 2009 at 12:35 UTC
I've made a graph using the Graph module that has 0.5 million nodes and 100 thousand links. It takes over 300 MB of memory, so tying it to disk or increasing the amount of swap space might work. The other possibility, depending on the nature of the graph, is to try PDL, which is implemented in C, and use a matrix representation for the weighted graph with other data structures to handle the various attributes.
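For reference, a minimal sketch of the Graph-module approach (requires the CPAN Graph distribution; the vertex names and weights here are illustrative):

```perl
use strict;
use warnings;
use Graph::Undirected;   # from the CPAN Graph distribution

my $g = Graph::Undirected->new;
$g->add_weighted_edge('alice', 'bob',   0.7);
$g->add_weighted_edge('bob',   'carol', 1.5);

print scalar $g->vertices, "\n";                    # 3
print $g->get_edge_weight('alice', 'bob'), "\n";    # 0.7
```

Graph keeps everything as nested Perl structures, which is where the per-node memory overhead comes from.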
If you get desperate enough to write it yourself, take a gander at the C++ code under Optimising "generalised modularities". For my graph, it runs very fast.
Ea
perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?
I'd agree with this; the matrix approach I'd go with is an incidence one, which would be O(E), here 2*100k (col1 = start point, col2 = end point). The adjacency approach would be 500k * 500k = 2.5e11 = 250 billion entries = a minimum of ~30 GB even if you used bitfields. Naturally, I think PDL would be a good platform for this.
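The bitfield arithmetic above can be sketched in plain Perl with vec(), no PDL required (the node numbers here are illustrative; at 500k nodes each adjacency row is ~61 KB, which is exactly why the full matrix blows up while an incidence/edge list stays O(E)):

```perl
use strict;
use warnings;

# One bit-string per node holds its adjacency row.
my @adj;

sub link_nodes {
    my ($i, $j) = @_;
    $adj[$i] //= '';
    $adj[$j] //= '';
    vec($adj[$i], $j, 1) = 1;   # symmetric: undirected graph
    vec($adj[$j], $i, 1) = 1;
}

sub linked {
    my ($i, $j) = @_;
    # vec() reads past the end of a string as 0, so sparse rows stay short
    return defined $adj[$i] && vec($adj[$i], $j, 1);
}

link_nodes(3, 7);
print linked(3, 7) ? 'yes' : 'no', "\n";   # yes
print linked(3, 8) ? 'yes' : 'no', "\n";   # no
```

For edge weights you would still need a parallel structure, as noted above; the bit matrix only answers "is there an edge?".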
Re: Managing a graph with large number of nodes
by Anonymous Monk on Nov 02, 2009 at 01:33 UTC
May we know the type of data? Maybe a more elegant solution already exists. A database is a good solution. If RAM is the only problem, there could be many solutions. But you have to keep the application in mind too.
Re: Managing a graph with large number of nodes
by zentara (Archbishop) on Nov 02, 2009 at 16:46 UTC
.... not saying it's the best solution, but if I was presented with the problem, with my background, I would just use a Storable database (memory is getting cheap... a lot of systems have 4 GB standard now).... and just load it into a well designed hash.... then use Tk, Gtk2, or Zinc to display the nodes as different colored little icons that are interactive with the mouse.... which is a cool feature.... a graph that responds in realtime to the mouse..... but 1 million data points may overload the widget..... but there are ways around the problem, by breaking the data points into smaller sets for display..... a multi-planar layering would probably be possible too. See Tk Realtime data acquisition
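A minimal sketch of the Storable approach (core module; the hash contents and file name are illustrative):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);   # core module

# Illustrative node hash; the real one would hold your graph.
my %graph = (
    alice => { weight => 3.2, links => ['bob'] },
    bob   => { weight => 1.1, links => ['alice'] },
);

store \%graph, 'graph.sto';          # one fast binary write...
my $loaded = retrieve 'graph.sto';   # ...and one fast read at startup

print $loaded->{alice}{weight}, "\n";   # 3.2
```

Unlike the tied-hash solutions, this loads the whole structure into RAM, which is the "memory is getting cheap" trade-off being made here.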
Re: Managing a graph with large number of nodes
by NiJo (Friar) on Nov 02, 2009 at 22:36 UTC
If you are just trying to plot the nodes, I'd look no further than Graphviz. It even has a Perl module. In that case you don't need all the nodes in memory: parse your input files and transform them line by line into a large dot file. Of course Graphviz also takes edge weights into account.
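The line-by-line transformation can be sketched like this (the in-memory edge list here stands in for whatever your parsed input files produce; in the real thing you'd print each edge as you read it):

```perl
use strict;
use warnings;

# Stream edges straight into a dot file so the whole graph
# never has to live in memory at once.
open my $dot, '>', 'graph.dot' or die "open: $!";
print {$dot} "graph G {\n";

for my $edge (['a', 'b', 0.7], ['b', 'c', 1.5]) {
    my ($from, $to, $w) = @$edge;
    print {$dot} qq{  "$from" -- "$to" [weight=$w];\n};
}

print {$dot} "}\n";
close $dot or die "close: $!";
# Render with: dot -Tpng graph.dot -o graph.png (or fdp/sfdp for large graphs)
```

Graphviz then does the layout work itself, so Perl only ever holds one line of input at a time.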
Re: Managing a graph with large number of nodes
by MadraghRua (Vicar) on Nov 02, 2009 at 18:11 UTC
I'm not sure of your ultimate app, but if you are working with RDF triples and using these to draw graphs, you might try RDF Store. It appears to have a lot of the functionality you are looking for. Like others I am interested in this topic too and would appreciate it if you could share more information on what you need the graph for. Thanks!
MadraghRua yet another biologist hacking perl....
Re: Managing a graph with large number of nodes
by spx2 (Deacon) on Nov 14, 2009 at 23:25 UTC
I think you want to cluster them. Also, notice that in graphs that big, the biggest problem you will face is the large number of edge crossings, which you don't want (and which you will probably not be able to avoid if a planar embedding does not exist). If you do cluster them (on whatever criteria you decide), you will probably have fewer edge crossings inside the clusters, so the drawing will be less confusing to the eye. Apart from that, you can look up force-based algorithms; those can progressively find a planar embedding of the graph. Graphviz can use force-based algorithms if you use the fdp layout. Also see Visualizing very large datastructures, thejit, and Graphing huge social networks.