santhosh.yamsani has asked for the wisdom of the Perl Monks concerning the following question:

I have two tables in a database which contain URLs (almost one lakh, i.e. 100,000, each).

Both tables may contain the same URLs or different URLs.

So I need to copy these URLs from the two tables into two different files, and then compare those URLs.

So I planned to load the URLs from the database into two different files, but I am getting an "Out of memory!" error.

So initially I tried copying only 100 URLs from the database to each file.

Following is the code:

use DBI;

### The database handles
my $var;
my $var_ex;
my $dbh  = DBI->connect("dbi:ODBC:MSSQL");
my $dbh4 = DBI->connect("dbi:ODBC:MSSQL");

### The statement handles
my $sth = $dbh->prepare("select Top 100 URLName from URL ORDER BY URLID DESC");
$sth->execute();
my $sth4 = $dbh4->prepare("select Top 100 URLName from URL_EXCLUDE");
$sth4->execute();

print "\n starts";

open(URLFHW, ">Url.txt");
while ($var = $sth->fetchrow_array) {
    print URLFHW "$var\n";
}
close(URLFHW);
print "\n 1st file ";

open(URLEXFHW, ">Url_ex.txt");
while ($var_ex = $sth4->fetchrow_array) {
    print URLEXFHW "$var_ex\n";
}
close(URLEXFHW);
print "\n 2nd file";

OUTPUT:

D:\perl> Copy_urls.pl

starts 1st file Out of memory! 2nd file

Result:

Two files (Url.txt, Url_ex.txt) are created with 100 URLs each, but the Perl program gets terminated at the second step itself; it does not execute any code I write after the second step. I have written some script for comparing the URLs after the second step, but it is not executing.

Can anyone help me with this?

!!!!...Thanks in ADVANCE...!!!!

Replies are listed 'Best First'.
Re: Out of memory!
by roboticus (Chancellor) on Aug 11, 2010 at 14:18 UTC

    santhosh.yamsani:

    Why go to the trouble of reading the tables in perl? It sounds like a problem the database is suited to. For example:

    -- Get list of URLs in table 1 not present in table 2
    select * from table_1 where URL not in (select URL from table_2)

    -- Get list of URLs in both tables
    select T1.URL
      from table_1 T1
      join table_2 T2 on T1.URL = T2.URL

    -- Same, but where "description" doesn't match
    select T1.URL
      from table_1 T1
      join table_2 T2 on T1.URL = T2.URL
     where T1.Description != T2.Description

    Moving a ton of data from the database server to perl to do simple comparisons is generally the wrong way to go. Learn SQL enough to do your work in the database when possible.

    Of course, there are times when it's appropriate to use perl to whack entire tables. But in that case, you probably want to use a bulk copy utility (BCP on Sybase and MS SQL Server) to dump the tables to a flat file. It's faster and you won't run out of memory.
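
    For example, here is a rough sketch of driving bcp from Perl (this is not from the post; the database, table, and server names are placeholders you would replace):

    use strict;
    use warnings;

    # Hypothetical example: have SQL Server's bcp utility dump the table
    # straight to a flat file instead of pulling every row through DBI.
    # 'MyDb.dbo.URL' and 'myserver' are placeholder names.
    my @bcp = (
        'bcp', 'MyDb.dbo.URL',   # database.schema.table to export
        'out', 'Url.txt',        # destination flat file
        '-c',                    # character (plain text) mode
        '-S', 'myserver',        # SQL Server instance to connect to
        '-T',                    # trusted (Windows) authentication
    );
    system(@bcp) == 0 or die "bcp failed with status $?";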

    ...roboticus

      Moving a ton of data from the database server to perl to do simple comparisons is generally the wrong way to go. Learn SQL enough to do your work in the database when possible.
      That, especially the second sentence, is only true up to a point. It doesn't scale. Certainly, if you have just one process accessing the database, let the database do the work. But if you have a hundred servers accessing the database, it may actually pay to do as little work as possible in the database.

        JavaFan:

        While I can imagine cases where it's better to do the job on a different machine in perl, I'll stand by my statement in this case. If I have 100 servers accessing the database, I certainly don't want one task sucking up all the network bandwidth by exporting and importing entire tables instead of just the rows it wants. That's why I put the weasel words "simple comparisons" in there. Index scans are normally cheap, especially on frequently-used tables where the index is mostly cached in RAM anyway.

        Having said that, I totally agree with your sentiment that you need to be cognizant of other database users. We definitely don't want someone to be an inconsiderate database user, frivolously consuming resources with no regard to database performance. Large-scale comparisons, for example, shouldn't be a frequent operation anyway. Properly structuring the applications can normally keep mass comparisons down to a minimum, such as for periodic maintenance.

        ...roboticus

      Hi all..!!

      Thanks for your replies.

      Firstly, I did this task in SQL only, but it took 40 to 45 minutes to compare just 1000 URLs, so you can imagine how much time it would take for lakhs of URLs. That is why I am trying to do this task using Perl. Moreover, I have to do this task weekly, so I need to find a permanent solution. Can anyone help with this?

        santhosh.yamsani:

        Have you tried to optimize your SQL? 40-45 minutes is quite a long time to compare 1000 URLs in *any* language, unless you're doing some amazingly complex comparisons. Do you know which statements are taking up your time?

        If you give me a few details about your table structure, indexes, and the SQL that's performing poorly, I can offer you some suggestions on how to make it faster.

        ...roboticus

Re: Out of memory!
by dasgar (Priest) on Aug 11, 2010 at 12:44 UTC

    If I'm reading your code correctly, you're doing the following:

    • Querying database 1
    • Querying database 2
    • Writing to file 1 using query 1
    • Writing to file 2 using query 2
    Also, it looks like your two queries and associated output files are completely independent of each other. In that case, I'd recommend doing the following:
    • Connect and query database 1
    • Write to file 1
    • Use undef on the variables related to database 1
    • Repeat above steps for database 2 and file 2
    This will limit the amount of data in memory and free up memory once you're done with that data; a minimal sketch of that sequence follows below.
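
    The sketch reuses the DSN, query, and file name from the original post purely for illustration; the explicit finish/disconnect/undef between the two halves is the part that frees the memory:

    use strict;
    use warnings;
    use DBI;

    # First query: connect, fetch, write the file
    my $dbh = DBI->connect('dbi:ODBC:MSSQL') or die $DBI::errstr;
    my $sth = $dbh->prepare('select Top 100 URLName from URL ORDER BY URLID DESC');
    $sth->execute();

    open my $fh, '>', 'Url.txt' or die "Cannot open Url.txt: $!";
    while ( my ($url) = $sth->fetchrow_array ) {
        print {$fh} "$url\n";
    }
    close $fh;

    # Release everything tied to the first query before starting the second
    $sth->finish;
    $dbh->disconnect;
    undef $sth;
    undef $dbh;

    # ...then repeat the same connect/write/undef sequence for URL_EXCLUDE
    # and Url_ex.txt.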

      Dasgar('dbi:ODBC:MSSQL', 'Url.txt');
      Dasgar('dbi:ODBC:MSSQL', 'Url_Ex.txt');

      sub Dasgar {
          my ($dbstring, $filename) = @_;
          my $dbh = DBI->connect($dbstring);
          open my $fh, '>', $filename or die "Cannot open $filename: $!";
          # ... run the query and print each row to $fh ...
      }

        Oh no! I've been reduced to a subroutine! :D

        Although I almost suggested a subroutine solution, I opted for a conceptual answer rather than going the code route. However, since he's doing slightly different SQL queries from the databases, I would recommend that the SQL query be another parameter to be passed into the subroutine.
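
        For instance (just extending the illustrative sketch above), the calls might become:

        Dasgar('dbi:ODBC:MSSQL', 'select Top 100 URLName from URL ORDER BY URLID DESC', 'Url.txt');
        Dasgar('dbi:ODBC:MSSQL', 'select Top 100 URLName from URL_EXCLUDE', 'Url_Ex.txt');

        with the subroutine taking ($dbstring, $sql, $filename) and preparing whatever query it is handed.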

        Otherwise, it looks like you've managed to read my mind. Now if only I can get Perl to do that, I wouldn't have to worry about making mistakes in my code.

      Thanks dasgar.

      The out of memory issue is resolved for the first step (copying the URLs to two different txt files). The next step is to compare each URL in the first file with each URL exclusion in the second file and copy the matched URL, along with its exclusion ID, into a new file (matchURL.txt).

      After completion of the second step, we can go for the third step, i.e. updating the ID with the ExclusionID from matchURL.txt.

      The OUT OF MEMORY error is resolved in the first step, but I am getting that error in the second step.

      Code for the second and third steps is given below:

      open (URLFHR, 'Url.txt');
      open (URLEXFHR, "Url_ex.txt");
      while ($ee = <URLFHR>) {
          while ($ee4 = <URLEXFHR>) {
              #print "$ee - $ee4 \n";
              $abc4 = $ee4;
              my $sthID = $dbhID->prepare("select MAX(ID) from DI_URL_EXCLUDE where [RULE] ='$ee4' ");
              $sthID->execute();
              $ID = $sthID->fetchrow_array;
              #print "$ID \n";
              undef( $dbhID );
              undef( $sthID );
              if ($ee4 =~ /^%/) {
                  $abc4 = $ee4;
                  $abc4 =~ s/^%//;    ## first letter
                  # print "$abc4 \n";
              }
              if ($ee4 =~ /%$/) {
                  $abc4 = $ee4;
                  $abc4 =~ s/%$//;    ## Last letter
                  #print "$abc4 \n";
              }
              $ee   = quotemeta( $ee );    # To avoid error (Unmatched ) in regex; marked by <-- HERE ); to escape the special characters
              $abc4 = quotemeta( $abc4 );
              if ( $ee =~ /$abc4/ ) {
                  #print "In comparision of $ee and $ee4, $ID \n";
                  open (SIMILARURLFHW, '>>Similar_Url.txt');
                  print SIMILARURLFHW "$ee\{\|\|\}$ID \n";
                  close(SIMILARURLFHW);
                  print "\n3";
              }
          }
      }
      print "\n4";

      my $a;
      while ($a = <sampleFile>) {
          my $UrlName = substr $a, 0, index($a, '{||}');
          my $EXID    = substr $a, index($a, '{||}') + 4;
          my $sthUPEXID = $dbhUPEXID->prepare("UPDATE DI_URL SET EXCLUSIONID =$EXID where URLName = '$UrlName' "); # Updating EXID in emp with ID in emp4 where emp_fname matches in both
          $sthUPEXID->execute();
      }
        I'm not sure I understood what you want; here is my piece of code, maybe it helps. If the URL exclude list is too big, you can just split it into smaller batches and collect the "match id" results (you may collect some of them multiple times).
        use re 'eval';

        my @url_list = qw( www.gmaisds.% www.gmai% www.go% www.gir% www.girs% );
        my @list;
        my $retval;
        my $id = 0;
        for my $url (@url_list) {
            $url =~ s/(?=[\/\.\?])/\\/g;
            $url =~ s/%/.*/g;
            $url .= '(?{ use re \'eval\'; $retval = ' . $id++ . '})';
            push @list, $url;
        }
        my $str = '^(?:' . join(')|(?:', @list) . ')$';
        my $re = qr($str);
        while (<>) {
            chomp;
            if ($_ =~ $re) {
                print "$_ excluded; id: $retval\n";
            }
        }
Re: Out of memory!
by mje (Curate) on Aug 12, 2010 at 07:45 UTC

    Ignoring the other answers you've had, some of which contain some good advice, I'm still interested as to how you get out of memory with only 100 rows in each select. What is the type of the URLName column and how long is an average field? What versions of DBI and DBD::ODBC are you using?

      For example, suppose the URLs in the URL table are like:

      1) www.google.com

      2) www.gmail.com

      3) www.gangster.com

      4) www.gorilla.com

      5) www.giraffe.com

      6) www.girstapache.com

      and the URLs in the URL_Exclusion table are like:

      1) www.g%

      2) www.gmaisds.%

      3) www.gmai%

      4) www.go%

      5) www.gir%

      6) www.girs%

      As above, I have lakhs of URLs, and I need to find the exact (most specific) match for each URL. For example, www.google.com matches both

      www.g% and www.go%

      so the exact match is www.go%.

      In this way I have to update the ID column in the URL table with the ExclusionId from the URL_Exclusion table.

      For example, as explained above, the Id of www.google.com should be updated with the ExclusionId of www.go%.

      As for why it is taking a long time: the comparison involves the LIKE operator, which compares each URL against all the URLs in the URL_Exclusion table.
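
      In Perl, one way to express that "most specific pattern wins" rule is to compile each % pattern into an anchored regex once and then, of the rules that match, keep the one with the longest literal part. This is only a sketch under that assumption (the rules use % as their only wildcard); the sample data is taken from the lists above:

      use strict;
      use warnings;

      my @rules = ('www.g%', 'www.gmaisds.%', 'www.gmai%', 'www.go%', 'www.gir%', 'www.girs%');

      # Compile each rule once: quotemeta the literal pieces, turn each % into .*
      my @compiled;
      for my $rule (@rules) {
          my $re = join '.*', map { quotemeta } split /%/, $rule, -1;
          (my $literal = $rule) =~ s/%//g;   # the literal part measures how specific the rule is
          push @compiled, { rule => $rule, re => qr/^$re$/, len => length $literal };
      }

      for my $url ('www.google.com', 'www.giraffe.com') {
          # Of all the rules that match, keep the most specific one
          my ($best) = sort { $b->{len} <=> $a->{len} }
                       grep { $url =~ $_->{re} } @compiled;
          printf "%s -> %s\n", $url, $best ? $best->{rule} : '(no match)';
      }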

        What are the column types in the database tables?

        I heard Perl is a powerful tool which can handle this type of scenario, so I am taking this approach of copying the URLs into files, comparing the URLs, and updating the database using Perl.

        But I am unable to move a step forward, as I am getting the OUT OF MEMORY! error.

        Thanks for your previous responses.