chuckd has asked for the wisdom of the Perl Monks concerning the following question:

Hi. This isn't a Perl question, but since there are many talented individuals in this forum, I thought someone might be able to help.
Our company uses LAW PreDiscovery to extract text from different file formats (.doc, PDF, .MSG, etc.). LAW is a tool that uses SQL Server as its back-end database: it first opens each file, then extracts the text and other information into SQL Server tables. I'm looking for a way to speed things up a bit, and I was wondering what clustering options I could explore. I know there are different types of clusters and that they are used for different things. Does anyone have any experience with clusters? Does anyone know how I could use a cluster to speed up a desktop application that uses SQL Server as its back end?

Replies are listed 'Best First'.
Re: (OT) question about clustering
by kvale (Monsignor) on Oct 01, 2008 at 22:43 UTC
    I've done quite a bit of programming on Linux clusters. Some of that programming has even been with perl. Extracting and digesting documents is a trivially parallelizable problem--just distribute documents among the different machines and let each one chug away.

    The bottlenecks are (probably) in transferring the documents onto the cluster nodes and in pushing the resulting data out to the SQL Server back end. For the first bottleneck, speed depends on the network into the cluster if you are accessing it remotely, and on the network speed of the cluster itself. If you have slow disks, this can be a bottleneck, too. For the second bottleneck, talk to an SQL Server DBA about the effective bandwidth of your server and/or clustering solutions for the database itself.

    This sort of thing is done by Google all the time on huge clusters, so it may pay to look at their map-reduce paradigm for distributed document processing. Here, the map part is the text extraction and the reduce part is getting all that information organized in an SQL Server database.
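
    For the map step, a minimal Perl sketch using Parallel::ForkManager might look like this. The input directory and extract_text() are just stand-ins for whatever extraction tool you actually use:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Parallel::ForkManager;

        # "map" step: fan the documents out over a pool of worker processes
        my @documents = glob('incoming/*.doc');        # wherever the files live
        my $pm        = Parallel::ForkManager->new(8); # e.g. one worker per core

        for my $doc (@documents) {
            $pm->start and next;            # parent keeps looping; child does the work
            my $text = extract_text($doc);  # stand-in for the real extractor
            # "reduce" step would go here: push $text into the SQL Server back end
            $pm->finish;
        }
        $pm->wait_all_children;

        # trivial stand-in extractor: just slurp the file
        sub extract_text {
            my $file = shift;
            open my $fh, '<', $file or die "Cannot open $file: $!";
            local $/;
            return <$fh>;
        }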

    -Mark

      Hi Mark. Based on your reply, it seems like you have experience with text extraction. I posted another thread on PerlMonks; can you give me any advice on the question below?
      I'm looking for someone who might have advice on building a file extraction tool. My company currently uses LAW PreDiscovery to extract text and metadata from files like .msg, .doc, PDF, JPG, GIF, etc. This software is an out-of-the-box tool with many limitations that cause problems for me and the other engineers in our group, so we have been thinking about writing our own tool. I've looked at different modules on CPAN and found many things that I think might help, but I don't know if they are any good. Does anyone have experience building or writing tools for extracting text from files? If so, what did you use, how did you do it, how big was your project, did you use modules, and did you use any APIs or DLLs from Microsoft for the Microsoft formats (.doc, .ppt, .xls, etc.)?

      We are looking to build an in-house tool to do all our extraction and need advice on where to start.
        Most of my experience with text extraction is in the context of parsing and extracting data and metadata from files produced in scientific experiments. With the exception of .xls documents, these are custom formats, for which I would create compilers to extract and transform the text I wanted.

        File formats are little computer languages in disguise, so the general approach of writing a compiler from the format you start with to the format you want will always work. In practice, writing a compiler for each format can be an arduous process, made difficult by incomplete file format specifications, e.g., the .doc format.

        In your case, you are ETL'ing standard, albeit very different, formats. If I were you, I would take advantage of the programs that create these formats to do the extraction. Use Microsoft Word to convert .doc files to plain text; use Excel to convert .xls files to CSV. These conversions can be scripted easily enough using VBA/Visual Basic/Visual C# and will extract all the text there is to extract from the documents. From there, it is easy to write Perl programs to transform the resulting text to your custom needs.
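
        You can also drive the Office applications directly from Perl. A rough Win32::OLE sketch that has Word itself save a .doc out as plain text might look like this (it assumes a Windows machine with Word installed, and the file paths are made up):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Win32::OLE;
            use Win32::OLE::Const 'Microsoft Word';   # imports wdFormatText and friends

            my $doc_path = 'C:\\extract\\sample.doc'; # made-up paths
            my $txt_path = 'C:\\extract\\sample.txt';

            # start a hidden Word instance; 'Quit' shuts it down when we are done
            my $word = Win32::OLE->new('Word.Application', 'Quit')
                or die 'Cannot start Word: ', Win32::OLE->LastError;
            $word->{Visible} = 0;

            my $doc = $word->Documents->Open($doc_path)
                or die "Cannot open $doc_path: ", Win32::OLE->LastError;

            $doc->SaveAs($txt_path, wdFormatText);    # let Word do the text extraction
            $doc->Close(0);                           # close without saving changes

        The same pattern works for Excel (open the workbook, SaveAs CSV) and for Outlook .msg files.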

        -Mark

Re: (OT) question about clustering
by JavaFan (Canon) on Oct 01, 2008 at 21:42 UTC
    I have experience with cluster software from Sun, HP, IBM, and Veritas.

    The first thing you need to do is make a cost analysis. Clustering isn't something you implement on a rainy afternoon: you need hardware, software, and training. And clustering doesn't necessarily improve performance (in fact, performance isn't the usual reason for clustering; reliability is). For the problem you describe, I highly doubt clustering is the solution.

    And if you think it is, don't come to PerlMonks for answers. Talk to a couple of vendors that provide solutions for your platform. Ask what they can do for you and how much it'll cost.

Re: (OT) question about clustering
by Corion (Patriarch) on Oct 02, 2008 at 07:02 UTC

    From your vague description of the problem, I think the simplest approach that could work is to duplicate the LAW machines and give each LAW machine doing the extraction its own SQL Server. After each LAW machine has finished its extraction, merge the tables.

    Of course, this relies on the idea that each document can be treated independently of all the other documents, but you will need to be far more forthcoming about your application structure, your workflow, and your bottleneck analysis before we can actually help you.
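
    A rough DBI sketch of that merge step might look something like the following. The DSNs and the table and column names are invented; substitute whatever LAW actually writes:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # one DSN per LAW machine, plus one for the consolidated database
        my @sources = ('dbi:ODBC:LAW_NODE1', 'dbi:ODBC:LAW_NODE2');
        my $target  = DBI->connect('dbi:ODBC:LAW_MASTER', 'user', 'pass',
                                   { RaiseError => 1, AutoCommit => 0 });

        my $insert = $target->prepare(
            'INSERT INTO ExtractedText (DocID, FileName, BodyText) VALUES (?, ?, ?)');

        for my $dsn (@sources) {
            my $src = DBI->connect($dsn, 'user', 'pass', { RaiseError => 1 });
            my $sth = $src->prepare(
                'SELECT DocID, FileName, BodyText FROM ExtractedText');
            $sth->execute;
            while (my @row = $sth->fetchrow_array) {
                $insert->execute(@row);   # copy each row into the master table
            }
            $src->disconnect;
        }

        $target->commit;
        $target->disconnect;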

Re: (OT) question about clustering
by jethro (Monsignor) on Oct 01, 2008 at 22:49 UTC

    ++ to JavaFan for his answer. Instead of clustering, you might check whether your database can do replication: you have more than one database, and each holds the same data (but beware, MS may call replication "database clustering").

    Ideally you can split the workload evenly between those databases; otherwise you will also need a load balancer to distribute the load. Google for "SQL load balancing" and you'll find some Microsoft pages with more info.

    Disclaimer: I'm speaking generally, I have no experience at all with SQL Server.

      Replication can often be an answer if the applications do a lot of reading (of course, replication isn't an answer to a performance problem if the database isn't the bottleneck).

      But if you do a lot of writes (which is what the OP is doing, if I understand him correctly), using replication can be quite tricky. Even if you manage to set up your replication scheme correctly (replicating both ways is far from trivial) and have changed your application so it can deal with such a scheme, it may not give a significant speed-up.

      Note that the OP is talking about a desktop application. Performance problems for such applications are typically not solved with replication or clustering.

      I think the OP should first do some performance analysis to determine where the bottleneck is.
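
      Even crude timing of the two halves would tell a lot. A minimal sketch, where extract_text() and store_in_db() are just stand-ins for the real extraction and database steps:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Time::HiRes qw(gettimeofday tv_interval);

          my ($extract_time, $insert_time) = (0, 0);

          for my $doc (glob('incoming/*.doc')) {    # made-up input directory
              my $t0   = [gettimeofday];
              my $text = extract_text($doc);        # stand-in for the extraction step
              $extract_time += tv_interval($t0);

              $t0 = [gettimeofday];
              store_in_db($doc, $text);             # stand-in for the SQL Server insert
              $insert_time += tv_interval($t0);
          }

          printf "extraction: %.1fs, database: %.1fs\n", $extract_time, $insert_time;

          # trivial stand-ins so the sketch runs; replace them with the real steps
          sub extract_text {
              my $file = shift;
              open my $fh, '<', $file or die "Cannot open $file: $!";
              local $/;
              return <$fh>;
          }
          sub store_in_db { return }    # in real life, a DBI insert into SQL Server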

Re: (OT) question about clustering
by dHarry (Abbot) on Oct 02, 2008 at 12:28 UTC

    I fully agree with brother JavaFan++; clustering is probably not going to do you any good. My personal experience with clustering is not too positive (read: a lot of effort ending up with "virtual fail-over").

    I am not sure that replication is the solution, though. It seems you don't want to stress your DB, and the idea is to make a copy of it to do your work against? There are several ways of doing that; it can be as simple as dumping the database and rebuilding it elsewhere. How big is the database? How up to date does the data need to be? Maybe you have to think in the direction of a data warehouse solution?

    If you use SQL Server, there are decent replication solutions available; you can check MSDN for a starter. I have to add that replication solutions always take *effort* to implement and that they often have a negative impact on overall performance as well.