ltp has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Long time lurker, first time poster.

As part of a backup script, I'm using rmtree to regularly remove a somewhat large (~5GB) tree containing many subdirectories and (mostly) small files.
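For context, the relevant part of the script is essentially just a single call along these lines (the path here is made up; the real one differs):

use strict;
use warnings;
use File::Path qw(rmtree);

# Remove the whole tree in one go; rmtree returns the number of
# entries it managed to delete.
my $removed = rmtree('/data/backup/tree_to_remove');
print "removed $removed entries\n";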

Unfortunately, the rmtree operation is taking a longish time, even on relatively fast hardware. Is there a more efficient way to remove the tree?

My initial thought was to implement a number of worker threads, each performing an rmtree operation on a lower level of the tree. The process behind implementing such a threaded tree deletion seems fairly straightforward to me, but I was surprised to find no examples of others having taken similar approaches when searching (maybe my google-fu is just weak).
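Roughly, what I had in mind is the quick and dirty sketch below (the path and worker count are made up, and it's untested beyond a toy tree): push each top-level entry onto a queue and let a handful of worker threads rmtree them in parallel.

use strict;
use warnings;
use threads;
use Thread::Queue;
use File::Path qw(rmtree);
use File::Spec;

my $ROOT      = '/data/backup/tree_to_remove';   # made-up path
my $N_WORKERS = 4;                               # arbitrary

my $q = Thread::Queue->new;

# Queue every entry directly under the root.
opendir my $dh, $ROOT or die "opendir $ROOT: $!";
$q->enqueue( map { File::Spec->catdir( $ROOT, $_ ) }
             grep { !/^\.\.?$/ } readdir $dh );
closedir $dh;
$q->enqueue(undef) for 1 .. $N_WORKERS;          # one stop marker per worker

# Each worker pulls an entry and removes it.
my @workers = map {
    threads->create( sub {
        while ( defined( my $path = $q->dequeue ) ) {
            -d $path ? rmtree($path) : unlink $path;
        }
    } );
} 1 .. $N_WORKERS;

$_->join for @workers;
rmtree($ROOT);    # the root should now be (nearly) empty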

Do the monks have any thoughts or advice on this topic? Is there already a module that implements this kind of deletion?

Re: Threaded tree deletion
by BrowserUk (Patriarch) on Apr 17, 2010 at 18:21 UTC
    I was surprised to find no examples of others having taken similar approaches

    Such processes are I/O-bound, so there is little or no benefit to be gained from using more CPU resources to perform them.

    If you take a look at the CPU usage of your single-threaded process whilst it performs the task, it will probably average somewhere between 5% & 10% of a single CPU. The speed of processing is entirely limited by the speed of the disk. Throwing multiple threads at it will most likely slow it down.
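    A rough way to check (a sketch only; the path is made up) is to compare the CPU time the process consumes with the wall-clock time of the rmtree call:

    use strict;
    use warnings;
    use Time::HiRes qw(time);
    use File::Path qw(rmtree);

    my $tree = '/data/backup/tree_to_remove';   # made-up path

    my $wall0 = time;
    my ( $u0, $s0 ) = times;                    # CPU seconds used so far

    rmtree($tree);

    my $wall = time - $wall0;
    my ( $u1, $s1 ) = times;
    my $cpu  = ( $u1 - $u0 ) + ( $s1 - $s0 );

    printf "wall: %.1fs  cpu: %.1fs  (~%.0f%% of one core)\n",
        $wall, $cpu, $wall ? 100 * $cpu / $wall : 0;

    If the CPU figure is a small fraction of the wall-clock figure, the disk is the limit and more threads won't help.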


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for the reply. I had initially thought that I/O wait would be the limiting factor in a threaded rmtree operation, but some early experiments (not Perl, but a shell script using multiple processes rather than threads) indicated that I might see some improvement over a single recursive deletion.

      I admit ignorance as to how Perl's rmtree is implemented, but I thought there may have been some loose parallels.

      Unfortunately, a couple of hours spent on a fine Sunday testing a (quick and dirty) threaded rmtree failed to yield the results I was looking for to justify my initial query.

      Namely, the single deltree operation that, according to the logfiles, seemed to be taking over two hours took only 158 seconds using a threaded model and, surprisingly enough, only 222 seconds using a single, top-level rmtree operation.

      Suffice it to say, I am both humbled and stumped. I've put some additional logging into the script to try to determine exactly where the bottleneck is occurring (I'm thinking it may be down to my sparse logging practices and the compression of said tree prior to removal). Hopefully, I'll have some better meditation fodder tomorrow.
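      Something along these lines (the helper and path are invented for illustration) is what I'm adding, so the log shows whether the compression or the deletion is eating the time:

      use strict;
      use warnings;
      use Time::HiRes qw(time);
      use File::Path qw(rmtree);

      my $tree = '/data/backup/tree_to_remove';   # made-up path

      # Wrap each phase of the backup so the log shows where the time goes.
      sub timed_phase {
          my ( $label, $code ) = @_;
          my $t0 = time;
          $code->();
          printf STDERR "%s took %.1fs\n", $label, time - $t0;
      }

      # e.g. timed_phase( 'compress tree' => sub { ...existing tar/gzip step... } );
      timed_phase( 'remove tree' => sub { rmtree($tree) } );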

        158 seconds using a threaded model, ..., only 222 seconds using a single, top-level rmtree operation.

        It is extremely difficult to come to any firm conclusions regarding these kinds of operations, because it is almost impossible to get reproducible results from benchmarks. There are just so many variables involved.

        For example, if you create a nested directory structure in one pass and then immediately delete it, the deletion is likely to run much more quickly than if the same directory structure had built up over a period of time. This is because in the former case both the relevant entries within the file allocation structures and the data extents of the files themselves will tend to be grouped close together in relatively contiguous areas of the disk, whereas in the latter case those entries and extents will tend to be spread all over the disk, intermingled with those from other directory trees and files, with little contiguity. The latter therefore causes far more random and widespread disk head movement.

        Obviously, this is also influenced by the OS, filesystem, disk configuration, other system activity, and a myriad of other factors.

        Here is one possible source of optimisation that is worth the effort of trying.

        As this is in the context of a backup mechanism, the usual scenario is that the directory is first traversed to copy the contents and then traversed again to delete it. I've found that on many systems it is quicker to do a move of the data from the original place to the backup medium. Because the elements are deleted as they are copied, the relevant parts of the file allocation structures are already in memory or cache when the deletions occur, so the disk heads only have to visit each location once instead of twice. (Or, more probably, twice: read then write; rather than three times: read, re-read, write.)
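        A sketch of what I mean, assuming File::Copy::Recursive is available (the paths are illustrative only): try a cheap rename first, and fall back to a recursive move, which deletes each element as soon as it has been copied, when the backup medium is on a different filesystem.

        use strict;
        use warnings;
        use File::Copy::Recursive qw(dirmove);

        my $src = '/data/live/tree';          # illustrative paths
        my $dst = '/backup/current/tree';

        # rename() is effectively instantaneous when source and destination
        # are on the same filesystem; otherwise do the copy-and-delete move.
        unless ( rename $src, $dst ) {
            dirmove( $src, $dst )
                or die "dirmove $src -> $dst failed: $!";
        }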

        It is possible for concurrency to speed this type of operation up under some circumstances. The problem is that those circumstances are a) unpredictable; b) unreliable; c) usually outweighed by the circumstances where the opposite is the case. One of the ways this can occur is if you have an old directory tree that has gone through many changes over its life, with the result that its constituent parts are spread widely all over the disk.

        Modern filesystems, disk drivers and even the hard disks themselves often have request-reordering capabilities. That is, they tend to accumulate the requests (from different processes and threads) to move the disk head to different parts of the disk over short periods of time and then re-order them, so that a later request for a read in the middle of the disk will be satisfied en passant, on the way to satisfying an earlier request at the end of the disk.

        In the case of the widely dispersed old directory structure, if you have two or three threads traversing different parts of the structure concurrently, there may be a wider variety of requests available for reordering in any given time period, thereby reducing the overall number of traversals required. But if and when that might happen is, for all intents and purposes, entirely unpredictable. And if it does happen, you just got lucky; don't go basing your algorithms upon it.

        The final thing I would say is that if your OS provides a 'move directory' API, and it is applicable to your application, use it; if not, but it provides 'copy dir' & 'delete dir' APIs, use them, in preference to rolling your own or other people's home-rolled solutions.

        Indeed, this is one of those situations where, if the only way to gain access to the OS-supplied and maintained versions is to call external executables, then do that in preference to any home-brew or generically interfaced module. The OS-supplied solution is far more likely to be tailored to the capabilities (and limitations) of the particular OS than any third-party module.
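        For example (illustrative only; the path is made up), shelling out to the native tools rather than re-implementing the traversal:

        use strict;
        use warnings;

        my $dir = '/data/backup/tree_to_remove';   # made-up path

        if ( $^O eq 'MSWin32' ) {
            # rmdir is a cmd.exe builtin, so go via the shell
            system(qq{rmdir /S /Q "$dir"}) == 0
                or warn "rmdir failed: exit status $?\n";
        }
        else {
            # list form avoids the shell entirely
            system( 'rm', '-rf', '--', $dir ) == 0
                or warn "rm -rf failed: exit status $?\n";
        }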


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.