Many things could. You don't give us much to go on. Caching at some level is the most likely explanation: the first 20% of queries builds the cache and the remainder reaps the benefit.
It should be fairly trivial to slow the fast bit down by checking loop iteration time and putting sleeps in as required.
Perl reduces RSI - it saves typing
As others have said, you're not giving much away...
...however one can speculate.
- suppose the C stuff reads in the entire 1GB into some simple searchable structure whose leaves are lists of items. A common way of improving performance is to move items to the front of the list when found. Subsequent queries for the same thing will run faster.
- suppose the C stuff reads in the entire 1GB, but doesn't do much organisation of the data -- perhaps to minimise latency. Each query could do some opportunistic organisation, speeding up future queries.
- or, as above, but runs a background thread to organise the data.
- suppose the C stuff does not read in the entire 1GB, but reads parts of it as queries require, but keeps what it's read.
- suppose the C stuff simply maps the 1GB into VM and lets the OS load pages on demand.
- or, as above, but runs a background thread to read and organise the data.
These are all forms of caching, as GrandFather suggested. It's also possible that:
- the processing of a query triggers a lot of memory allocation/deallocation/garbage collection activity, but this settles down after a few queries.
- although you have plenty of real memory, it still takes a while to build up to the full working set.
- the disc is struggling to read the data, but once it's in memory, you're fine.
So much for speculation.
Mind you, you say that things are slow to start with no matter whether the input you're querying with is 20 lines or 1 million lines, and that whatever the count, it's very slow for the first 20% of them? Everything I can think of would speed up either as more queries are made, or after a period related to the size of the data. Neither of those is proportional to the number of lines queried.
You're sure it's not data-dependent?
Long story, made short: yes, one can imagine ways that a program along the lines you describe could speed up over time, but more information is required to diagnose why in this case!
Thanks everyone for your suggestions. (I am the poster; I thought I was logged in.)
I know the data structure is static once it's loaded, since I wrote the C library, and that the Perl script parses only one line of data at a time. That's why this is really confusing me.
I don't think it directly relates to your problem, but it reminds me of Strings and numbers: losing memory and mind. In that case, my large data structure started out as strings and was converted to numbers on the fly as I used it. That caused a lot more memory usage after the point where I thought it would be static. Maybe your data structure is experiencing a similar "one time cost" as it's accessed.
Thanks for everyone's suggestions. At least I have some ideas now of why my loops might speed up. I now think that in this particular situation the slow-down is somehow in the C library, and not Perl's or Inline's fault. I am re-implementing it in pure Perl; I think that will be faster for me than figuring out what's going wrong in C.