As others have said, you're not giving much away...
...however one can speculate.
- suppose the C stuff reads in the entire 1GB into some simple searchable structure whose leaves are lists of items. A common way of improving performance is to move items to the front of the list when found. Subsequent queries for the same thing will run faster.
- suppose the C stuff reads in the entire 1GB, but doesn't do much organisation of the data -- perhaps to minimise latency. Each query could do some opportunistic organisation, speeding up future queries.
- or, as above, but runs a background thread to organise the data.
- suppose the C stuff does not read in the entire 1GB, but reads parts of it as queries require, but keeps what it's read.
- suppose the C stuff simply maps the 1GB into VM, and lets the OS load stuff on demand.
- or, as above, but runs a background thread to read and organise the data.
These are all examples of forms of caching, as suggested by GrandFather. It's also possible that:
- the processing of a query triggers a lot of memory allocation/deallocation/garbage collection activity, but this settles down after a few queries.
- although you have plenty of real memory, it still takes a while to build up to the full working set.
- the disc is struggling to read the data, but once it's in memory, you're fine.
So much for speculation.
Mind you, you say that things are slow to start with, no matter whether the input you're querying with is 20 lines or 1 million lines... and no matter how many lines, it's very slow for the first 20% of them? Everything I can think of I would expect to speed up either as more queries are made, or after a period related to the size of the data. Neither of those is proportional to the number of lines queried.
You're sure it's not data-dependent?
Long story made short: yes, one can imagine ways that a program along the lines you describe could speed up over time -- but more information is required to diagnose why in this case!