
Processing of big datasets

Hello, community!

I have almost managed to push a set of 385 Q-ToF measurements (around 200 GB of centroid data :-) through MS-DIAL v. 4.24. It has now finished gap filling (which took ca. 90 h) and has been doing something in limbo for another 24 h. The app is responding, something is written to the HDD from time to time, and there is some activity in RAM, but only a little. Only one or two of the 16 cores hit 100% load occasionally. I don't know when it will be done, but I'm already excited to see the results.

In this run, one of our in-house databases was used for metabolite identification. However, we had another idea and would like to check against a different database. Is there a way to run another post-identification pass on the existing ion table without going through the whole sequence of peak picking / alignment / gap filling / finalization again?

IIRC, MS-DIAL always does all of this when a new alignment instance is created, and that would take another week, I guess, if there is no way around it. Another important question also arises: let's say there is a sudden power outage during the gap-filling stage. Is there a way to resume from the point where processing stopped?

Re: Processing of big datasets

Reply #1
Hi Sergey,
check the "temp" folders. Most probably you will find small alignment files being processed during the "limbo time". I would suggest exploring a small subset of your files to get a realistic impression of what is worth the time, and optimizing the parameters. E.g., when I tried a lower threshold (20,000 -> 1,000) with ~700 ABF files, the gain was ~24,000 -> 27,000 peak spots for an extra 15 hours of calculation (1.5 -> 16.5 h). However, many of the extra peaks were within the noise or did not pass the subsequent quality check.
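A quick way to set up such a test run is to copy a reproducible random subset of the measurement files into a scratch folder and point MS-DIAL at that. This is a minimal sketch; the function name `sample_subset`, the fixed seed, and the `.abf` extension filter are my own choices for illustration, not anything MS-DIAL provides:

```python
import random
import shutil
from pathlib import Path

def sample_subset(src_dir, dst_dir, n, ext=".abf", seed=0):
    """Copy a reproducible random subset of measurement files into a
    scratch folder, so parameters can be tuned on a small test project
    before committing to the full multi-day run."""
    files = sorted(Path(src_dir).glob(f"*{ext}"))
    random.Random(seed).shuffle(files)  # fixed seed -> same subset every time
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for f in files[:n]:
        shutil.copy2(f, dst / f.name)  # copy2 preserves timestamps
    return min(n, len(files))
```

With a fixed seed, the same subset is drawn on every call, so different parameter sets are compared on identical input files.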

You can load another library and skip the peak-picking part, but alignment and gap filling take most of the processing time.

MS-DIAL will not recover partially processed data, so plug in a UPS if you are worried about a power failure.
Best,
O.


Re: Processing of big datasets

Reply #2
Hi, Ondrej!

>  I would suggest exploring a small subset of your files to get a realistic impression of what is worth the time, and optimizing the parameters. E.g., when I tried a lower threshold (20,000 -> 1,000) with ~700 ABF files, the gain was ~24,000 -> 27,000 peak spots for an extra 15 hours of calculation (1.5 -> 16.5 h).

I have already done exactly that; this processing session is as optimized as it gets. And I specifically need those low-abundance signals hiding deep in the noise for these experiments. The threshold is below 1E3 counts.

> but the alignment and gap filling take the most of the processing time

Let's wait for the new MS-DIAL architecture -)

> so plug in a UPS if you are afraid of the power failure

It was a hypothetical scenario. Personally, I have all of that and more.

> You will find small alignment files being processed during the "limbo time", most probably

Which temp folder exactly? The peak lists from gap filling are stored in a separate folder within the project location. Resource Monitor only showed handles for the pagefile and DLLs afterwards. Apparently, the steps after gap filling run in RAM. It now says "Gap filling: 385/385" and stays like that. Two of the 16 cores are regularly at 100% load, around 140 GB of memory is still committed to MS-DIAL, and at some point an ~800 MB RAM block becomes modified and then returns to the "in use" state.

It simply puzzles me what is being calculated that takes even longer than gap filling. In my experience, that was always the longest processing step.

Best,
SG

Re: Processing of big datasets

Reply #3
The TMP folder within the project folder. Usually something like
project_2020_9_9_12_33_13_tmpFolder
...
peaklist_101.pll
peaklist_100.pll
peaklist_10.pll
peaklist_1.pll
peaklist_0.pll
...
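For anyone stuck watching that "limbo" stage: since the tmp folder fills up with `peaklist_<n>.pll` files as shown above, you can estimate progress just by counting them. This is a hedged sketch based only on the file-naming pattern in this thread; the function `gap_fill_progress` and the percentage format are my own illustration, and it reads file names only, never MS-DIAL's internals:

```python
import re
from pathlib import Path

def gap_fill_progress(tmp_folder, total_files):
    """Count the peaklist_<n>.pll files that appear in the project's
    tmp folder (naming per the listing above) and report a rough
    completion estimate against the known number of sample files."""
    pattern = re.compile(r"^peaklist_(\d+)\.pll$")
    done = sum(1 for p in Path(tmp_folder).iterdir() if pattern.match(p.name))
    return done, total_files, f"{100.0 * done / total_files:.1f}%"
```

Running it periodically (e.g. `gap_fill_progress("project_..._tmpFolder", 385)`) gives a count and a percentage without disturbing the running process.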

Re: Processing of big datasets

Reply #4
Hi, Ondrej!

Problem solved, actually. It was the ToF data itself; after some preprocessing it was possible to get results in a reasonable time. Still, MS-DIAL in its current form is, unfortunately, not well equipped to deal with these profiles and the associated data volumes.