Sunday, June 13, 2021

Solution for QSoas quiz #2: averaging several Y values for the same X value

This post describes two similar solutions to the Quiz #2, using the data files found there. The two solutions described here rely on split-on-values. The first solution is the one that came naturally to me, and is by far the most general and extensible, but the second one is shorter, and doesn't require external script files.

Solution #1

The key to both solution is to separate the original data into a series of datasets that only contain data at a fixed value of x (which corresponds here to a fixed pH), and then process each dataset one by one to extract the average and standard deviation. This first step is done thus:
QSoas> load kcat-vs-ph.dat
QSoas> split-on-values pH x /flags=data
After these commands, the stacks contains a series of datasets bearing the data flag, that each contain a single column of data, as can be seen from the beginnings of a show-stack command:
QSoas> k
Normal stack:
	 F  C	Rows	Segs	Name	
#0	(*) 1	43	1	'kcat-vs-ph_subset_22.dat'
#1	(*) 1	44	1	'kcat-vs-ph_subset_21.dat'
#2	(*) 1	43	1	'kcat-vs-ph_subset_20.dat'
...
Each of these datasets have a meta-data named pH whose value is the original x value from kcat-vs-ph.dat. Now, the idea is to run a stats command on the resulting datasets, extracting the average value of x and its standard deviation, together with the value of the meta pH. The most natural and general way to do this is to use run-for-datasets, using the following script file (named process-one.cmds):
stats /meta=pH /output=true /stats=x_average,x_stddev
So the command looks like:
QSoas> run-for-datasets process-one.cmds flagged:data
This command produces an output file containing, for each flagged dataset, a line containing x_average, x_stddev, and pH. Then, it is just a matter of loading the output file and shuffling the columns in the right order to get the data in the form asked. Overall, this looks like this:
l kcat-vs-ph.dat
split-on-values pH x /flags=data
output result.dat /overwrite=true
run-for-datasets process-one.cmds flagged:data
l result.dat
apply-formula tmp=y2;y2=y;y=x;x=tmp
dataset-options /yerrors=y2
The slight improvement over what is described above is the use of the output command to write the output to a dedicated file (here result.dat), instead of out.dat and ensuring it is overwritten, so that no data remains from previous runs.

Solution #2

The second solution is almost the same as the first one, with two improvements:
  • the stats command can work with datasets other than the current one, by supplying them to the /buffers= option, so that it is not necessary to use run-for-datasets;
  • the use of the output file can by replaced by the use of the accumulator.
This yields the following, smaller, solution:
l kcat-vs-ph.dat
split-on-values pH x /flags=data
stats /meta=pH /accumulate=* /stats=x_average,x_stddev /buffers=flagged:data
pop
apply-formula tmp=y2;y2=y;y=x;x=tmp
dataset-options /yerrors=y2


About QSoas

QSoas is a powerful open source data analysis program that focuses on flexibility and powerful fitting capacities. It is released under the GNU General Public License. It is described in Fourmond, Anal. Chem., 2016, 88 (10), pp 5050–5052. Current version is 3.0. You can download its source code there (or clone from the GitHub repository) and compile it yourself, or buy precompiled versions for MacOS and Windows there.