Scatter in log-space: converting between dex and percentage

3 minute read

Published:

Those of us who work in astrophysics frequently analyze the logarithms of quantities of interest due to their large dynamic range. Thus, a lot of predictive modeling tasks that we work on aim to predict \(\log_{10}(X)\) rather than \(X\) itself. Two of the main summary statistics that we use to describe the predictive power of a model are the scatter and bias of the residuals. If working in log-space, the residuals are \(\mathcal{R} = \log_{10}(X_\mathrm{pred}) - \log_{10}(X_\mathrm{true})\). It’s quite common that the the scatter is then reported as the standard deviation of the residuals:

\[\sigma_\mathcal{R} = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (\mathcal{R}_i - \bar{\mathcal{R}})^2},\]

where typically the bias \(\bar{\mathcal{R}}=0\) and this scatter is in units of “dex”. Now, if your residual distribution is not normal, you may want to report the scatter in another way. For example, you could report half of the interval between the 16th and 84th percentile of \(\mathcal{R}\).

While we do much of our analysis with log-space quantities, it’s often more useful to report errors in terms of the fractional/percentage error about the true values. Let’s call the fractional error

\[\mathcal{E} = \frac{X_\mathrm{pred} - X_\mathrm{true}}{X_\mathrm{true}} = \frac{X_\mathrm{pred}}{X_\mathrm{true}} - 1 .\]

Hence, the “percentage scatter” would simply be the standard deviation of \(\mathcal{E}\), which we can call \(\sigma_\mathcal{E}\).

Here’s a useful trick that many people get confused about in the astrophysics literature. You can convert between scatter in dex and the percent scatter really easily by exploiting an approximation of the natural logarithm: \(\ln(x) \approx x-1\) for values of \(x\) close to unity. Let’s see how this works out.

Let’s start by computing the standard deviation of the ordinary residuals, assuming a negligible bias. We’ll call it \(\sigma_X\):

\[\sigma_X = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (X_\mathrm{pred} - X_\mathrm{true})^2} .\]

Now, we’re really interested in the scatter in the true quantity \(X\) while holding the independent variables fixed (which is the same as holding the prediction fixed), so we can compute \(\sigma_X / X = \sigma_X / X_\mathrm{pred}\):

\[\frac{\sigma_X}{X}= \sqrt{\frac{1}{N-1} \sum_{i=1}^N (\frac{X_\mathrm{pred} - X_\mathrm{true}}{X_\mathrm{pred}})^2}\]

Now, using the approximation of the natural logarithm:

\[\frac{\sigma_X}{X} \approx \sqrt{\frac{1}{N-1} \sum_{i=1}^N [\ln(X_\mathrm{pred} / X_\mathrm{true})]^2},\]

and using \(\ln(x) = \ln(10) \log_{10}(x)\), we see that

\[\frac{\sigma_X}{X} \approx \ln(10) \sqrt{\frac{1}{N-1} \sum_{i=1}^N [\log_{10}(X_\mathrm{pred} / X_\mathrm{true})]^2}.\]

Hence, we see that \(\sigma_\mathcal{E} = \sigma_X / X \approx \ln(10) \sigma_\mathcal{R}\).

In summary, you can convert from \(\log_{10}\) scatter in dex to the percentage/fractional scatter by simply multiplying by \(\ln(10)\), and the accuracy of this conversion only depends on the accuracy of the natural logarithm approximation. For scatter below 0.1 dex, this corresponds to an error of less than 4%.

This is really important for communicating results regarding intrinsic scatter in predictive models used in astronomy. Unfortunately, a lot of researchers have never sat down and worked through this simple math above to understand what the units of their reported results are. I’ve come across several papers in the past week while I was putting together the results section of a manuscript that clearly compute their scatter in dex, but then report it everywhere as a percentage scatter. When they go on to compare their results to other findings in the literature, they find that their value is about two times smaller than values found in other work. Little did they know that if they had just properly multiplied their result by \(\ln(10)\), they would have found that their result agreed with other findings in the literature! It’s amazing that this gets past the peer review process, because I’ve stumbled across poor unit conversions for scatter like this too many times. Hence the blog article.