Calculation of exponential trend with occasional zero values

I originally tried LOGEST() but couldn’t properly relate the CALC documentation to my spreadsheet content. However, I had another look following your comment and, interestingly, it doesn’t show the same ERR:502 problem, at least for ‘a’.

If y = ae^bx I can identify the correct value for the exponent ‘b’ in the array returned by LOGEST() and a plausible value for RSQ(), but I cannot see how to calculate ‘a’. And this assumes the exponential function should be recast as

    ln(y) = ln(a) + bx*ln(e)   or  ln(y) = ln(a) + bx

The values for ‘a’ & ‘b’ returned by the formulae I actually used agree exactly with those returned by the exponential-trend option of the chart. SLOPE() is necessary for calculating ‘b’ but not ‘a’, and I think I only quoted the former.
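For illustration, formulae of this kind yield both coefficients (the ranges A2:A30 and B2:B30 are placeholders for the actual X- and Y-data; depending on the build, the expressions may need to be entered as array formulae with Ctrl+Shift+Enter because LN() is applied to a whole range):

    b:  =SLOPE(LN(B2:B30);A2:A30)
    a:  =EXP(INTERCEPT(LN(B2:B30);A2:A30))
    R²: =RSQ(LN(B2:B30);A2:A30)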

Thanks for your interest…

DavidL

Several points…

  1. LOGEST() does seem to have the same ERR:502 problem when the Y-data includes zero values.

  2. In my view the LOGEST() documentation is inadequate to the point of being misleading: it fails to explain properly how the function works or is used, and it should be rewritten.

  3. In the example above where y = ae^bx, if the array returned by LOGEST() is A1:C5 then the value of ‘a’ is evidently returned in A2 and the value of ‘b’ is the natural log of the value in A1.

Occasional zero values in observational data are certainly not uncommon, and I think an exponential trend calculation should not fall on its face when it strikes one!

DavidL.

Quoting @DavidL1: “In my view the LOGEST() documentation is inadequate…”
What “LOGEST() documentation” do you refer to here? Please give a link.

The relevant documentation of Calc functions is found in: https://docs.oasis-open.org/office/OpenDocument/v1.3/cs01/part4-formula/OpenDocument-v1.3-cs01-part4-formula.odt (also available in html and in pdf). There you find subchapter 6.18.42 LOGEST.
Quoting from that: “If any of the entries in KnownY and KnownX do not convert to Number or if any of the entries in KnownY is negative, LOGEST returns an error.”
You’re right, imo, insofar as “negative” should read “not positive”.

Quoting @DavidL1: “Occasional zero values in observational data are certainly not uncommon, …”
… and it’s the duty of the user to define how they should be handled.
The way it’s done for exponential trendlines in Charts is certainly open to doubt. However, Chart objects aren’t as precisely defined as Calc functions. You decide.

Thanks, Wolfgang, for the reference to the Oasis document. It’s getting late here now but I’ll read it with interest tomorrow as it’s far more informative than the LibreOffice offline-help included in the SuSE Linux distribution, which I termed “inadequate”.

LOGEST() probably uses logarithms, and the log of a zero or negative number will certainly fail. The Oasis documentation should probably say KnownY values must be >0, but the LibreOffice offline-help doesn’t mention any constraints at all. It’s difficult to decide how to treat quirks in the data when no functional constraints are mentioned!
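The failure mode is easy to confirm directly in a cell (ERR:502 is Calc’s “invalid argument” error):

    =LN(0)     returns ERR:502
    =LN(-1)    returns ERR:502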

In any case, it would help if LOGEST() popped up something more informative than ERR:502. I had to discover for myself that the zero KnownY value was causing the problem.

Finally, all the problems I’m reporting occur with CALC functions. I’m leaving the chart utilities out of it.

In his comment “Occasional zero values in observational data are certainly not uncommon, and I think an exponential trend calculation should not fall on its face when it strikes one!” the questioner expresses an expectation which software cannot fulfil.

Since I consider this expectation dangerous, and also the background of real misinterpretations of data disguised as calculated results, I post this as an answer with an attached example. It may not be seen as an answer to the original question.

Occasional errors in reported data are certainly not uncommon, and if an intended evaluation is specifically sensitive to some kinds of expectable errors, we need to discuss the case thoroughly. Software, and spreadsheets in particular, cannot do that for us.

Anyway: applying mathematical methods requires attention to their limitations.
As we can see nearly every day when studying “visualizations” in the media, we need to cultivate our doubts.

ask263906alternatives.ods

Having looked at the Oasis document mentioned above by Wolfgang, which describes LOGEST(), I can summarise as follows:

  1. According to the Oasis specification, LOGEST() should be able to handle zero values; see Sec. 6.18.42 under Semantics: “If any of the entries in KnownY and KnownX do not convert to Number or if any of the entries in KnownY is negative, LOGEST returns an error.” So either the document is incorrect or there’s a bug in the implementation.

  2. The LibreOffice Help for LOGEST() under Array Functions is woefully poor. There’s no mention of any constraint on Y-values, the structure of the returned array is left to the imagination, and the function it describes bears little resemblance to that described in the Oasis document. (My guess that the value of ‘b’ in y=ae^bx is the natural log of the value in the top-left cell of the returned array turns out to be correct; see the sketch below.)
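For a single X-variable laid out as the Oasis document describes (the coefficient m in the first cell of the first row, the constant immediately to its right), the two constants of y = ae^bx can be pulled out without expanding the whole array. A sketch with placeholder ranges (array entry may be needed); worth verifying against one’s own sheet, since the observation earlier in this thread found ‘a’ elsewhere in a three-column result:

    b: =LN(INDEX(LOGEST(B2:B30;A2:A30);1;1))
    a: =INDEX(LOGEST(B2:B30;A2:A30);1;2)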

We could have a discussion about whether it’s reasonable to reject zero Y-values (assuming the documentation is incorrect). The spreadsheet in question has two options for calculation of infection rate; one calculates over the 24-hour reporting period and the zero value occurs because the reported total of covid-19 cases was the same two days running (29th & 30th April, when the initial spread of infection had almost completely decayed).

I can’t immediately see an elegant way of resolving this issue which doesn’t involve either fiddling the data (e.g. by adding 0.1) or going beyond the capability of a spreadsheet.

Many thanks to Wolfgang for his involvement, especially in pointing out the existence of the Oasis specification.

Yes. Specifications also have flaws sometimes. The mentioned “6.18.42 LOGEST” has one, insofar as it uses the term “negative” where “not positive” is required. A frequent schoolboy’s mistake.
However, this is only the surface. As the calculations for chart trendlines do, we might ignore data rows causing errors (of the LOG(0) type in this case). But in many cases this would mislead the user, distracting their attention from a fundamental flaw in the data that needs thorough analysis.
Not only zero values, but also rapid changes (by factors >2, e.g.) would inevitably require thorough analysis, and consideration of whether the model used for data interpretation and smoothing is actually applicable.
Expecting a standard function/tool to dispense with this requirement is the wrong way.
There is already enough nonsense in the world dressed up as mathematical/statistical results. No need to make producing it more efficient.

(Continued)
Did you, e.g., replace the critical values in B7 and B13 with 0.01? What does the change in the trendline tell you?

That’s an interesting question.

I mentioned that the spreadsheet has two options for calculation of infection rate. Option-1 calculates each day’s infection rate over the 24-hour reporting period for that day (and the zero value occurs because the cumulative number of cases was the same two days running). Option-2 calculates each day’s infection rate by subtracting yesterday’s total from tomorrow’s total and dividing by the elapsed time (i.e. over two reporting periods), which reduces data noise of administrative origin.

Let the decay in infection rate after the first wave be represented by the trend

    infectionRate = ae^bt

where ‘t’ is the number of days since 29th February, and the units of ‘b’ are “new cases per day per infectious, non-quarantined carrier”.

(continued below…)

(a) Option-2: a = 9,545, b = -0.1155
(b) Option-1, but delete the row containing the zero value: a = 11,465, b = -0.1193
(c) Option-1, but substitute the mean of the rate yesterday and tomorrow: a = 9,373, b = -0.1143
(d) Option-1, but change the zero value to 0.01: a = 20,420, b = -0.1336
(e) Option-1, but change the zero value to 0.1: a = 16,129, b = -0.1278

The difference in exponents is quite small, even for (d), but I’m surprised the value of ‘a’ is so sensitive. And approximating zero with a very small value like 0.1 or 0.01 is not a good idea; I presume the effect of the abnormally small value is amplified by the least-squares calculation.
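A rough illustration of the amplification (plain arithmetic, not taken from the spreadsheet): the fit works on ln(y), and

    ln(0.1) ≈ -2.30     ln(0.01) ≈ -4.61

while the genuine rates map to far larger logarithms, so the single substituted point sits several log-units below the trend; least squares penalises that distance squared, and so distorts both coefficients.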

Quoting @DavidL1: “And approximating zero with a very small value like 0.1 or 0.01 is not a good idea;…”

My hint concerning the example attached to my answer was meant to show exactly that “the idea” is not just “not good” but needs to be rejected explicitly. It also isn’t an “approximation”.
My example was made to demonstrate that any automated handling of zeros in (explicit or implicit) applications of LOGEST() must inevitably fail. The only approach with even a faint justification is to exclude zero values from any contribution to the evaluation, as Chart trendlines do.

This is also bad, but the errors it causes may be expected not to be too serious in some everyday cases.

The only way to do it right is to study the situation thoroughly, then program the needed smoothing consciously with regard to the specifics of the case, and to dismiss the exponential model if it turns out not to be applicable.
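If, after such a study, excluding the zero rows is the conscious decision, it can at least be done explicitly rather than implicitly. A sketch, with X assumed in A2:A30 and Y in B2:B30 (placeholders, Y-data assumed non-negative; give each intermediate its own cell or defined name). The (B2:B30<=0) term only keeps LN() from failing on the skipped rows, which contribute nothing because of the (B2:B30>0) factor:

    n   = SUMPRODUCT((B2:B30>0)*1)
    Sx  = SUMPRODUCT((B2:B30>0)*A2:A30)
    Sy  = SUMPRODUCT((B2:B30>0)*LN(B2:B30+(B2:B30<=0)))
    Sxx = SUMPRODUCT((B2:B30>0)*A2:A30^2)
    Sxy = SUMPRODUCT((B2:B30>0)*A2:A30*LN(B2:B30+(B2:B30<=0)))
    b   = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    a   = EXP((Sy - b*Sx) / n)

These are the ordinary normal equations of the straight-line fit to ln(y), restricted to the rows with Y > 0, so the exclusion is visible in the sheet instead of hidden in a trendline routine.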

Quoting @DavidL1: “I presume the effect of the abnormally small value is amplified by the least-squares calculation.”

If you study the specifications of LOGEST() [6.18.42] and of LINEST() [6.18.41] as well, you will find that LOGEST() is basically LINEST() applied to the logarithms of the dependent data.
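That correspondence can be checked directly (placeholder ranges again; array entry may be needed):

    b: =INDEX(LINEST(LN(B2:B30);A2:A30);1;1)
    a: =EXP(INDEX(LINEST(LN(B2:B30);A2:A30);1;2))

Both should agree with the values recovered from LOGEST() itself.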

I have neither worked on it myself nor researched related publications concerning whether this is justifiable, and what conditions might need to be checked before application.

If you do, and find a way to get better (more widely applicable) calculations of coefficients for exponential trends, and also to report a kind of WarningIndex for risky cases, you may file an enhancement bug with the respective proposal for the specification of a LOGEST.NEW() function.

Using the least sum of squared deviations as the criterion for optimising approximations is not a dogma, nor proven to be the only reasonable way, but a well-proven way to get usable solutions, and also the best-studied one (conditions, consequences, algorithms…).
Using it the way LOGEST() does results in replacing the differences between the data and the values of the approximation by their ratios (in a sense). If you see a need to use the differences explicitly again: just try it.
To get solutions based on a different optimisation criterion, you may experiment with the non-linear solver.
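A sketch of such an experiment, assuming trial coefficients ‘a’ and ‘b’ are kept in the placeholder cells D1 and D2: put the sum of squared differences in a cell,

    =SUMPRODUCT((B2:B30-$D$1*EXP($D$2*A2:A30))^2)

and minimise it with Tools ▸ Solver, varying D1:D2 under one of the non-linear engines (e.g. the DEPS evolutionary solver). Seeding D1:D2 with the LOGEST() results should help convergence.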

Lupp stated: “My example was made to demonstrate that any automated handling of zeros in (explicit or implicit) applications of LOGEST() must inevitably fail. The only approach with even a faint justification is to exclude zero values from any contribution to the evaluation, as Chart trendlines do.”

I think we’ve pretty well dealt with this topic. I certainly agree that any manipulation of the data by any function should be avoided unless it’s inherent in the particular function. But I’m interested in your mention of Chart trends ignoring zeroes; I knew they ignored blank entries, but are explicit zero values also ignored?

(continued…)

Further, the spreadsheet in question has two exponential trendlines, one describing the rise of the initial infection and one its decay. The values of ‘a’ & ‘b’ in the exponentials (ae^bt) as calculated by the chart agree exactly with those calculated by LOGEST(). However, the chart displays the rather improbable value 1.0000 for R-squared in both cases, whereas LOGEST() returns 0.9655 and 0.8781.

Any comment?

Quoting @DavidL1: “I knew they ignored blank entries, but are explicit zero values also ignored?”
There is no specification I know of that tells me how the Chart module of LibreOffice should work in every case (and there are lots of cases!). Thus I can only judge from my experience, but I would not assume the X-Y-scatter exponential-trendline feature explicitly excludes zero values. As I interpret the observations, the routine used excludes rows producing errors from the evaluation.
Quoting @DavidL1: “Any {further} comment?”
No. But there’s a question: did you meanwhile attach your example somewhere?

My apologies Lupp… this spreadsheet has been through many revisions as various things were investigated, and the reason for the discrepancy in R-squared values came to me out of the blue a few hours ago. The bug has been fixed.
Thanks again for your help!