Calculation of exponential trend with occasional zero values

DavidL1 · September 3, 2020, 6:25am

I’m using a spreadsheet formula for calculation of an exponential regression:
=EXP(INTERCEPT(LN(OFFSET(Data.$H$5,ph2StartIndex,0,ph2EndIndex-ph2StartIndex+1,1)),OFFSET(Data.$D$5,ph2StartIndex,0,ph2EndIndex-ph2StartIndex+1,1)))

but when stripped of the book-keeping it’s just the standard array formula given in HELP:
=EXP(INTERCEPT(LN(Yarray,Xarray)))

The problem is that a Y-variable can sometimes go to zero, which causes an ERR:502 exception because LN(0) is undefined. Since the Y-variables are themselves calculated, I could change 0 to 0.1 wherever it occurs, but that smacks of fiddling with the data.

Can anyone suggest a more elegant way of fixing the problem? Could the entries containing (y,x=0,x) be ignored by including a zero test in the above array formula? I see the built-in trend calculation doesn’t have this problem.

DavidL

Zizi64 · September 3, 2020, 8:06am

Please upload a real, ODF type sample file here

Lupp · September 3, 2020, 11:08am

I could change 0 to 0.1 wherever it occurs, but that smacks of fiddling with the data.

It not only smacks of fiddling, but is.
-1- If the calculation actually produces values crashing the further evaluation, it must be fixed based on the underlying model. If this isn’t possible, the model needs to be abandoned.
-2- If the zero values are signalling an unavoidable problem like accidental loss of a data point due to a logging error e.g, the respective data sets must be eliminated in advance of the evaluation of the series.

See also: ask263906exponentialTrendLineParameters_0.ods

DavidL1 · September 4, 2020, 7:35am

The basic problem is simple enough. The Commonwealth Dept. of Health here publishes (among other stuff) the cumulative daily total of Covid-19 cases, and I want the infection rate per day. (Note, this is a purely private exercise!) That involves subtracting the total yesterday from tomorrow’s total and dividing by the elapsed time.

The elapsed time is never zero, but the number of cases in that time period may be. Hence the occasional zeroes in calculated infection rate which cause the Err:502 since LN(0) is invalid.

Here is the formula which fails:

=EXP(INTERCEPT(LN(OFFSET(Data.$H$5,ph2StartIndex,0,ph2EndIndex-ph2StartIndex+1,1)),OFFSET(Data.$D$5,ph2StartIndex,0,ph2EndIndex-ph2StartIndex+1,1)))

I can post more if needed, but It’s a big spreadsheet and probably wouldn’t add anything to the above.

DavidL

Lupp · September 4, 2020, 8:46am

There was a faint spuspicion anyway the poblem would be in exacly this context.
What I already posted is applicable (case -2-). I didn’t study your formula in detail, however, because I would prefer to use LOGEST(). The problem how to exclude “bad” data should be the same, however, as with INTERCEPT(). BTW: Why didn’t you need SLOPE()?
I will now be off at least for 11 hours. Will probably write again later.

DavidL1 · September 6, 2020, 2:11am

I originally tried LOGEST() but couldn’t properly relate the CALC documentation to my spreadsheet content. However I had another look following your comment and, interestingly, it doesn’t show the same ERR:502 problem, at least for ‘a’.

If y = ae^bx I can identify the correct value for the exponent ‘b’ in the array returned by LOGEST() and a plausible value for RSQ(), but I cannot see how to calculate ‘a’. And this assumes the exponential function should be recast as

    ln(y) = ln(a) + bx*ln(e)   or  ln(y) = ln(a) + bx

The values for ‘a’ & ‘b’ returned by the formulae I actually used agree exactly with those returned by the exponential-trend option of the chart. SLOPE() is necessary for calculating ‘b’ but not ‘a’ and I think I only quoted the former.

Thanks for your interest…

DavidL

DavidL1 · September 10, 2020, 2:24am

Several points…

LOGEST() does seem to have the same ERR:502 problem when the Y-data includes zero values.
In my view the LOGEST() documentation is inadequate to the point of being misleading, it fails to properly explain how the function works or is used, and it should be rewritten.
In the example above where y = ae^bx, if the array returned by LOGEST() is A1:C5 then the value of ‘a’ is evidently returned in A2 and the value of ‘b’ is the natural log of the value in A1.

Occasional zero values in observational data are certainly not uncommon, and I think an exponential trend calculation should not fall on its face when it strikes one!

DavidL.

Lupp · September 10, 2020, 9:40am

Quoting @DavidL1: “In my view the LOGEST() documentation is inadequate…”
What “LOGEST() documentation” do you refer to here? Please give a link.

The relevant documentation of Calc functions is found in: https://docs.oasis-open.org/office/OpenDocument/v1.3/cs01/part4-formula/OpenDocument-v1.3-cs01-part4-formula.odt (also available in html and in pdf). There you find subchapter 6.18.42 LOGEST.
Quoting from that: “If any of the entries in KnownY and KnownX do not convert to Number or if any of the entries in KnownY is negative, LOGEST returns an error.”
You’re right, imo, insofar as “negative” should read “not positive”.

Quoting @DavidL1: “Occasional zero values in observational data are certainly not uncommon, …”
… and it’s the duty of the user to define how they shopuld be handled.
The way it’s done for exponential trendlines in Charts is obviously doubtable. However, Chart objects aren’t as precisely defined as Calc functions. You decide.

DavidL1 · September 10, 2020, 12:48pm

Thanks, Wolfgang, for the reference to the Oasis document. It’s getting late here now but I’ll read it with interest tomorrow as it’s far more informative than the LibreOffice offline-help included in the SuSE Linux distribution, which I termed “inadequate”.

LOGEST() probably uses logarithms, and the log of a zero or negative number will certainly fail. The Oasis documentation should probably say KnownY values must be >0, but the LibreOffice offline-help doesn’t mention any constraints at all. It’s difficult to decide how to treat quirks in the data when no functional constraints are mentioned!

In any case, it would help if LOGEST() popped up something more informative than ERR:502. I discovered the zero KnownY value caused the problem for myself.

Finally, all the problems I’m reporting occur with CALC functions. I’m leaving the chart utilities out of it.

Lupp · September 10, 2020, 11:49am

In his comment “Occasional zero values in observational data are certainly not uncommon, and I think an exponential trend calculation should not fall on its face when it strikes one!” the questioner expresses an expectation which cannot be implemented by software.

Since I feel this expectation to be dangerous, and also to be the background of actually occurring misinterpretations of data disguised as calculated results, I post this as an answer with an attached example. It may not be seen as an answer to the original question.

Occasional errors in reported data are certainly not uncommon, and if an intended evaluation is specifically sensitive concerning some kinds of expectable errors, we need to thoroughly discuss the case. Software -and in particular spreadsheets- cannot do that for us.

Anyway: Applying mathematical methods requires attention to the limitations.
As we can see next to every day when studying “visualizations” in the media, we need to cultivate our doubts.

ask263906alternatives.ods

DavidL1 · September 12, 2020, 1:42am

Having looked at the Oasis document mentioned above by Wolfgang (*) which describes LOGEST, I can summarise as follows:

According to the Oasis specification, LOGEST should be able to handle zero values, see Sc. 6.18.42 under semantics “If any of the entries in KnownY and KnownX do not convert to Number or if any of the entries in KnownY is negative, LOGEST returns an error.” So either the document is incorrect or there’s a bug in the implementation.
The LibreOffice Help for LOGEST() Array Functions is woefully poor. There’s no mention of any constraint on Y-values, the structure of the returned array is left to imagination, and the function it describes bears little resemblence to that described in the Oasis document. (My guess that the value of ‘b’ in y=ae^bx is the natural log of the value in the top-left cell of the returned array turns out to be correct.)

We could have a discussion about whether it’s reasonable to reject zero Y-values (assuming the documentation is incorrect). The spreadsheet in question has two options for calculation of infection rate; one calculates over the 24-hour reporting period and the zero value occurs because the reported total of covid-19 cases was the same two days running (29th & 30th April, when the initial spread of infection had almost completely decayed).

I can’t immediately see an elegant way of resolving this issue which doesn’t involve either fiddling the data (e.g. by adding 0.1) or going beyond the capability of a spreadsheet.

Many thanks to Wolfgang for his involvement, especially in pointing out the existence of the Oasis specification.

“Open Document Format for Office Applications (OpenDocument) Version 1.3. Part 4: Recalculated Formula (OpenFormula) Format, Committee Specification 01, 25 December 2019”
https://docs.oasis-open.org/office/OpenDocument/v1.3/cs01/part4-formula/OpenDocument-v1.3-cs01-part4-formula.odt

Lupp · September 12, 2020, 9:37am

Yes. Also specifications have flaws sometimes. The mentioned “6.18.42 LOGESTR” has - insofar as it uses the term “negative” where “not positive” is required. A frequent schoolboy’s mistake.
However, this is only the surface. As the calculations for chart trendlines do, we might ignore data rows causing errors (LOG(0) type in this case). But this will/would mislead the user in many cases distracting his attention from a fundamental flaw in the data needing thorough analysis.
Not only zero values, however, but also rapid changes (by factors >2 e.g.) would inevitably require thorough analysis and considerations concerning the question if the used model for data interpretation and smoothening is actually applicable.
Expecting a standard function/tool to dispense from this requirement is the wrong way.
There is already enough nonsense in the world copycatting mathematical/statistical results. No need to raise efficiency insofar.

Lupp · September 12, 2020, 9:45am

(Continued)
Did you, e.g. replace the critical values in B7; B13 with 0.01? What does the change in the trendline tell you?

DavidL1 · September 14, 2020, 12:48am

That’s an interesting question.

I mentioned that the spreadsheet has two options for calculation of infection rate. Option-1 calculates each day’s infection rate over the 24-hour reporting period for that day (and the zero value occurs because the cumulative number of cases was the same two days running). Option-2 calculates each day’s infection rate by subtracting yesterday’s total from tomorrow’s total and dividing by the elapsed time (i.e. over two reporting periods), which reduces data noise of administrative origin.

Let the decay in infection rate after the first wave be represented by the trend
infectionRate = ae^bt
where ‘t’ is the number of days since 29th February, and the units of ‘b’ are “new cases per day per infectious, non-quarantined carrier”.

(continued below…)

DavidL1 · September 14, 2020, 12:51am

(a) Option-2: a = 9,545 b = -0.1155
(b) Option-1, but delete the row containing the zero value:
a = 11,465 b = -0.1193
(c) Option-1, but substitute the mean of the rate yesterday and tomorrow:
a = 9,373 b = -0.1143
(d) Option-1, but change the zero value to 0.01:
a = 20,420 b = -0.1336
(e) Option-1, but change the zero value to 0.1:
a = 16,129 b = -0.1278

The difference in exponents is quite small, even for (d), but I’m surprised the value of ‘a’ is so sensitive. And approximating zero with a very small value like 0.1 or 0.01 is not a good idea; I presume the effect of the abnormally small value is amplified by the least-aquares calculation.

DavidL1 · September 14, 2020, 1:19am

This forum package is very restrictive. Obviously new-lines aren’t actioned on comments, and I can’t see how to delete my previous two. I can email you privately if you’d like, Wolfgang, or begin a new "“solution”.

DavidL1 · September 14, 2020, 7:22am

The next attempt was no more successful.

Lupp · September 14, 2020, 12:39pm

(a) Option-2: a = 9,545 b = -0.1155
(b) Option-1, but delete the row containing the zero value: a = 11,465 b = -0.1193
(c) Option-1, but substitute the mean of the rate yesterday and tomorrow: a = 9,373 b = -0.1143
(d) Option-1, but change the zero value to 0.01: a = 20,420 b = -0.1336
(e) Option-1, but change the zero value to 0.1: a = 16,129 b = -0.1278

The difference in exponents is quite small, even for (d), but I’m surprised the value of ‘a’ is so sensitive. And approximating zero with a very small value like 0.1 or 0.01 is not a good idea; I presume the effect of the abnormally small value is amplified by the least-aquares calculation.
_
This was only to exemplify the usage of doublespace needed to make a subsequent line break work in this “editor”.
It’s the same with questions and answers, but, as opposed to comments, you see the result there while still editing.
For lists and the like you can also use the preformatted text option…

Lupp · September 14, 2020, 1:03pm

Quoting @DavidL1: “And approximating zero with a very small value like 0.1 or 0.01 is not a good idea;…”

My hint concerning the example attached to my answer should exactly show that “the idea” not just is “not good”, but needs to be explicitly rejected. It also isn’t an “approximation”.
My example was made to demonstrate that any automated handling of zeros with (explicit or implicit) applications to LOGEST() must inevitably fail. The only attempt having a faint justification (kind of) is to exclude zero values from any contribution to the evaluation as Chart trendlines do.

This is also bad, but the errors it causes may be expected to be not too serious in some everyday cases.

The only way to do it right is to thoroughly study the situation, and then do program the needed smoothing consciously regarding the specifics of the case - and to dismiss the exponential model if it shows to not be actually applicable.

Lupp · September 14, 2020, 1:28pm

Quoting @DavidL1: “I presume the effect of the abnormally small value is amplified by the least-squares calculation.”

If you study the specifications of LOGEST() [6.18.42] and also of LINEST() [6.18.41] as well, you will find that LOGEST basically is LINEST applied to the logarithms of the dependent data.

I neither worked on it myself nor resaerched related publications concerning the question if this is justifiable, and what conditions might need to be checked before application.

If you do it, and find a way to get better (wider applicable) calculations of coefficients for exponential trends, and also to report a kind of WarningIndex concerning risky cases, you may file an enhancement bug with the respective proposal for the specification of a LOGEST.NEW() function.