Calculate average of last x values that meet a condition

Kris_Kelley · April 8, 2022, 8:12pm

I’ve got a formula I use to calculate the average of all the values where the name matches another column.

AVERAGEIF($C$2:$C$48383,F2,$D$2:$D$48383)

Column C is the name every time it appears, and column D is the value associated with that occurrence. Column F contains the names, but only once.

How do I expand this formula to restrict the values used to only the last ten values, instead of all values that match? The table is in order, and there are no blanks. I think I need to use AVERAGEIFS, LARGE, and ROW, but I can’t get anything to work.

Zizi64 · April 9, 2022, 6:03am

Can you upload an ODF type sample file here?

Kris_Kelley · April 11, 2022, 8:50pm

Would this be what you are looking for?
ODF Sample.ods (669.8 KB)
I’ve got a full average in column F, but I’m trying to convert that to a moving average, just the last ten that matches the name. I realize now that not every name would have ten results, so the formula may need to incorporate that possibility… Of course I could just use an IF in another column, so that’s not necessary.

joshua4 · April 12, 2022, 12:51am

Can you clarify a little more. Are we to think of C and D (in the example sheet) as growing indefinitely and E and F as accumulators, each showing the average of the (at most) last ten entries of that label in C and D from the bottom up?

The way I see it, E&F have no particular reason to be beside C&D. E&F are just the “report” for the (moving) averages from C&D going up as far as ten occurrences of a given label (in C) from the bottom of C&D. Is this correct?

If correct, you could start with the following, perhaps, which will calculate the average of the largest up-to-ten values. You would need to modify for last up-to-10 using, as you mention, ROW. I might be able to look into this tomorrow again.

The idea here is to filter in an array formula (ctrl+shift+Enter).

=SUM(IFERROR((LARGE(IF(Entries=$F3,OFFSET(Entries,0,1),"~"),{1;2;3;4;5;6;7;8;9;10})),"~"))/COUNTIF(IFERROR((LARGE(IF(Entries=$F3,OFFSET(Entries,0,1),"~"),{1;2;3;4;5;6;7;8;9;10})),"~"),"<>~")

It’s really the same thing twice, just SUM/COUNT for average.

Kris_Kelley · April 12, 2022, 4:31pm

You would be correct in your assumption. E was actually generated from C as a non-duplicate column. It looks like this, partially, because I stripped unnecessary info (there was a date/time field in column A, etc.)

joshua4 · April 12, 2022, 3:43am

I think this does what you want…it seems to do what I expected. You might look it over and test it out.

It could be generalized by using the standard tricks for getting a sequence in an array formula (instead of using {1;…10}), and it might be optimized by naming the Entry Values separately rather than using OFFSET from Entries2.

It’s not especially snappy!

RollingAverageAccumulator.ods (633.2 KB)

Villeroy · April 12, 2022, 6:58pm

Well, of course this set operation is fairly easy to do and easier to understand in the context of a database. And a database does not need any maintainance once the tables, indices and queries are set up properly.

joshua4 · April 12, 2022, 10:32pm

I’m glad you looked at this thread, @Villeroy. I was going to ask, if you had a table with the fields:

Unique_ID, Timestamp, Entry_String, Entry_Value

what would be the SQL to do something like the question here…to get a report

Entry_String, Last_10_Avg

with each distinct string listed with its moving average aggregation?

Villeroy · April 13, 2022, 9:39am

I found many solutions to this problem on the internet. They all use some database engines that are not portable. It seems to be possible to do with LibreOffice’s “native database engine”, however a bit more complicated and less efficient. I’m working on this.
The first step to a lasting solution would be: Do not store row sets like this one in spreadsheets. Whenever you need such data in a spreadsheet, LO provides methods to make accessible any subset of database data so your formulas will be comprehensible because they deal with arithmetics instead of set operations.

Villeroy · April 13, 2022, 11:12pm

The uploaded database includes an embedded HSQL database, a form and a report.
I copied the 2 columns of item names and values as “TXT” and “VAL” into table “TBL” ignoring all the records where the value was text “#N/A”. That gives 55,158 records for 4,686 distinct item names. A 3rd column “ID” is basically an autommatically generated unique row number. I used that row number as “order of appearance”. Higher row number means “later”.

I copied the 33 calculated results from Joshua’s sheet for comparison.

Table “Filter” stores filter criteria for the form.

The database document weighs 2.7 MB because I added an index on item names “TXT” and values “VAL”. An index increases the lookup performance significantly. However, the database seems to hang for a few seconds when you close an object. This is because the whole database needs to be wrapped back into the document. A database connected to an external database does not show this lag.

The report “qLatest_All_Ag” is based on the equally named query. It dumps the averages and sums of the latest 10 records for each of the 4686 items to 92 print pages. This takes about 1 minute on my bread-and-butter laptop.

The form calculates the sum and average for one selected item at a time. The items can be filtered by a search text matching a part of the name or by a check box “Sheet Result”. When the check box is on, you get the 33 items with corresponding spreadsheet results. Check box off shows the other items. Check box in Null state ignores this criterion.
The filtered item names appear in the grey box.
The white box shows all corresponding values of the selected item.
The records in the 2 orange boxes should be identical. There is a simple method to select the latest 10 values of an item and there is a more complicated one. With HSQL only the complicated one can be aggregated. HSQL refuses to calculate any sum, average, minimum, maximum, deviation etc from the simple one. So the sum and average in the orange boxes are calculated from the second set. The grey box shows the looked up spreadsheet result (if any).

I failed to calculate the averages and sums for the highest 10 values of each item because the highest 10 values may include duplicates. Duplicates make it impossible to get the top 10 items ready for aggregation (sum, average etc.). There might be some different way (certainly there is with another DB engine) but I gave up on that problem for now.

It is very easy to append more pairs of name and value, with a most simple input form, directly to table “TBL” or by means of copy and paste (copy from spreadsheet, select “TBL” icon and paste). New name/value pairs appear instantly in the reloaded form and report without adjusting anything. You don’t even need to save anything. Everything is saved automatically when you give the database enough time for closing. For productive use, I would use another variant where the actual database is separated from the Base document.

Villeroy · April 14, 2022, 4:43pm

Same thing as an external HSQL database (version 2.4.1) including the driver: MovingAvg
Extract the zip into a trusted directory [1], open the database document MovingAvg.odb. A macro will connect the document with its database and driver.
The overall report loads in 30 seconds. No lag when closing database windows.

[1] trusted directory: Tools>Options>Security>[Macro Security…], tab “Trusted Sources”

joshua4 · April 15, 2022, 4:02pm

Thanks, @Villeroy. This will act as a true instructional for me. There are, in fact, almost no delays loading on my 2012 server-turned-desktop. Very handy way to distribute the database setup.
.
I’ve started to look at what all you’ve done, but it may be a few days before I get a chance to piece it all together. I did notice that you SELECTed a COUNT then used a HAVING on that COUNT. When I was trying to do it before asking I kept thinking I’d have a HAVING and a LIMIT after a sort, and was getting nowhere.

Villeroy · April 15, 2022, 5:24pm

SELECT * FROM TBL ORDER BY VAL DESC LIMIT 10

is the simple solution. It simply cuts off all records after position #10 from the ordered row set. However, HSQL does not allow any aggregation from ordered row set as in

SELECT SUM(VAL) AS "Sum" FROM (SELECT * FROM TBL ORDER BY VAL DESC LIMIT 10)

OMG. HSQL 2.4 runs that query and I did not notice. Everything I wanted to explain now is obsolete. Replace the MovingAvg.odb with

MovingAvg.odb (29.2 KB)

and have a look at the new form.

The queries:
qFilter selects the filtered items (grey table control)
qLargest sellects the largest 10 using the simple ORDER BY VAL DESC LIMIT 10 (yellow table control)
qLargest_Ag is the aggregation of qLargest (yellow boxes)
qLatest sellects the latest 10 using the simple ORDER BY ID DESC LIMIT 10 (orange table control)
qLatest_Ag is the aggregation of qLatest (orange boxes)
qLatest_All returns the 10 latest of all items
qLatest_All_Ag is the aggregation dumped into the report.

How to copy data from a form’s table grid into a spreadsheet:
Click the grey top-left corner between row selectors and column headers. then right-click copy. Or drag that corner onto a cell. Sometimes this transfers some very weird cell formattings. Just remove hard formatting.

Villeroy · April 15, 2022, 6:25pm

See FreeHSQLDB v.0.3 (View topic) • Apache OpenOffice Community Forum
Copy modules “AutoInstall” and “Helpers” into the database document and edit the named constants.

For the conversion from embedded 1.8 to external 2.4.1 I used [Python] Macro to extract and reconnect embedded HSQLDB (View topic) • Apache OpenOffice Community Forum

Query qFilter needed a small adjustment because 2.4 does not accept the alias name “SR” in the WHERE clause.