Calc, efficiency gap, external workbook, parametric URL

Dear community:

This is an inquiry about the efficiency of spreadsheet software and consequently advocation for LibreOffice (LO) Calc. While trying to advocate LO Calc to business colleagues, I encountered an irreconcilable efficiency challenge between LO Calc and MS Excel. I appreciate any guidance to identify the plausible culprit. Hopefully a solution of business feasibility may be found (not just technical feasibility). Perhaps the efficiency of LO Calc may be brought on par with that of MS Excel.

A report template in MS xlsm format contained many sumifs() and used parametric URLs to access data in multiple external workbooks. Details are listed in the [tests] section.

A similar efficiency gap was observed in my previous business report a few years ago. Both *.xls and *.ods formats were tested then. Similar construct of external workbooks, parametric URLs, and many sumifs(). The efficiency gap forced me to abandon LO Calc and sticked with MS Excel back then.

Wonder if efficiency gap has been observed by the community on alternative platforms? Like MS Windows 10, LO Calc 7.2.2.2 x64, etc.

Thanks in advance for guidance.

Sincerely,
Ray Jahn

[timing in seconds]

00:02.49 MS Excel, all 3 files open in Excel
01:05.07 LO Calc, all 3 files open in Calc
01:40.25 LO Calc, xlsm open in Calc, xlsx closed

The speed gap is drastic, 30x at least.

[tests]

2 xlsx files of source data.
1 xlsm file of report template.
All 3 files were in RamDisk.

xlsm file settings:

  1. manual recalculation mode (F9).
  2. blank cell if 0.0 value.

The xlsm file contains:

  1. parameters.
  2. parametric URLs.
  3. indirect( parametric URL ).
  4. many sumifs() using parametric URLs.

The parametric URLs refer to cell ranges in external workbooks. Parametric URLs were built from parameters and stored at run time. The stored parametric URLs were then used by many sumifs(). There were no constant URLs (constant references) in the xlsm file.

The cell region containing sumifs() extended to column S (page width of legible PDF), and down to row 411 at present. The rows of the report will increase as more items are added to the report. An abridged xlsm file is attached after conversion into ods format due to the restriction imposed by ask.libreoffice.org. The VBA code was retained in the ods file.

[hypothesis]

One plausible candidate capable of such magnitude in efficiency gap could be the file open operation. But I do not know how to test and prove / disprove this hypothesis. Guidance is appreciated.

  1. When does LO Calc execute file open to the external workbooks?
  2. Once for each parametric URL? (not likely, too clumsy)
  3. Once for each sumifs() call?
  4. Once for each = expression (each cell)?
  5. Once for each external workbook? (ideal, seeking evidence)

[system]

MS Windows XP SP3 x32.
CPU Pentium 3.4 GHz.
RAM 4GB.
RamDisk 500 MB by ImDisk 2.0.10, assigned as B: drive.

MS Excel 2007.
LO 5.4.7.2 (last version for XP SP3).
Java JRE 1.8.0.152 x32 (last version for XP SP3).

[MS VBA]

Minor primitive code. Not likely culprit for the efficiency gap between MS Excel and LO Calc.

Option Explicit
Option Base 0  ' 0 default, 1 optional, for MS VBA arrays
rem base 0, arrays, MS VBA
rem base 1, ranges, MS Excel
rem base 0, arrays, LO Basic
rem base 0, ranges, LO Calc

Public Function File_Date_Time(ByVal fdir as String, ByVal fname as String) as Date
Dim ffile as String
ffile = fdir & fname
File_Date_Time = FileDateTime(ffile)
End Function

' MS syntax => b:/[item 20211022 fbrc.xlsx]
Public Function link_ms_path(ByVal fdir as String, ByVal fname as String) as String
Dim txt_path as String
txt_path = fdir & "[" & fname & "]"
link_ms_path = txt_path
End Function

' MS syntax = 'fdir[fname]sht'!(col+rhead):(col+rtail)
' 'b:/[item 20211022 fbrc.xlsx]2022'!W2:W170
Public Function link_ms(ByVal fdir as String, ByVal fname as String, ByVal sht as String, ByVal rhead as Integer, ByVal rtail as Integer, ByVal col as String) as String
Dim txt_path as String
Dim txt_rang as String
txt_path = link_ms_path(fdir, fname)
txt_rang = "'" & txt_path & sht & "'!" & col & rhead & ":" & col & rtail
link_ms = txt_rang
End Function

item report.ods (21.7 KB)

[erAck: added ``` lines surrounding code blocks, see Guide]

Thanks for a very interesting topic.
In order to carry out comparative studies, we need a test file (or a macro that creates a test file) and a description of the testing methodology.
At first glance, the UDFs in your example are not optimal for Excel (and even less so for Calc).
We are waiting for an example file.

By the way, the SUMIFS function in Excel does not work with closed workbooks (and starts working if these workbooks are opened).

There is one test missing: How long does it take to load a foreign file format (ODF) in Excel?
Use Excel if you want to work with xlsx. Use Calc if you want to work with ods. Of course, both programs support import of foreign document formats but this should be a one-time process if you continue editing with the respective native file format of your preferred application.

Dear Lupp and sokol92:

Some info. regarding your query in the feedback

Sincerely,
Ray Jahn

[timing in seconds]

00:02.89 MS Excel, xlsm + 2 xlsx, URL in MS syntax, all open in Excel.
01:39.16 LO Calc, 3 ods files, URL in LO syntax, all open in Calc.
01:41.68 LO Calc, 3 ods files, URL in LO syntax, report open in Calc, 2 ods closed.
01:56.48 LO Calc, ods + 2 xlsx, URL in LO syntax, ods open in Calc, 2 xlsx closed.

The efficiency gap will increase when the 2 data files reside on conventional hard disk on PC, or on file server. Slower drive speed and network latency will exacerbate the gap.

[tests]

all files on RamDisk.
report region: column S, down to row 531.

Native syntax was used for all parametric URLs in respective report templates.
xlsm → URLs by url_ms_range().
ods → URLs by url_oo_range().

MS Excel reading *.ods template →
Failed. MS Excel 2007 could not handle the code module in *.ods. The code module disappeared after MS Excel 2007 read in *.ods template. Will have to change all parametric URLs (url_oo_range() calls) to constant URLs.

[Calc competition]

Source data files are in xlsx format currently reside on file servers. They are owned, updated and controlled by separate business groups.

Hope for some insight about the origin of efficiency gap. It will be to the delight of the LO team and myself if LO Calc could win favor over MS Excel in such report template.

[example data]

This is primitive material tracking, viz. arrival-consumption-inventory. Using spreadsheet technology, not database, a poor man’s survival. Previous DB effort by vendor ends in disaster.

Please contrive any time series. The regression test library at LO team might offer some. Use many sumifs(), each with 3~5 criteria, to repeatedly scan through the cell ranges in time series. Parametric URLs are handy helpers when there are many sumifs() calls. The timing results would be obvious.

Perhaps this efficiency gap test could be included in the regression test suite for LO Calc.

[Basic code]

Additional primitive code. Used in the construction of parametric URLs.

It is a nasty conversion between xlsm and ods formats by using LibreOffice 5.4.7.2. Simple saving in a different format will not work. The VBAProject must vanish. No data sheets or code modules in it. All data sheets and code modules must reside underneath the native ODS tree. Otherwise nothing would work.

Option Explicit
Option Base 0  ' 0 default, 1 optional, for LO Basic arrays
rem base 0, arrays, MS VBA
rem base 1, ranges, MS Excel
rem base 0, arrays, LO Basic
rem base 0, ranges, LO Calc

' FileDateTime() requires simple path, not URL
' FileDateTime( <drive>:/<dir>/<filename> )
Public Function File_Date_Time(byVal fdir As String, byVal fname As String) as String
dim ffile as String
ffile = fdir & fname
File_Date_Time = FileDateTime( ffile )
End Function

' OO syntax => 'file:///b:/filename.xlsx'
Public Function url_oo_path(ByVal fdir as String, ByVal fname as String) as String
Dim txt_path as String
txt_path = "'file:///" & fdir & fname & "'"
url_oo_path = txt_path
End Function

' OO syntax = 'file:///fdir/fname'#'sht'.(col+rhead):(col+rtail)
' 'file:///b:/filename.xlsx'#'shetname'.W2:W170
Public Function url_oo_range(ByVal fdir as String, ByVal fname as String, ByVal sht as String, ByVal rhead as Integer, ByVal rtail as Integer, ByVal col as String) as String
Dim txt_path as String
Dim txt_rang as String
txt_path = url_oo_path(fdir, fname)
txt_rang = txt_path & "#'" & sht & "'." & col & rhead & ":" & col & rtail
url_oo_range = txt_rang
End Function

' MS syntax => b:/[filename.xlsx]
Public Function url_ms_path(ByVal fdir as String, ByVal fname as String) as String
Dim txt_path as String
txt_path = fdir & "[" & fname & "]"
url_ms_path = txt_path
End Function

' MS syntax => 'fdir[fname]sht'!(col+rhead):(col+rtail)
' 'b:/[filename.xlsx]shetname'!W2:W170
Public Function url_ms_range(ByVal fdir as String, ByVal fname as String, ByVal sht as String, ByVal rhead as Integer, ByVal rtail as Integer, ByVal col as String) as String
Dim txt_path as String
Dim txt_rang as String
txt_path = url_ms_path(fdir, fname)
txt_rang = "'" & txt_path & sht & "'!" & col & rhead & ":" & col & rtail
url_ms_range = txt_rang
End Function

item report.ods (25.7 KB)

[erAck: added ``` lines surrounding code blocks, see Guide]

…also Xcel likley starts up from an MSservice in background Windows (by default), where as soffice-service is optional start up with OS … if it is not run ‘as a service’ LO will presumabley take longer to load up than Xcel, even with no documents(?)

edit: also, is the ‘30x’ time differnce really relevant, since the overall times for opening documents with any app is pretty darn quick…(esp. compared to the glory days of XP:) )…are there really enough operations that will require rapid reopening of these documents to such an extent that the fine-scale IO diffrences become relevant?

Have you tried a recent LO version on a recent Windows (or other OS) how performance compares? Nobody is going to investigate a 4 years old LO 5.4

Also, a test case should be complete. As is, there’s a document with a macro that doesn’t do anything because the referenced external files are not available. So nobody can answer “Wonder if efficiency gap has been observed by the community on alternative platforms? Like MS Windows 10, LO Calc 7.2.2.2 x64, etc.”. There’s nothing to measure.

…and while i think about it:
on a hunch, given the hardware and use of ram-disk, i would expect that the load times are massivley dominated by hardware/ram-disc/drivers, and that if moved to an operational environment and metal-hardware the differnece in load times would not scale up too much > i.e even on a 5200rpmHD the differnce in load times might remain at c.1-2 secs (if for example 80-90%+ of load time is hardware related) … i am kinda tempted to test due to XP nostalgia :), but, as noted by @erAck, valid results would require work and, as I suspect, ultimatley diffrences would be marginal/ not operationally signifcant in a fully deployed work environment

Thanks for advice and guidance. Additional test results follow.

[timing in seconds]

MS XP SP3

all files on RamDisk
00:04.28, 1. MS Excel, xlsm + 2 xlsx, all open in Excel.
02:47.02, 2. LO Calc, 3 ods files, all open in Calc.
02:42.07, 3. LO Calc, 3 ods files, report open in Calc, 2 ods closed.
02:48.21, 4. LO Calc, ods + 2 xlsx, all open in Calc.
02:44.49, 5. LO Calc, ods + 2 xlsx, ods open in Calc, 2 xlsx closed.

report on RamDisk, external workbooks on file server
00:22.32, 1. MS Excel, xlsm + 2 xlsx, all open in Excel.
02:45.14, 2. LO Calc, 3 ods files, all open in Calc.
02:45.36, 3. LO Calc, 3 ods files, report open in Calc, 2 ods closed.
02:45.92, 4. LO Calc, ods + 2 xlsx, all open in Calc.
02:51.33, 5. LO Calc, ods + 2 xlsx, ods open in Calc, 2 xlsx closed.

[system, MS XP SP3]

MS Windows XP SP3 x32.
CPU Pentium 3.4 GHz.
RAM 4GB.
RamDisk 500 MB by ImDisk 2.0.10, assigned as B: drive.

MS Excel 2007.
LO 5.4.7.2 (last version for XP SP3).
Java JRE 1.8.0.152 x32 (last version for XP SP3).

[timing in seconds]

MS Windows 10

all files on RamDisk
00:01.44, 1. MS Excel, xlsm + 2 xlsx, all open in Excel.
00:06.81, 2. LO Calc, 3 ods files, all open in Calc.
00:07.16, 3. LO Calc, 3 ods files, report open in Calc, 2 ods closed.
00:06.76, 4. LO Calc, ods + 2 xlsx, all open in Calc.
00:07.02, 5. LO Calc, ods + 2 xlsx, ods open in Calc, 2 xlsx closed.

report on RamDisk, external workbooks on file server
00:02.56, 1. MS Excel, xlsm + 2 xlsx, all open in Excel.
00:07.41, 2. LO Calc, 3 ods files, all open in Calc.
00:06.89, 3. LO Calc, 3 ods files, report open in Calc, 2 ods closed.
00:07.82, 4. LO Calc, ods + 2 xlsx, all open in Calc.
00:07.41, 5. LO Calc, ods + 2 xlsx, ods open in Calc, 2 xlsx closed.

[system, MS Windows 10]

MS Windows 10 Professional, 20H2 19042.928
CPU Intel Core i5-9400F, 2.9 GHz
RAM 8 GB
RamDisk 500 MB by ImDisk 2.0.10, assigned as B: drive.

MS Excel 2010.
LO 7.2.2.2.
JRE 1.8.0.212 x64.
JVM OpenJ9

[tests]

3 files in each test, including
2 files of source data (external workbooks),
1 report template.

timing measurement → Please note the section of [file settings] in report templates. Since the very 1st post in this thread, all timing records indicated the time lapse between the F9 trigger (manual recalculation) and the calculation termination. The termination was monitored by 3 indicators simultaneously:

  1. CPU load by spreadsheet software in MS task manager,
  2. mouse cursor state (whirling),
  3. progress bar of spreadsheet software.

The estimated uncertainty of manual timing by stopwatch is about 0.5~2.0 seconds.

[file settings]

in report templates, both ods and xlsm formats,

  1. manual recalculation mode (F9).
  2. blank cell if 0.0 value.

parametric URLs in native syntax of the report templates,
regardless of the formats of external workbooks of data.

  1. xlsm → MS syntax.
  2. ods → LO syntax.

The report templates (ods, xlsm) contained:

  1. parameters.
  2. parametric URLs.
  3. indirect( parametric URL ).
  4. many sumifs() using parametric URLs.

The parametric URLs refer to cell ranges in external workbooks. Parametric URLs were built from parameters and stored at run time. The stored parametric URLs were then used by many sumifs(). There were no constant URLs (constant references) in the xlsm file.

The cell region containing sumifs() in the report templates extended to column S (page width of legible PDF), and down to row 531 at present. The number of sumifs() calls reflected the real business cases.

Ideally all sumifs() should be anchored to the leading parameter cells of respective report region (see cell content for details) in the report templates. I just did a quick copy-and-paste to install enough sumifs() calls in order to ensure many scans repeating over the parametric URLs in the external workbooks.

3 example files:
1 ods report template.
2 ods external workbooks.
item report n.ods (99.5 KB)
item 20211101 clnt.ods (95.0 KB)
item 20211101 fbrc.ods (25.5 KB)

3 example files:
1 xlsm report template, disguised as ods in file name.
2 xlsx external workbooks.
item 20211101 clnt.xlsx (142.7 KB)
item 20211101 fbrc.xlsx (45.1 KB)
item report n.xlsm.ods (132.4 KB)

[timing analysis]

LO 5.4.7.2 exhibited a large efficiency gap w.r.t. MS Excel 2007.

LO 7.2.2.2 reduced the efficiency gap w.r.t. MS Excel 2010 to about 2.5x ~ 5.0x.

For computation intensive tasks (reports), if the CPU is sufficiently fast, users would not care about a time difference of less than 10 seconds.

[design of efficiency gap test]

The key factor in the test design about the efficiency gap was many scans over cell ranges in external workbooks. This was achieved by

  1. sizable cell ranges in external workbooks,
  2. many sumifs() calls in report templates.

Each data range contained p cells (different among external workbooks).

Each sumifs() contained n sets of criteria, equal to the number of manual filtering if done manually.

Each sumifs() criterion invoked one scan, via parametric URL, over one cell range in the external workbooks.

(m x n) range scans = (m sumifs() calls) x (n criteria),
(m x n x p x 6) cell scans,
repeatedly over mere 6 cell ranges in external workbooks.

The cell ranges in the example external workbooks were deliberately non-contiguous. This would minimize tricks of benchmark optimization on contiguous cell ranges.

The regression by LO Calc could opt to install more sumifs() with more criteria, i.e., more range scans, to appraise the efficiency gap.

[business significance]

On the surface, the report templates were just a bunch of nonsense, excessive sumifs() calls. These sumifs() scanned over and over again the mere few data cell ranges (6 in our case) in external workbooks, adjusting one criterion at a time. There was little technical justication from software engineers’ perspective.

In business, skilled managers instinctively translated such reports into the timing of working capital (cash flow). They spotted the looming surge of capital need, and mitigated its impact in time. A healthy, profitable business could go bankrupt if there were disruption to its cash flow (working capital).

[hypothesis]

Based on these test results, I am inclined to reject my 2021-10-29 hypothesis on file open operation.

The dependency graph (for sequential calculation) emerged as a probable replacement candidate for culprit. Dependency graph is called the calculation chain in MS documents. Optimization of the dependency graph is crucial for efficiency, when numerous, repeated scans over a finite set of cell ranges exist.

Unfortunately I do not know how to test and prove / disprove this hypothesis either. This subject touches upon the core design, efficiency optimization and code implementation of the Calc project. Guidance is appreciated.

[Moderator]

Dear moderator: I would suggest to label this discussion [closed], if the community agrees. Given a reasonable efficiency gap, LO Calc 7.2.2.2 is OK for small size problems (calculation load). The large efficiency gap for LO Calc 5.4.7.2 is not worth further investigation.

Dear community: Thanks for all your encouragement, enlightenment and discourse.

1 Like