Performant Array sort (in memory) according to date/time

Hello everyone,

I wanted to sort an array of relatively large dimensions. (100 x 10000).
I am aware of sorting data in the calc sheets, however updating on the sheets is awfully slow.
Additional difficulty is a bit of a strange time format according to which I want to sort.

If you have better ideas for other datatypes, please let me know.

ReDim a(6,3)

a(0,0 ) = “03/12/2021;16:20:00”
a(1,0 ) = “03/12/2021;16:19:00”
a(2,0 ) = “03/12/2021;16:19:01”
a(3,0 ) = “05/12/2021;16:20:00”
a(4,0 ) = “03/12/2021;16:20:00”
a(5,0 ) = “31/12/2021;16:20:00”
a(6,0 ) = “03/08/2021;14:45:35”

a(0,1) = "som"

a(1,1) = “som”
a(2,1) = “so§em”
a(3,1) = “soe1m”
a(4,1) = “soesm”
a(5,1) = “soeasdfm”
a(6,1) = “soeme”

Converting date-time to text (or entering such data as text) spoils everything if you not use exclusively a reasonable format like ISO 8601 extended.. Dates given as texts formatted the improper way preferred in the USA is a very bad idea.
Dates represented as numbers as they mostly are in spreadsheets sort completely independent of the chosen NumberFormat.
What do you mean by “sort in memory”? The standard sorting tool you can use via the UI dialog is working with high efficiency in memory, of course, and makes its output to a sheet range.
Sorting Arrays in Basic is a bad idea again, and should only be done if clearly unavoidable. In case of date-time given in bad textual representations, it’s surely more efficient, to convert them to their numeric representations in a helper column, and then to use the standard tool.

1 Like

@Lupp I parsed this out of an xml, I can set the format of the string, but unfortunately it will be weired anyways…

What conversion would you recommend? In which format should I store them? (i could change it directly when i import it.

Regarding the Array sorting, I need it because I have the whole thing in memory first before it is even written to a sheet. As I will only partially use it (hence the filtering), I want to avoid showing all in the UI sorting and then deleting again…

Check the attached example.
disask76139_datetimesortingBasedOnTexts.ods (29.8 KB)

There are many ways to prepare texts as needed and to convert textual date-time to a standardized numeric representation.
Generally you should always prefer to negotiate ISO conforming communication of data.
If anybody insists on bad formats you can use text manipulating formulas in Calc to get it right - if your partner at least clearly states an assures the format he (f/m) actually used.

Once more I would state that this is the wrong way. Write your data into a sheet (may be a helper sheet disposable later), and sort them there using the efficient tools Calc provides. If you want to do that “unaudited”, you can write a short routine using the mentioned tools via API calls.

If you insist on sorting raw data in memory using Basic, you will need to study the algorithmic of sorting to some degree. Primitive algorithms are slow (inefficient). Better ones may need adapting.

Hi Lupp, thank you for the effort, I am aware of the typical string manipulation methods of libre. Replicating that in basic should be easy. However the main problem of sorting the mess via macro is still on hand.

I edited my previous comment.
If you insist on “I do it my way” you may need to pave that way yourself.

In my opinion, such tasks are much more convenient to solve in Calc.
In order for the text “03/12/2021;16:20:00” to be converted into a date, add a column with formulas like:

=--(MID(A1;7;4) & "-" & MID(A1;4;2) & "-" & MID(A1;1;2) & " " & MID(A1;12;8))

and sort by the new column in normal way.
The same path is recommended above by colleagues @Lupp and @Villeroy .

Calc’s built-in methods (in our case, sorting) have a big efficiency advantage over macros written in the Basic language.

  • If you are familiar with XSLT, you can write a filter which converts your XML into Calc XML, load the file directly into a spreadsheet and sort it.
  • If your XML happens to be Excel XML (an old XML format MS implemented some 20 years ago), then there is a filter for that shipped with your office suite, displayed as “MS Excel 2003 XML” in the XML filter settings dialog. If you happen to use MS Windows, you may need to run a custom install and add something like “experimental XML filters” or similar.
  • If your XML happens to be something else, Data>XML Source… may give you the opportunity to parse it. I never used that.
  • If you are familiar with Python, that programming language is shipped with your office suite as an alternative macro langauge. it can process XML and sort arrays among thousands of other things.

If you have strings in sheet cells, convert them to correct cell values and let the spreadsheet application do the sorting. Nothing is faster than that.

Like this:

REM  *****  BASIC  *****

Sub Main
    ReDim a(6,3)

    a(0,0 ) = "03/12/2021;16:20:00"
    a(1,0 ) = "03/12/2021;16:19:00"
    a(2,0 ) = "03/12/2021;16:19:01"
    a(3,0 ) = "05/12/2021;16:20:00"
    a(4,0 ) = "03/12/2021;16:20:00"
    a(5,0 ) = "31/12/2021;16:20:00"
    a(6,0 ) = "03/08/2021;14:45:35"

    a(0,1) = "som"

    a(1,1) = "som"
    a(2,1) = "so§em"
    a(3,1) = "soe1m"
    a(4,1) = "soesm"
    a(5,1) = "soeasdfm"
    a(6,1) = "soeme"


Dim da(6)
for i = 0 to uBound(da())
	Redim r(1)
	s = a(i,0)
	r(0) = mid(s, 7,4)&"-"& mid(s,4,2) &"-"& left(s,2) &" "& mid(s,12,8)
	r(1) = a(i,1)
	da(i)= r()
next i
sh = ThisComponent.Sheets.getByIndex(0)
rg = sh.getCellRangeByPosition(0, 0, 1, uBound(da()))
REM sh.getCellRangeByPosition(0, 0, 0, uBound(da())).CellStyle = "MyDateTime"
rg.setFormulaArray(da())
simpleSort rg, 0, True, False
End Sub

Sub simpleSort(oRange, iField%, bAsc as Boolean, bHeader as Boolean)
Dim oField as new com.sun.star.table.TableSortField
Dim oFieldsProp as new com.sun.star.beans.PropertyValue
Dim oHeaderProp as new com.sun.star.beans.PropertyValue
Dim aSortDescriptor()
   oField.Field = iField
   oField.IsAscending = bAsc

   oFieldsProp.Name = "SortFields"
   ' array of c.s.s.table.TableSortField (we set one of max. 3):
   oFieldsProp.Value = Array(oField)
   ' array of c.s.s.beans.PropertyValue
   oHeaderProp.Name = "ContainsHeader"
   oHeaderProp.Value = bHeader
   aSortDescriptor() = Array(oHeaderProp, oFieldsProp)
   oRange.sort(aSortDescriptor())
End Sub

I actually tested with 10000 rows, 15 columns each, containing textual date-time values as described (in column C). Everything was organized by a little (rather dirty) Basic program, but used API means, including a hidden spreadsheet document only created for interim usage. The conversion was done by a programmatically created formula filled down by .fillAuto method in the hidden sheet…

On my discounter system (Jan. 2012, recently updated with SSD) the needed times were:

  • Reading line by line from a text file and splitting into columns (no csv import used here): 1.9 s
  • Conversion of the textual date-time values to numeric representation: 0.43s
  • Sorting all the data using the numeric D-T as the only key: 0.15 s
  • Moving the results to the target sheet using the DataArray, and closing (disposing) the helper document: 0.97 s

You can’t get a comparably efficient sorting in Basic using native arrays.
You see, the time needed for the sorting was marginal.
Most time you can save by omitting the helper sheet and directly writing what you read and prepared from a text file to the final target sheet.

Even using full-grown programming systems (no Basic), you can only be faster with very large sets, and running a non-comparing algorithm like “radix sort” (named so for unknown reasons).

[edit datetime=2022-04-08T10:30 UTC]
Just ran another test not splitting the stop-watch times and got for the overall task:

  • For 10000 rows 3.49 s
  • For 20000 rows 6.17 s

This looks like “better than linear time-complexity” and may indicate that actually a radix-sort is used (probably only in the numeric case. l10n adapted collators will not be able to support this way.). Who does know?
[/edit]
[edit datetime=2022-04-08T10:45 UTC]
An additional experiment sorting based on the textual representation of D-T in the format
YYYY-MM-DD HH:MM:SS did not back the above idea. The time-complexity again was sub-linear, and the values were 4.51 s for 10000 lines; 6.99 s for 20000 lines.
[/edit]

1 Like

Thank you for this effort, I tried a lot of stuff and the SF Array sort is only good for 1D arrays. I will go with a hidden sheet that I spawn when required. This can be closed. Thank you all for the contribution

Indeed Sort() is for 1D arrays.
SortRows() (and SortColumns()) is for 2D arrays.

The main problem of sorting the mess via macro is still on hand.

The ScriptForge library contains an Array service
(read ScriptForge.Array service (SF_Array))
with a SortRows() method that does functionally exactly what you request.

It uses a Heap sort algorithm (see https://en.wikipedia.org/wiki/Heapsort).
Its advantage is that it does not require the rows of the array to be exchanged. Only their indexes are.

However sorting an array of 100 cols and 10000 rows in Basic remains a challenge.

3 Likes

let me try and then i will report! But this one looks very promising

Most (even very silly) D-T-formats can easily be converted to numeric by not too complicated spreadsheet formulas. If there are additional needs concerning the extraction of the source-representation from any specifically shaped file I lack knowledge of details and can’t comment.

The conversion, the sorting and the filtering are better done in a sheet, as long as the limits concerning the total numbers of rows and of lines not are exceeded. The sheet you are creating and using by your code can be one of a helper document that needn’t even be visible.

… and if the source happens to be a database, you can connect a Base document to it. If the dates in the database column are dates, a most simple query will do the job for millions of rows within a second. If the dates are stored as text, a slightly advanced query should sort millions of rows in a few seconds.

Nope, I am extracting a brokerage statement in xml form and have to parse it. Every update I need to check for the delta and append this to the list - importing everything everytime is not possible. Interactive Brokers does not provide more than 1 year worth of data and also you dont want to process thousands of transactions every time.

Sorting and filtering needs to be done in the macro as the solution from the trading community I am in was to have these fancy formulas - in fact so many of them that the sheet becomes incredibly laggy.

I thought I had stated more than once that a (probably hidden) temporary helper sheet can be a wonderful “implementation” of a 2D-array, well to handle with Basic and API means, in specific where filtering and sorting is needed, because that special kind of array comes with highly efficient software tools for the purpose.

I’m not interested in brokerage, and I won’t tell you how to extract information from any files containing it embedded in a XML structure as long as you don’t provide such a file and ask a related question.

But I can tell you how to efficiently sort and filter your information as soon as it is collected in an ordinary Basic array, or line by line in a raw text (“ASCII”) file. Give me such a text file, and I will prove my claim. I hope you agree that a csv file simply is a 2D-array “sui generis” with the advantage of being prepared for exchange/communication.

Once more: If you insist in disbelieving me and other contributors here, you need to study sorting, searching and filtering. (Donald E. Knuth named one of the volumes of his “Art of Copmputer Programming” Sorting and Searching.) This is not exactly LibreOffice or Calc matters.

I bet, you won’t find an algorithm implemented in Basic wich is less inferior to the means I suggested than by a factor of 100.

1 Like

Ok I have tried it:

I have done a performance test @Lupp ,

Method 1:
e.g. printing on a shadow sheet, then using the cursor to find used range, then use the sortdescriptor method to sort the whole thing and print to final result page.
Systemticks: 90

Method 2:
I took the array I had in memory already and used the scriptforge 2D array sorting function form here: ScriptForge.Array service (SF_Array) This thing uses heapsort.
Then printed the whole thing to the final output page.
Result: 2375 Systemticks.

→ Always print to range sort there might be a bit tedious in the code but works up to 30 times faster -.-