Help with TreeSort - new algorithm for sorting etc

KamilLanda · January 1, 2023, 9:05am

TreeSort is an algorithm for sorting, removing a duplicities and finding an uniques – all at once. It is possible to extend it also for fast searching and also for searching that ignores the lower/upper-cases or diacritic.
There isn’t any comparison of strings nor swapping ones. The algorithm detects used characters in 1st pass, then it makes the statistics about characters in 2nd pass (this statistics has form of tree), and one reading of this tree returns the result.

I split it to 4 parts - Simple, Extended, Complex and Multi. It is fresh work, and examples are 1st prototypes :-). There is Simple and Extended variant in LibreBasic, and Simple in Python.

There are places for improvements and optimalization. For example I don’t know how to read the tree without recursion that takes a lot of RAM - function recurArray.

ajlittoz · January 1, 2023, 9:27am

I think you posted on the wrong site. Here we are interested in questions related to usage and good practice in LO from the user’s point of view.

A more adequate site for general programming would be some from the Stack Excange family.

Sorting algorithms are one which need breaking advantages to be adopted. Have you studied Knuth’s general presentation in Art of Computer Programming? Have demonstrated stability behaviour of your algorithm? Have you estimated an asymptotic time Big-O for it? the requirement for memory?

What are the advantages compared to production-proven common algorithms? What is its best target?

All these questions and many others must be addressed, but not on AskLO.

erAck · January 1, 2023, 4:39pm

Well, it is related, the repository material are .odt and .bas files…

KamilLanda · January 1, 2023, 4:50pm

I would like to use this algorithm in my LibreOffice extension for fast repeated searching the dictionary.
The examples are in LibreOffice Basic and Python, so I hoped maybe somebody help me to remove the recursion or make some optimalization, because here are some people that know these languages well :-).

universally:

algorithm is stable, there isn’t anything to do instability.
time is linear
requirement for memory - it seems there is only chance to estimate the maximum size of RAM it could be used. But I don’t know how to do it.The RAM could be the biggest problem.
the advantage is speed, it seems it is faster to create and read the tree than compare strings and swapping.

I haven’t proofs for these theses, but if you will comprehend how it works, I think you will discover the same :-).

I’m sorry if you think it isn’t good for LibreOffice, I contribute only to this forum and sometimes to FontForge forum, so I hoped I recieve some advice here for example for LibreBasic :-).

ajlittoz · January 1, 2023, 5:20pm

It is however a very loose relationship. The .odt material tries to describe the data structure without pinpoint details. I have the feeling that LibreBasic was used for a proof-of-concept because it is backed by LO UI, allowing to display “easily” results.

I still don’t see to which LO problem it could answer. The proposed algorithm seems to belong in the dictionary-based look-up query. It has a non-negligible initialisation cost (necessity to collect all words and built the tree storage. It is quite likely that an existence request is quite fast. But if usage is intended for a dynamic environment, it is important to evaluate insertion of a new word and deletion of an existing word costs.

And these questions are rather mathematically based than LO based.

Even if a possible application is spell-checking, improved variants of Knuth-Pratt-Morris could be faster because they make use of the intrinsic redundancy in the target string, which this proposal does not. You must explore all characters in sequence.

ajlittoz · January 1, 2023, 5:36pm

A sorting algorithm is “stable” if it keeps the order of inputs in case of a tie. Not relevant here since you apparently remove duplicates.

I don’t trust this assertion. It needs a strong mathematical demonstration to be endorsed. You must provide at least two estimates: time to find an existing word (will depend on number of words in dictionary); time to assert word is missing from dictionary (usually quite different from the previous estimate)

That why most known algorithms preprocess strings is some way. Even if you don’t compare strings as global object, you’ll still compare bytes (even with multibytes character set like Unicode).

In efficient algorithms you avoid moving characters (swapping, rearranging, …) because (1) not all bits carry information, (2) strings tend to be rather long. Instead, you play with pointers for which efficient hardware instructions exist.

You should weigh how many times you need to build your dictionary versus how many times you query it. Sometimes a very clever algorithm is defeated by the cost of building its auxiliayre tables.

Personally, I’ve used optimised Knuth- Pratt-Morris with latest improvements by Lecroq. Results are really impressive but understanding the theorems and the algorithm is not immediate. But the resulting programming is surprising short (and hard to debug when something goes wrong).

KamilLanda · January 1, 2023, 7:12pm

I used LibreBasic because it was the easiest way for me to test the algorithm is functional, no to bring forward some mathematical proof :-). And I would like to use it in LibreOffice extension.

The insertion of new word is easy, it need only add the branch(es) if the related sub-branch isn’t exists; or add the number 1 if sub-branches are created for this sequence of letters.
The same with delete of word. Remove the count of word or remove the sub-branch. Nothing complicated.

I thought some other definition of “stable”, thank you for the specification. It could be stable in Complex variant with the indexes to primal data. But it will take more RAM. And I’m not so far and Complex variant isn’t started.

The time to find an existing word is short, if you have for example 5-letters word ACEBA, then you do finding like this. You know from “filter” (array for branches) the A has index 1, B 2, C 3, E 4. So you control the array(1)(3)(4)(2)(1).
if isArray(1)(3)(4)(2)(1) then return array(1)(3)(4)(2)(1)(0) that means the count of word, and if this number>0 then word exists. OR if (1)(3)(4)(2)(1)>0 then word exists.
Of course in Librebasic you need to “jump over indexes” step by step: if isArray(1) then if isArray(1)(3). But in Python it is possible to use eval and catch the exception.

Of course the test if word is missing is the same :-).

There isn’t comparison of bytes, because it works with Ascii codes in LibreBasic: Asc(character) and it only “checks the indexes” in arrays (branches) that are set from “filter”. For every Ascii code it uses the index from primal “filter” and “check” the element in array for this index.
So the primal array with indexes from Ascii codes is (for example for ABCE): array(65 to 69) with the values (1, 2, 3, x, 4). And it returns 1 for Ascii 65, 2 for 66, 3 for 67 and 4 for 69. So for every letter it only takes the index and looks to the branch for the element with this index → and tests isArray or isEmpty or takes the number from this element.

I need to build a dictionary only once time, but I need search it more times and the most quickly. I tested the cs/en dictionary I use, and repeated searching (about 15x) took about 15 seconds in Python, but the searching with TreeSort should be in microseconds even in LibreBasic :-).

But my main problem is the RAM that is “eaten” via reading the tree by recursion, if somebody know how to remove the recursion for reading the tree, it will be very helpful for me now :-).

Lupp · January 1, 2023, 11:21pm

@KamilLanda: Can you excuse me for commenting without having studied your code in advance?

I was much interested in sorting about 40 years ago, but I’m too old now to remember all the details. Anyway: When I had “invented my optimal sorting algorithm” I found that it was known in the literature (D.E. Knuth namely) as “radix sort” [what I thought was a very bad name]. However, it is a non-comparing algorithm, and can therefore not be used with arbitrary collators. You need any kind of “alphabetization”: a lexicographic order based on some alphabet. Functionally such an algorithm reverses the relation between the concepts of comparing and sorting: You compare two items actually by looking which one comes first after sorting: lexicographic comparison. An implementation of the algorithm may be seen as the construction of a kind of ‘tree’, described as a sequence of “packs” like used for a card-game. Exemplified using the unextended latin alphabet you can see every letter as the root of a stack to which you add on top the items starting with that letter. Having done so looking at every first letter exactly once, you have to look on every second letter exactly once - once at most if one-letter-words may occur and can be output in advance. This way we have evidence that any set of n items assured to have at most m letters per element can be sorted performing at maximum m*n fundamental steps. Read this as : “There is a linear upper limit in runtime.” or “Time complexity is limited linearly.”
The good news: You don’t need a new “plant” of stacks for the recursion. Having finished one level you can (logically) put all the packs one by one in the given order on top of the others only memorizing the limits: “bottom of stack” must be set as an attribute for the respective item. Now processs the top stack till bottom exactly the same way recursivley outputting whatever is ripe for output.
“Storage complexity is limited linearly.” (Basically one chain.)
In short: Don’t organize the data while sorting is done as a tree, but as a chain.

BTW: I would like a testing example of random words readily prepared to be passed to the Basic implementation of your algorithm for a stepwise inspection. I think you might simply attach it here not needing a third-party-site.

ajlittoz · January 2, 2023, 8:50am

@KamilLanda
Your extended description reminds of a similar problem I had to handle many years ago. I solve it slighlty differently (using criteria to efficiently discriminate between words before ending at character scanning).
I find my personal notes, convert them to present LO and them them through private mail. Stay tuned.

KamilLanda · January 2, 2023, 10:44am

Thanks a lot to all for comments, it is helpfull for me :-)!
Absolutely never mind you didn’t study the documentation yet. My English isn’t so good and I don’t want to afflict somebody with not so good English :-). But the true is there isn’t a lot of text, because there are mainly images with tree for given variants. But let’s solve the Simple variant now, it is easist and fastest :-).
The question is: how to remove recursion in step 4)?

@Lupp
I decided if to add the example in Basic here, but there are 3 examples (2x BAS, 1x PY) a I didn’t want to “overflow” this Ask at start with more examples :-).

I let Czech letters in constant sINIT, so change it for your example. And fill array p() with your words to test.

private gpi() 'array with the indexed letters
private gi&, gpOut() 'index and array for result

Sub TreeSortSimple
	on local error goto bug
	dim oDoc as object
	dim p(), i&, j&, k&, s$, s1$, min&, max&, p1(), iCount&, a&, index&, iLen&, iEmpties&
	max=0 : min=&hffff& 'maximal and minimal Ascii code for used characters (min=&h10ffff& for Full-Unicode)
	dim pb(max to min) as boolean 'info array, which characters are used
	
	rem small Czech alphabet - edit it for your use, separate the characters by space, the space has reserved word: space
	'! but don't add characters with more letters like 'ch', this version is only for one-letter characters!
	'!!! and there must be all characters used in array p(), this version hasn't any control for unknown letters!!!
	CONST sINIT="space a á b c č d ď e é ě f g h i í j k l m n ň o ó p q r ř s š t ť u ů ú v w x y ý z ž" 'order for sorting
	p1=split(sINIT, " ")
	for i=lbound(p1) to ubound(p1)
		if p1(i)="space" then
			p1(i)=" "
			exit for
		end if
	next i
	
	rem the array with items to sort :-) -> use only characters that are in constant sINIT !!!
	p=array("aa", "baba", "ba", "oba", "baobab", "ob", "bob", "boa", "a", "aa", "aaa", "ah", "ahoj", "ahu", "aha", "ah", "aa", "", "", "grázl", "aa", "aab", "cáco", "cácora", " chamraď", "čmuchači", "chechot", "vochechule", "vochmelka", "smraďoch", "svoloč", "tupec", "jasoň", "buřič", "drsoň", "blivajz", "lemra", "čumič", "žvanil", "grázl", "jasoň", "buřič", "drsoň", "bli vajz", "blivajz", "lemra", "čumič", "bab", "cab", "bábovka", "bab", "cab", "babizna", "aaa", "aa", "bb", "aa", "a a", "aa", " aaa", "aa", "aa", "aaa", "aa", "a", "b", "bb")

rem 1) Detect used characters
	if ubound(p)=-1 then exit sub 'entry input array
	for i=lbound(p) to ubound(p)
		s=p(i) 'current word
		if s<>"" then
			iLen=len(s)
			for j=1 to iLen
				a=asc(mid(s, j, 1)) 'ascii code of current character
				if a<min then min=a 'minimal ascii code
				if a>max then max=a 'maximal ascii code
				pb(a)=true 'character is used
			next j
		else 'count of empty strings
			iEmpties=iEmpties+1
		end if
	next i

rem 2) create indexed arrays
	redim gpi(1+ubound(p1)) : i=0
	dim p2(min to max) 'the index is the ascii code, the value is the sorting order
	for each s in p1
		a=Asc(s)
		if a>=min AND a<=max then 'maybe the character is used
			if pb(a) then
				iCount=iCount+1
				gpi(iCount)=s 'character is used
				p2(a)=iCount
			end if
		end if
	next
	redim preserve gpi(iCount)
	'xray gpi
	'xray p2

rem 3) build the tree
	dim tree(ubound(gpi)) : tree(0)=0 'first element is: array(Count of word; true/false = if the current branch has some sub-branches)
	dim level as variant, p3() 'level is current branch; p3() is temporary array
	for i=lbound(p) to ubound(p)
		s=p(i) : iLen=len(s) 'current word
		level=tree 'current branch for loop
		for j=1 to iLen 'go through characters in word
			s1=mid(s, j, 1) 'current character
			rem add the array to the tree
			index=p2(asc(s1))
			if isEmpty(level(index)) then 'still no character in cell in branch
				if j=iLen then 'last character from word, so it means end of word
					level(index)=1 'put only number
					tree(0)=tree(0)+1 'increase the count of all words
				else 'no last character
					redim p3(iCount) 'new branch with empty values
					p3(0)=0 'count is 0; but the sub-branch is
					level(index)=p3 'add new branch to tree
					level=level(index) 'take new branch
				end if
			elseif isArray(level(index)) then 'the letter is in branch
				if j=iLen then 'last character from word, so it means end of word
					level(index)(0)=level(index)(0)+1 'increase the count of word
					tree(0)=tree(0)+1 'increase the count of all words
				else 'no last character
					level=level(index) 'take new branch
				end if
			else 'the letter was last for some previous word, so move the number of count to new sub-array to index(0)
				redim p3(iCount) 'new branch with empty values
				p3(0)=level(index) 'count is moved to new branch
				level(index)=p3 'add new branch to tree
				level=level(index) 'take new branch		
			end if	
		next j
	next i

rem 4) read the tree
	'xray(tree)
	redim gpOut(tree(0)) 'same size for output array like the input array
	a=msgbox("Yes - list without duplicities" & chr(13) & "No - only uniques", 35)
	if a=6 then 'Yes - LIST
		gi=0
		recurArray(tree, "") 'condition in recursion is >0
		gpOut(0)=array("TOTAL", tree(0)) 'total count of items		
	elseif a=7 then 'No - UNIQUES
		gi=1
		for i=1+lbound(tree) to ubound(tree) 'for uniques is problem with 1st index tree(0) because it is total count of all items, so it need go from tree(1)
			if NOT isEmpty(tree(i)) then recurArrayUniques(tree(i), gpi(i)) 'condition in this recursion is =1
		next i
		gpOut(0)=array("UNIQUES", gi-1) 'total count of items
	else 'Cancel
		exit sub
	end if
	redim preserve gpOut(gi-1)

rem 5)	report the result
	s=""
	for i=lbound(gpOut) to ubound(gpOut)
		s=s & gpOut(i)(0) & " " & gpOut(i)(1) & "×"
		if i<> ubound(gpOut) then s=s & chr(13)
	next i
	msgbox(s, 0, "Sorted & Counted")
	exit sub
bug:
	bug(Erl, Err, Error, "TreeSort")
End Sub

Function recurArray(level as variant, s$) 'recursion for reading the branches, output to global array gpOut
	on local error goto bug
	dim i&
	if level(0)>0 then 'for the count of current word
		gpOut(gi)=array( s, level(0) ) 'add to global output array
		gi=gi+1
	end if
	if isArray(level) then
		for i=1+lbound(level) to ubound(level) 'next indexes in branch
			if NOT isEmpty(level(i)) then recurArray(level(i), s & gpi(i)) 'get the sub-branch
		next i
	end if
	exit function
bug:
	bug(Erl, Err, Error, "recurArray")
End Function

Function recurArrayUniques(level as variant, s$) 'only one difference unto recurArray is in condition If ... Then
	on local error goto bug
	dim i&
	if level(0)=1 then 'for the count of current word
		gpOut(gi)=array( s, level(0) ) 'add to global output array
		gi=gi+1
	end if
	if isArray(level) then
		for i=1+lbound(level) to ubound(level) 'next indexes in branch
			if NOT isEmpty(level(i)) then recurArrayUniques(level(i), s & gpi(i)) 'get the sub-branch
		next i
	end if
	exit function
bug:
	bug(Erl, Err, Error, "recurArrayUniques")
End Function

Sub bug(sErl$, sErr$, sError$, sFce$) 'error message
	msgbox("line: " & sErl & chr(13) & sErr & ": " & sError, 16, sFce)
	stop
End Sub

@ajlittoz
I also re-discovered Radix sort before about 5 years, but TreeSort isn’t Radix :-).

erAck · January 2, 2023, 12:16am

A possible approach instead of recursion, as you seem to already have a tree (I didn’t read the source), would be a visitor while traversing the tree either in depth-first or in breadth-first. There are different algorithms that traverse in different orders and instead of a full stack remember either the visited while in use, or best case if it is known and guaranteed to be a directed acyclic graph (DAG) only the next to be visited nodes/vertices, ideally just pointers, like explained in this depth-first or breadth-first. (there are zillions of other explanations on the net, with or without more or less better or worse example code).

KamilLanda · January 2, 2023, 10:45am

I teoretically know it is possible to traverse the tree without recursion …

… but it was problem when I tried to find concrete proceeding. Google is fu.ker - it showed some pages with some theories because this pages are most known. And then only stupid links to some different pages where was the word ‘recursion’ but not with examples for programming. Old classical FullText searching was better than this stupid AI invisibly-maximum-money-to-google searching.
I could try some Czech “mathematical” forums, but there aren’t people that know LibreBasic, and I don’t know well other programming languages :-).