My bug was writing [\u10000-\u1FFFF]; the correct form, as in your example, is of course [\U00010000-\U0010FFFF] :-).
So I tested, and com.sun.star.util.TextSearch is mostly faster than com.sun.star.sheet.FunctionAccess.
sub testA
dim s, c, d, i
d=GetSystemTicks()
for i=1 to 10000
s="‹Simboli grafici ascii› utili (AscSym) ➁" & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,344s
'c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '6,674s
next i
d=GetSystemTicks()-d
msgbox(s & chr(13) & d/1000 & " seconds")
end sub
It is possible to use inStr() as a quick pre-check to speed up the detection of characters above 64k (beyond U+FFFF):
sub testB
dim s, c, d, i
d=GetSystemTicks()
rem string with normal characters
for i=1 to 5000
s="‹Simboli grafici ascii› utili (AscSym) "
if inStr(s, chr(55348))>0 then
'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]")
c=getCountRegex(s, "[\U00010000-\U0010FFFF]")
end if
next i
rem string with >64k chars
for i=1 to 5000
s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]")
c=getCountRegex(s, "[\U00010000-\U0010FFFF]")
next i
d=GetSystemTicks()-d
msgbox(d/1000 & " seconds")
rem splitRegex_02 '5,518s
rem getCountRegex '3,392s
end sub
But the fastest way to detect 64k+ characters is a simple ubound(split(s, chr(55348))):
sub testC
dim s, c, d, i
' d=GetSystemTicks()
' for i=1 to 10000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' 'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,292s
' c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '11,158s
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
' d=GetSystemTicks()
' for i=1 to 10000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' if inStr(s, chr(55348))>0 then
' 'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,254s
' c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '6,845s
' end if
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
rem with split(), the loop count is increased because split() is fast
' d=GetSystemTicks()
' for i=1 to 200000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' if inStr(s, chr(55348))>0 then
' c=ubound(split(s, chr(55348))) '7,285
' end if
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
d=GetSystemTicks()
for i=1 to 200000
s="‹Simboli grafici ascii› utili (AscSym) ➁" & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
c=ubound(split(s, chr(55348))) '5,623s
next i
d=GetSystemTicks()-d
msgbox(d/1000 & " seconds")
end sub
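One caveat about the split() trick: chr(55348) is the single high surrogate 0xD835, so it only catches characters from U+1D400 to U+1D7FF, which is where all the test characters above come from. A more general (and surely slower, I have not timed it) check would count every high surrogate in the range 0xD800 to 0xDBFF. This is only a minimal sketch: the name countSurrogatePairs is made up for illustration, and the correction for negative values is an assumption in case Asc() returns a signed 16-bit number in some versions.

```basic
function countSurrogatePairs(s as string) as long
	rem hypothetical helper: counts UTF-16 high surrogates, i.e. characters above U+FFFF
	dim i as long, n as long, u as long
	n=0
	for i=1 to len(s)
		u=asc(mid(s, i, 1))
		if u<0 then u=u+65536 'assumption: asc() may return a signed 16-bit value
		if u>=55296 and u<=56319 then n=n+1 'high surrogate 0xD800..0xDBFF
	next i
	countSurrogatePairs=n
end function
```

For the test string used in testC it should give the same count as ubound(split(s, chr(55348))), because every long character in it starts with chr(55348).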
A few days ago I read an article about encodings which said that, for example, Chinese or Korean characters must be in UTF-16 because there isn't enough space for them in UTF-8. (Strictly speaking, UTF-8 can encode every Unicode character; it just needs 3 bytes for most of these characters where UTF-16 needs 2.)
So how about adding an optional parameter to the function? The detection of 64k+ characters would be on by default, but the user could turn it off if he is sure the text doesn't contain any long characters.
The same 64k+ detection is also needed for vc.goRight(len(oTR.string), true) 'extend multiselect viewCursor to end of current TextRange and select it, because if you add some long characters to the Headings, the problem will occur there too :-(.
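The optional parameter could look roughly like this. It is only a sketch: the function name countLongChars and the parameter bDetect are made up for illustration, and the counting uses the same chr(55348) trick as the tests above, so it only sees characters whose high surrogate is 0xD835.

```basic
function countLongChars(s as string, optional bDetect) as long
	rem hypothetical wrapper: bDetect is optional and the detection is on when it is omitted
	dim bOn as boolean
	bOn=true
	if not isMissing(bDetect) then bOn=bDetect
	countLongChars=0
	if not bOn then exit function 'caller is sure there are no long characters
	if inStr(s, chr(55348))=0 then exit function 'quick pre-check
	countLongChars=ubound(split(s, chr(55348)))
end function
```

Then n=countLongChars(s) runs the detection, and n=countLongChars(s, false) skips it and returns 0 immediately.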