My mistake was writing [\u10000-\u1FFFF]; the correct form is of course the one in your example → [\U00010000-\U0010FFFF] :-).
So I tested it, and com.sun.star.util.TextSearch is mostly faster than com.sun.star.sheet.FunctionAccess.
sub testA
dim s, c, d, i
d=GetSystemTicks()
for i=1 to 10000
s="‹Simboli grafici ascii› utili (AscSym) ➁" & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,344s
'c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '6,674s
next i
d=GetSystemTicks()-d
msgbox(s & chr(13) & d/1000 & " seconds")
end sub
It is possible to use inStr() as a quick pre-check, so the regex runs only when the string actually contains 64k+ characters (characters above U+FFFF):
sub testB
dim s, c, d, i
d=GetSystemTicks()
rem string with normal characters
for i=1 to 5000
s="‹Simboli grafici ascii› utili (AscSym) "
if inStr(s, chr(55348))>0 then
'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]")
c=getCountRegex(s, "[\U00010000-\U0010FFFF]")
end if
next i
rem string with >64k chars
for i=1 to 5000
s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]")
c=getCountRegex(s, "[\U00010000-\U0010FFFF]")
next i
d=GetSystemTicks()-d
msgbox(d/1000 & " seconds")
rem splitRegex_02 '5,518s
rem getCountRegex '3,392s
end sub
But the fastest way to detect 64k+ characters is a simple ubound(split(s, chr(55348))):
sub testC
dim s, c, d, i
' d=GetSystemTicks()
' for i=1 to 10000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' 'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,292s
' c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '11,158s
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
' d=GetSystemTicks()
' for i=1 to 10000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' if inStr(s, chr(55348))>0 then
' 'c=splitRegex_02(s,"[^\U00010000-\U0001FFFF]") '11,254s
' c=getCountRegex(s, "[\U00010000-\U0010FFFF]") '6,845s
' end if
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
rem with split(), the loop count is increased because split() is fast
' d=GetSystemTicks()
' for i=1 to 200000
' s="‹Simboli grafici ascii› utili (AscSym) " & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
' if inStr(s, chr(55348))>0 then
' c=ubound(split(s, chr(55348))) '7,285
' end if
' next i
' d=GetSystemTicks()-d
' msgbox(d/1000 & " seconds")
d=GetSystemTicks()
for i=1 to 200000
s="‹Simboli grafici ascii› utili (AscSym) ➁" & chr(55348) & chr(56753) & chr(55348) & chr(56754) & chr(55348) & chr(56672) & chr(55348) & chr(56674)
c=ubound(split(s, chr(55348))) '5,623s
next i
d=GetSystemTicks()-d
msgbox(d/1000 & " seconds")
end sub
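One caveat about the tests above: chr(55348) is the single high surrogate 0xD834, which only covers U+1D000 to U+1D3FF (the musical symbols used in the test string). A more general check would look for any high surrogate in the range 55296 to 56319 (0xD800 to 0xDBFF). Below is a minimal sketch of that idea. The function name is hypothetical, and it assumes that Len(), Mid() and Asc() work on 16-bit UTF-16 units. It is a per-character loop, so it is probably slower than the split() trick; I have not timed it.

```basic
' Hypothetical helper, not one of the macros tested above.
' Counts UTF-16 high surrogates, i.e. code units 55296..56319 (0xD800..0xDBFF).
' Each high surrogate starts one character above U+FFFF.
Function countHighSurrogates(s As String) As Long
	Dim i As Long, n As Long, code As Long
	n = 0
	For i = 1 To Len(s)
		code = Asc(Mid(s, i, 1))
		' guard in case Asc returns a signed 16-bit value
		If code < 0 Then code = code + 65536
		If code >= 55296 And code <= 56319 Then n = n + 1
	Next i
	countHighSurrogates = n
End Function

Sub testCountHighSurrogates
	Dim s As String
	' two characters above U+FFFF with different high surrogates:
	' U+1D1B1 (55348, 56753) and U+1F600 (55357, 56832)
	s = "abc" & Chr(55348) & Chr(56753) & Chr(55357) & Chr(56832)
	MsgBox countHighSurrogates(s)
End Sub
```

A cheap inStr() or split() test for the characters you actually expect can still run first, with this loop as a fallback for everything else.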
A few days ago I read an article about encodings. As far as I understood it, some Chinese or Korean (CJK) characters, for example, lie above U+FFFF, so they do not fit into a single 16-bit UTF-16 unit and are stored as two units (a surrogate pair).
So how about adding an optional parameter to the function? Detection of 64k+ characters would be on by default, but the user could turn it off if they are sure the text contains no such long characters.
The detection of 64k+ characters is also needed for
vc.goRight(len(oTR.string), true) 'extend multiselect viewCursor to end of current TextRange and select it
because if you add some long characters to the headings, the problem occurs there as well :-(.
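The optional switch and the goRight() adjustment suggested above might look roughly like the sketch below. The function name cursorSteps and the parameter bDetect are made up for illustration. It also rests on an assumption taken from the observed problem, not from documentation: that the cursor moves over a surrogate pair in one step while len() counts it as two.

```basic
' Hypothetical sketch; cursorSteps and bDetect are illustrative names.
' Returns the number of goRight steps needed to cross s, assuming the
' cursor treats a surrogate pair (one 64k+ character) as a single step.
Function cursorSteps(s As String, Optional bDetect) As Long
	Dim bCheck As Boolean, i As Long, n As Long, code As Long
	bCheck = True ' detection is on by default
	If Not IsMissing(bDetect) Then bCheck = bDetect
	n = Len(s) ' length in 16-bit units
	If bCheck Then
		For i = 1 To Len(s)
			code = Asc(Mid(s, i, 1))
			' guard in case Asc returns a signed 16-bit value
			If code < 0 Then code = code + 65536
			' high surrogate 55296..56319 (0xD800..0xDBFF): the pair counts once
			If code >= 55296 And code <= 56319 Then n = n - 1
		Next i
	End If
	cursorSteps = n
End Function
```

The call would then be vc.goRight(cursorSteps(oTR.string), true), or cursorSteps(oTR.string, False) to skip the scan when the text is known to contain no long characters.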