Error: 'ascii' codec can't encode character u'\ufffd'

Hi,

For a few days I have been writing StarBasic custom functions for Calc which call
external Python functions.

Yesterday I found this error:

-----------------------------------------------------------
BASIC runtime error.
An exception occurred 
Type: com.sun.star.uno.RuntimeException
Message: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128), traceback follows
------------------------------------------------------------

I bypassed it in Python with

   tmp = dTlx[isin]['descrizione_masterchart']
   tmp = tmp.replace(u'\ufffd',"?")

But I think the problem can appear in other forms, since the text in my database may contain other Unicode characters,
and here it seems that StarBasic is trying to convert the string to ASCII.
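
A more general form of that workaround, as a minimal sketch: instead of replacing only U+FFFD, replace every character that ASCII cannot represent. The helper name to_ascii_safe is hypothetical, and the code is written in Python 3 syntax. It only hides the symptom; the original characters are lost.

```python
def to_ascii_safe(text):
    """Return text with every non-ASCII character replaced by '?'."""
    # errors="replace" makes the ASCII encoder emit '?' instead of raising
    # UnicodeEncodeError; decoding back gives an ASCII-only string.
    return text.encode("ascii", errors="replace").decode("ascii")


sample = "BTP\ufffdI 0.1% 15MG22"
print(to_ascii_safe(sample))  # BTP?I 0.1% 15MG22
```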

Do you know if it is possible for StarBasic to accept Unicode strings from Python?

Here is the Basic function that receives the string:

Function msDescrizioneMasterchart(v1, v2) as String
    Dim oScriptProvider, oScript
    scriptUrl = "vnd.sun.star.script:masterchart.py$msDescrizioneMasterchart?language=Python&location=user"
    oScriptProvider = ThisComponent.getScriptProvider()
    oScript = oScriptProvider.getScript(scriptUrl)
    Dim out as String
    out = oScript.invoke(array(v1,v2), array(), array())
    msDescrizioneMasterchart = out
End Function

What you describe looks like a bug. Bugs are off-topic on this site and should be filed in the bug tracker.

It’s not clear how to reproduce the problem from your description, so when filing the bug report on the tracker, please provide everything needed to reproduce it on a clean system: the required data files, code, configuration, and specific steps.

There are a number of articles on using Unicode with Python, for example the Unicode HOWTO in the Python 3.9.6 documentation. I do not know if this will help. LibreOffice uses Unicode.

Please also be more specific about the version of LO and of the operating system.

For better help, show the lines of Python code where the problem occurs, along with example data that shows the problem. Also, be sure to post the entire error message. You left out the traceback that tells where the error occurred. See guidelines for asking.

US-ASCII is a 7-bit encoding, so it only covers the first 128 Unicode code points (U+0000 to U+007F). The euro sign (€), a Unicode character outside that range, cannot be represented. U+FFFD (\ufffd) is the Unicode replacement character (�): it is shown in place of a character that could not be decoded.
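
Both points can be checked in a few lines. This is a minimal sketch in Python 3 syntax; the variable names are only illustrative.

```python
# ASCII covers code points 0..127 only, so the euro sign cannot be encoded.
try:
    "\u20ac".encode("ascii")
    euro_fits_ascii = True
except UnicodeEncodeError:
    euro_fits_ascii = False

# Decoding a byte that is invalid in UTF-8 with errors="replace"
# yields U+FFFD, the replacement character.
decoded = b"\xa4".decode("utf-8", errors="replace")

print(euro_fits_ascii)      # False
print(decoded == "\ufffd")  # True
```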

Do you know if it is possible for StarBasic to accept Unicode strings from Python?

Yes, it is. For example, take the following code.

Sub call_msDescrizioneMasterchart
    MsgBox msDescrizioneMasterchart(1, 2)
End Sub

Function msDescrizioneMasterchart(v1, v2) as String
    Dim oScriptProvider, oScript
    scriptUrl = "vnd.sun.star.script:masterchart.py$msDescrizioneMasterchart?language=Python&location=user"
    oScriptProvider = ThisComponent.getScriptProvider()
    oScript = oScriptProvider.getScript(scriptUrl)
    Dim out as String
    out =  oScript.invoke(array(v1,v2), array(), array())
    msDescrizioneMasterchart = out
End Function

def msDescrizioneMasterchart(v1, v2):
    return "%s\ufffd%s" % (v1, v2)

Executing call_msDescrizioneMasterchart produces the correct result:

1 replacement_char 2

The problem with your code seems to occur within Python. The error message is similar to the one in the Stack Overflow question "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)".

Also, U+FFFD is the Unicode replacement character, so it probably shouldn’t be showing up at all. There is likely something wrong with your Python code and maybe the data as well.

Thank you all for the many comments and answers.

I will run some more tests later, starting from jim_K's code.

Here are some more details in reply to your questions:

1] Unfortunately the garbage arrives with the data. Most probably it is a euro symbol
encoded in some way.


Here is the hexdump of the original .csv:

00038a50  32 32 47 4e 32 32 22 2c  32 30 32 32 30 36 32 32  |22GN22",20220622|
00038a60  0d 0a 22 49 54 30 30 30  35 31 38 38 31 32 30 22  |.."IT0005188120"|
00038a70  2c 22 42 54 50 a4 49 20  30 2e 31 25 20 31 35 4d  |,"BTP.I 0.1% 15M|
00038a80  47 32 32 22 2c 22 49 54  30 30 30 35 31 38 38 31  |G22","IT00051881|

Then the hexdump of the .csv converted to JSON, which is what Python gets:

0008d2d0  22 49 54 30 30 30 35 31  38 38 31 32 30 22 2c 22  |"IT0005188120","|
0008d2e0  64 65 73 63 72 69 7a 69  6f 6e 65 22 3a 22 42 54  |descrizione":"BT|
0008d2f0  50 ef bf bd 49 20 30 2e  31 25 20 31 35 4d 47 32  |P...I 0.1% 15MG2|
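
The two dumps are consistent with a single-byte file being read as UTF-8. A minimal sketch, assuming the .csv is ISO-8859-15 (a guess, since in that encoding byte 0xA4 is the euro sign; the thread does not confirm the real encoding):

```python
raw = b"BTP\xa4I 0.1% 15MG22"  # bytes 42 54 50 a4 49 ... from the .csv dump

# Assumption: the file is ISO-8859-15, where byte 0xA4 is the euro sign.
as_latin9 = raw.decode("iso8859_15")

# Decoding the same bytes as UTF-8 fails on 0xA4; with errors="replace"
# the byte becomes U+FFFD, which re-encodes to EF BF BD as in the JSON dump.
as_utf8_lossy = raw.decode("utf-8", errors="replace")
reencoded = as_utf8_lossy.encode("utf-8")

print(ascii(as_latin9))  # 'BTP\u20acI 0.1% 15MG22'
print(reencoded)         # b'BTP\xef\xbf\xbdI 0.1% 15MG22'
```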

2] I am using LibreOffice 6.0.2.1.0 on FreeBSD 11.1, installed from a package

3] It is definitely a Python error:

python> print u'\x50\xbd\x49\x20\x30\x2e'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbd' in position 1: ordinal not in range(128)
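
A likely explanation, not confirmed in this thread, is that the output stream of that session uses ASCII. The sketch below imitates this in Python 3 with an in-memory stream; the helper write_to is hypothetical.

```python
import io

text = "P\xbdI 0."  # U+00BD is outside ASCII


def write_to(encoding):
    """Write text to an in-memory stream that uses the given encoding."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


try:
    write_to("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = write_to("utf-8")
print(ascii_ok)    # False
print(utf8_bytes)  # b'P\xc2\xbdI 0.'
```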

You must first import the CSV using the appropriate encoding. If you start with the wrong encoding, everything that follows will be garbage. Example: there are two almost identical Greek encodings, ISO-8859-7 and Windows-1253, that differ in a few characters. If you use Windows-1253 to import a text that contains Ά and was saved in ISO-8859-7, that character (byte 0xB6) will be converted to U+00B6, the pilcrow sign (¶), instead of Ά.
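
The Greek example can be reproduced with Python's standard codecs. This is a short sketch; the codec names iso8859_7 and cp1253 are Python's names for ISO-8859-7 and Windows-1253.

```python
original = "\u0386"                   # GREEK CAPITAL LETTER ALPHA WITH TONOS
saved = original.encode("iso8859_7")  # file saved as ISO-8859-7
misread = saved.decode("cp1253")      # imported as Windows-1253

print(saved)                # b'\xb6'
print(ascii(misread))       # '\xb6'
print(misread == original)  # False
```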

You probably converted the CSV using the wrong encoding, or (most probably) your CSV already contains text in different encodings for different rows or string values, which means the lines were added as if they shared one encoding but came from sources that use different encodings. Exporting to JSON then produces unpredictable characters. It is not a Python problem.
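
A minimal sketch of a conversion that names the source encoding explicitly. The function csv_bytes_to_json is hypothetical, and ISO-8859-15 is only an assumed encoding for the sample row. Decoding strictly makes a wrong guess fail loudly instead of silently producing U+FFFD.

```python
import csv
import io
import json


def csv_bytes_to_json(raw, encoding):
    """Decode CSV bytes with an explicit encoding and return UTF-8 JSON bytes."""
    # Strict decoding (the default) raises UnicodeDecodeError on invalid
    # bytes instead of inserting U+FFFD.
    text = raw.decode(encoding)
    rows = list(csv.reader(io.StringIO(text, newline="")))
    # ensure_ascii=False keeps characters such as the euro sign as UTF-8.
    return json.dumps(rows, ensure_ascii=False).encode("utf-8")


raw = b'"IT0005188120","BTP\xa4I 0.1% 15MG22"\r\n'
out = csv_bytes_to_json(raw, "iso8859_15")
print(out)  # b'[["IT0005188120", "BTP\xe2\x82\xacI 0.1% 15MG22"]]'
```

Calling csv_bytes_to_json(raw, "utf-8") on the same bytes raises UnicodeDecodeError, which is the signal that the assumed encoding is wrong.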

Edited as erAck correctly commented…

Nitpick: these are not locales, these are text encodings.

I asked the data provider to save the files in UTF-8; that should fix it.