[ Environment: Fedora 38, Linux 6.5.5, LibreOffice 7.5.6.2(X86_64) ]
I have a large ODT document.
I convert to text with the command line: $ libreoffice --headless --convert-to txt docname.odt
I get a new document called docname.txt
The problem is that docname.txt IS NOT ASCII TEXT…it contains Unicode characters for stuff like bullet points!
So…not ASCII TEXT…but mostly ASCII text with some Unicode mixed in.
THIS IS NOT ASCII TEXT!!
Please help me out here…I want ASCII TEXT output!
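(A quick way to confirm the mix is the standard file(1) utility; its exact wording varies by version, but a pure 7-bit file is reported as ASCII text, while this one shows up as UTF-8:)
$ file docname.txt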
Are you sure? IMHO that’s a 7-bit code, and I thought nobody uses it today.
Try googling for character conversion or charconv. There were a lot of command-line utilities for this available 20 years ago…
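For instance, a minimal sketch using iconv, which is still shipped everywhere (the //TRANSLIT suffix asks iconv to approximate unmappable characters with ASCII look-alikes; exactly what comes out depends on your iconv implementation and locale):
$ iconv -f UTF-8 -t ASCII//TRANSLIT docname.txt > docname-ascii.txt
Characters with no transliteration typically come out as “?”; with //IGNORE instead of //TRANSLIT they are silently dropped.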
Dear Wanderer:
The headless LibreOffice conversion produces mixed ASCII and Unicode… I would like the conversion to produce pure ASCII. Which part of my requirement did you not understand?
ASCII is a very reduced subset of Unicode, namely the first 128 code points. Outputting to .txt produces Unicode. If you really want a strict ASCII file, you need to filter out the non-ASCII characters and replace them with some ASCII substitute. For example, you complained about a bullet: do you want it replaced with a lowercase “o” or with a period? In this case, you’ll have to do it yourself.
A tool like awk can do this. Macro generators too.
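As a minimal sketch of that kind of filtering, assuming GNU awk in a UTF-8 locale (the bullet→“o” mapping and the “?” fallback are arbitrary choices you would adjust character by character):
$ gawk '{ gsub(/•/, "o"); gsub(/[^ -~\t]/, "?"); print }' docname.txt > docname-ascii.txt
The second gsub turns every remaining character outside the printable ASCII range (space through tilde, plus tab) into a “?”, so the output is guaranteed to be 7-bit.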
Dear Ajlittoz:
Thanks for the advice. Actually, I’ve already written some code to do exactly what you suggest.
But then again, I wonder why I need to write code! I thought a conversion to TEXT might mean that I didn’t need to write more code! Clearly TEXT means something different from what I had thought it meant! So now TEXT actually means “ASCII_TEXT + UNICODE”. Mea culpa, mea culpa…my bad!!!
Plain text was never the same as ASCII. The latter is just an encoding (one of an infinite number of encodings usable for plain text, but also in a multitude of other contexts).
Yes: in LibreOffice, we have recently switched to UTF-8 by default for TXT export. Earlier, we defaulted to the system encoding.
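(If you need a specific encoding, the “Text (encoded)” filter lets you pass it explicitly on the command line; UTF8 here is one of the documented encoding tokens. As far as I know there is no strict-ASCII token, so post-filtering is still needed for pure ASCII:)
$ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" docname.odt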
Absolutely no intention here of having an argument:
(1) I thought “ASCII” or “TEXT” meant that all characters CHAR were in the value range 0<=CHAR<=127.
(2) Unicode is expressed in two or three or four bytes, some of which are >=128.
So how can Unicode be considered part of the range 0<=CHAR<=127?
What is it that I just don’t understand?
ASCII - yes
TEXT - no.
This is what you misunderstood.
No. You also seem to have the misconception that “Unicode” means “everything beyond ASCII”. Actually, ASCII is just a (small) part of Unicode. So, if you talk about some kinds of Unicode encoding (e.g., UTF-8, or GB 18030), any character is encoded using one to four code units (note the “one”). In other kinds of encoding, any Unicode character (including A-Z) may be represented by e.g. four octets (UTF-32), or by two or four octets (UTF-16).
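(You can see this on your own file with a standard tool like od, assuming a UTF-8 terminal: an ASCII letter is one octet, while a bullet, U+2022, is three:)
$ printf 'A' | od -An -tx1
 41
$ printf '•' | od -An -tx1
 e2 80 a2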
Export as UTF-7 encoded; then all octets are between 0 and 127, and Unicode characters are represented as sequences of multiple octets. You might be a little surprised at how your file will look though…
Read UTF-7 - Wikipedia.
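(For example, with glibc’s iconv, which knows UTF-7; your bullet should come out as the base64-style run “+ICI-”:)
$ iconv -f UTF-8 -t UTF-7 docname.txt > docname-utf7.txt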
Usually we use it the other way round: an ASCII file (and a file in one of several 8-bit character encodings, as long as it only uses the 7-bit range) is also a valid UTF-8 Unicode file. So ASCII is part of Unicode, not Unicode part of ASCII.
I understood what you asked for, but not why.
For over 40 years I have been converting between ISO 8859-n and codepages like IBM-437 or 850, so UTF-8 is quite relaxing for me. And living in Europe, ASCII was never a valid option for Germany, France, Sweden etc. So I don’t know of any software which still needs ASCII today…
And Text=ASCII would imply that Hebrew, Cyrillic or Japanese letters don’t form text??