[ Environment: Fedora 38, Linux 6.5.5, LibreOffice 7.5.6.2(X86_64) ]
I have a large ODT document.
I convert to text with the command line: $ libreoffice --headless --convert-to txt docname.odt
I get a new document called docname.txt
The problem is that docname.txt IS NOT ASCII TEXT…it contains Unicode characters for stuff like bullet points!
So…not ASCII TEXT…but mostly ASCII text with some Unicode mixed in.
THIS IS NOT ASCII TEXT!!
Please help me out here…I want ASCII TEXT output!
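(A quick way to confirm the mix is the standard file(1) utility; its exact wording varies by version, but a pure 7-bit file is reported as ASCII text, while this one shows up as UTF-8:)
$ file docname.txt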
Are you sure? IMHO that’s a 7-bit code, and I thought nobody uses it today.
Try googling for character conversion or charconv. There were a lot of command-line utilities for this available 20 years ago…
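For instance, a minimal sketch using iconv, which is still shipped everywhere (the //TRANSLIT suffix asks iconv to approximate unmappable characters with ASCII look-alikes; exactly what comes out depends on your iconv implementation and locale):
$ iconv -f UTF-8 -t ASCII//TRANSLIT docname.txt > docname-ascii.txt
Characters with no transliteration typically come out as “?”; with //IGNORE instead of //TRANSLIT they are silently dropped.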
Dear Wanderer:
The headless LibreOffice conversion produces mixed ASCII and Unicode… I would like the conversion to produce pure ASCII. Which part of my requirement did you not understand?
ASCII is a very reduced subset of Unicode, namely the first 128 code points. Outputting to .txt produces Unicode. If you really want a strict ASCII file, you need to filter out the non-ASCII characters and replace them with some ASCII substitute. For example, you complained about a bullet: do you want it replaced with a lowercase “o” or with a period? In this case, you’ll have to do it yourself.
A tool like awk can do this. Macro generators too.
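As a minimal sketch of that kind of filtering, assuming GNU awk in a UTF-8 locale (the bullet→“o” mapping and the “?” fallback are arbitrary choices you would adjust character by character):
$ gawk '{ gsub(/•/, "o"); gsub(/[^ -~\t]/, "?"); print }' docname.txt > docname-ascii.txt
The second gsub turns every remaining character outside the printable ASCII range (space through tilde, plus tab) into a “?”, so the output is guaranteed to be 7-bit.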
Dear Ajlittoz:
Thanks for the advice. Actually, I’ve already written some code to do exactly what you suggest.
But then again, I wonder why I need to write code! I thought a conversion to TEXT might mean that I didn’t need to write more code! Clearly TEXT means something different from what I had thought it meant! So now TEXT actually means “ASCII_TEXT + UNICODE”. Mea culpa, mea culpa…my bad!!!
Plain text was never the same as ASCII. The latter is just an encoding (one of an infinite number of encodings usable for plain text, but also in a multitude of other contexts).
Yes: in LibreOffice, we have recently switched to UTF-8 by default for TXT export. Earlier, we defaulted to the system encoding.
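(If you need a specific encoding, the “Text (encoded)” filter lets you pass it explicitly on the command line; UTF8 here is one of the documented encoding tokens. As far as I know there is no strict-ASCII token, so post-filtering is still needed for pure ASCII:)
$ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" docname.odt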
Absolutely no intention here of having an argument:
(1) I thought “ASCII” or “TEXT” meant that all characters CHAR were in the value range 0<=CHAR<=127.
(2) Unicode is expressed in two or three or four bytes, some of which are >=128.
So how can Unicode be considered part of the range 0<=CHAR<=127?
What is it that I just don’t understand?
ASCII - yes
TEXT - no.
This is what you misunderstood.
No. You also seem to have the misconception that “Unicode” means “everything beyond ASCII”. Actually, ASCII is just a (small) part of Unicode. So, if you talk about some kinds of Unicode encoding (e.g., UTF-8, or GB 18030), any character is encoded using one to four code units (note the “one”). In other kinds of encoding, any Unicode character (including A-Z) may be represented by e.g. four octets (UTF-32), or by two or four octets (UTF-16).
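(You can see this on your own file with a standard tool like od, assuming a UTF-8 terminal: an ASCII letter is one octet, while a bullet, U+2022, is three:)
$ printf 'A' | od -An -tx1
 41
$ printf '•' | od -An -tx1
 e2 80 a2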
Export as UTF-7 encoded; then all octets are between 0 and 127, and Unicode characters are represented as sequences of multiple octets. You might be a little surprised at how your file will look though…
Read UTF-7 - Wikipedia.
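(For example, with glibc’s iconv, which knows UTF-7; your bullet should come out as the base64-style run “+ICI-”:)
$ iconv -f UTF-8 -t UTF-7 docname.txt > docname-utf7.txt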
Usually we use it the other way round: an ASCII file (and a file in one of several 8-bit character encodings, as long as it only uses the 7-bit range) is also a valid UTF-8 Unicode file. So ASCII is part of Unicode, not Unicode part of ASCII.
I understood what you asked for, but not why.
For over 40 years I have been converting between ISO 8859-n and codepages like IBM-437 or 850, so UTF-8 is quite relaxing for me. And living in Europe, ASCII was never a valid option for Germany, France, Sweden etc. So I don’t know of any software which still needs ASCII today…
And Text=ASCII would imply that Hebrew, Cyrillic or Japanese letters don’t form text??