Copying Malayalam text from a PDF file into LibreOffice Writer shows different characters

Deeps · October 17, 2019, 12:36am

I tried to copy some Malayalam text from a PDF file and paste it into Writer. The paste works, but the text comes out in different characters.

The Malayalam text changed to the following:

\mcpthcv
Cu ] ́nIbn¬ \n∂pw \ofØn¬ Iodm≥ Ignbp∂ CeIfpsS t]scgpXpI

Please give me a way to solve this.
I was unable to attach the PDF document, so I took a screenshot and am attaching that instead.

Thanks

gabix · October 17, 2019, 7:37am

Perhaps the file uses a non-Unicode font or a non-standard encoding (you did not share a sample file, so it is hard to tell). In that case, you probably can’t do anything except feed the file to an OCR program capable of recognizing Malayalam text.

LibreTraining · October 17, 2019, 2:20pm

It depends on how the original app that created the PDF encoded the glyphs/characters.
Many applications will insert the correct glyph(s), but the Unicode codes behind those glyphs are missing or wrong.
When you cut and paste the text, the underlying codes are all that the target app sees.
So it appears your text is not correctly coded as Unicode.

This can be confirmed with a sample PDF from the same source.

It also depends on whether the font is OpenType.
It also depends on whether the font is embedded in the PDF.

UPDATE:
I took a look at the Noto Sans Malayalam font to get an idea of how the script is constructed.
It is highly unlikely you will get cut-and-paste from a PDF to work.
There are a large number of ligatures which are unmapped (no Unicode code point).
Depending on the font used, whether the font is embedded, and the capabilities of the PDF writing app, you may have success, but it is unlikely.

If the PDF does not have the correct mapping and codes, the language code in LibreOffice is irrelevant.

petermau · October 17, 2019, 4:32pm

You do not say, but I am assuming that the Latin characters copy across as normal and that only the Malayalam characters are giving the problem.

If you are cutting and pasting from, say, a Windows system, the character set will default to the setting of your system, which for English-GB is an 8-bit code page (Windows-1252, close to ISO-8859-1). It must be set to allow Malayalam. If not, each Unicode Malayalam character will appear as three 8-bit characters. In LibreOffice you must also set the default language under Language Settings > Languages to support Complex Text Layout.

Sorry, I do not have a Windows system to find out how to control the settings. My system just defaults to Unicode.
I hope this may provide some additional explanation.

At first glance, Unicode can seem confusing because it comes in several different forms.
Unicode is a standard with a 21-bit code space that supports 1,114,112 code points. Currently about 138,000 are used, which leaves a little bit for future expansion! It can be encoded in three forms: UTF-32 using one 32-bit code unit, UTF-16 using one or two 16-bit code units, or UTF-8, which is the normal LibO default, cleverly using one, two, three or four 8-bit code units.
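
To make the three encoding forms concrete, here is a minimal Python 3 sketch that counts how many bytes one character takes in each. The helper name `encoded_lengths` is made up for this illustration; the `-le` codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_lengths(text):
    """Return the number of bytes `text` occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
        "utf-32": len(text.encode("utf-32-le")),  # little-endian, no BOM
    }


if __name__ == "__main__":
    samples = {
        "Latin A": "A",
        "Malayalam MA (U+0D2E)": "\u0d2e",
        "Euro sign (U+20AC)": "\u20ac",
    }
    for label, ch in samples.items():
        print(label, encoded_lengths(ch))
```

A Malayalam letter takes three bytes in UTF-8 and two in UTF-16, while a plain Latin letter takes one byte in UTF-8.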

Unicode encloses several earlier character sets as nested ranges.

US-ASCII, 7 bits, is contained in the first 128 characters (x’00’ to x’7F’) (1968 standard)

ISO-8859-1, 8 bits, is the first 256 characters (x’00’ to x’FF’); it adds Western Latin including French and German accents, the £ sign etc. (1987 standard)

The Basic Multilingual Plane has 65,536 code points and covers most of the world's scripts in modern use, including Malayalam, the common Chinese and Japanese characters, and the € (EURO) sign. Rarer Chinese and Japanese characters live outside it.

UTF-8 is the most compact form where Latin is the basic character set in use, but it is less efficient than UTF-16 for scripts such as Malayalam, where each character takes three bytes instead of two.
If you display characters such as Malayalam as Latin, each character will appear as two, three or four seemingly random characters, as in the example above.
One quick test to see if this could be the problem is to use the € (EURO) character; this kind of mismatch is why a large number of users cannot display the € sign and type EURO instead.
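
The "several random characters per letter" pattern can be reproduced and tested for in Python 3. This is only a sketch: the function name `undo_latin1_mojibake` is invented for this example, and it assumes the garbling came from UTF-8 bytes being read as ISO-8859-1, which may not be what happened to any particular PDF.

```python
def undo_latin1_mojibake(text):
    """Try to reverse 'UTF-8 bytes shown as ISO-8859-1'.

    Returns the repaired string, or None when the text does not fit
    that pattern (so some other cause must be at work).
    """
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return repaired if repaired != text else None


if __name__ == "__main__":
    original = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"  # the word "Malayalam" in Malayalam script
    garbled = original.encode("utf-8").decode("latin-1")
    print(len(original), len(garbled))   # each code point becomes three characters
    print(undo_latin1_mojibake(garbled) == original)
    # A line from the question: the check reports that this pattern does not apply.
    print(undo_latin1_mojibake("Iodm\u2265 Ignbp\u2202"))
```

If the function returns None for the pasted text, the garbling was not a simple UTF-8 versus Latin-1 mix-up.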

LibreTraining · October 18, 2019, 7:54pm

You are missing the point.
He is not copying a bunch of Unicode characters.
He is copying glyphs/characters which may or may not have a Unicode code point mapped to them.
The PDF should have a ToUnicode table in it which maps the glyphs/characters to the proper underlying Unicode code points.
Many Malayalam “characters” are actually ligatures made up of four or more different glyphs.
Some of those glyphs may have a Unicode code point, many/most will not.
This depends on how the font is constructed.
When you highlight text in a PDF to copy it, you are relying on the ToUnicode table to get correct info.
That info simply may not exist.
How the writing application creates the PDF greatly affects this.
If the writing application is very sophisticated, and it embeds the font parts needed, it MAY work.
This process is very difficult even when importing the PDF into a high-end application.
A cut-and-paste into a word processor is not likely to work.
It’s not language settings.
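
One rough way to check what the paste actually delivered is to count how many of the pasted characters fall in the Unicode Malayalam block (U+0D00 to U+0D7F). The Python 3 sketch below does this; the function name `malayalam_ratio` is made up for the illustration, and a ratio of zero is only a hint that the PDF handed over no Malayalam code points, not proof of a missing ToUnicode table.

```python
def malayalam_ratio(text):
    """Fraction of non-space characters that lie in the Malayalam block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0d00" <= c <= "\u0d7f")
    return hits / len(chars)


if __name__ == "__main__":
    pasted = r"\mcpthcv"  # first pasted line from the question
    print(malayalam_ratio(pasted))
    print(malayalam_ratio("\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"))
```

For the pasted line in the question the ratio is zero: the clipboard received Latin letters and symbols, so no language setting in Writer can turn them back into Malayalam.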

jimk · October 17, 2019, 4:45pm

Although we can’t be certain without seeing the original file, it sounds like the PDF may not have been created using a Unicode font. In that case, either obtain the original legacy font, or else convert the text to Unicode.

There is encoding conversion software such as TECkit that allows you to write mapping rules to convert legacy encodings to Unicode. You might also get lucky - there are Malayalam maps already available for TECkit, so one of them may work. The difficult part is to determine which font encoding your PDF uses, as there are many different legacy encodings.
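
To show the general idea of a legacy-to-Unicode map, here is a small Python 3 sketch. It is not TECkit and does not use TECkit's map syntax. The table `HYPOTHETICAL_MAP` is invented purely for illustration and does not match any real Malayalam font; a real conversion needs the actual map for the font in question, and usually also reordering rules for vowel signs, which this sketch leaves out.

```python
# Made-up legacy codes -> Unicode strings. NOT a real font encoding.
HYPOTHETICAL_MAP = {
    "x": "\u0d15",               # pretend "x" stands for MALAYALAM LETTER KA
    "y": "\u0d2e",               # pretend "y" stands for MALAYALAM LETTER MA
    "xz": "\u0d15\u0d4d\u0d15",  # pretend "xz" is the conjunct KA + VIRAMA + KA
}


def convert_legacy(text, mapping):
    """Replace legacy codes with Unicode, preferring the longest matching code."""
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No rule for this character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert_legacy("xzy", HYPOTHETICAL_MAP))
```

Longest match first matters because legacy fonts often use a sequence of codes for one conjunct, and a shorter rule would otherwise consume part of it.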

Also, I heard the following solution from a friend but have not tried it: Put the PDF into Google Drive, then right-click and open with Google Docs which will scan and convert it into editable text. This seems similar to OCR, which means you do not need a conversion map but some details are likely to be incorrect, so check the result and fix any problems.

Deeps · October 23, 2019, 1:07am

I was unable to open it with Google Docs.