Arabic letters get reversed

typeo · December 31, 2021, 7:30pm

Hi,
When I try to convert my PDF to DOCX, Arabic letters get reversed.
For example, سلام becomes ملاس.
What should I do?
I use this command:

libreoffice --infilter="writer_pdf_import" --convert-to docx test.pdf

Can we specify a Unicode encoding or something like that?

Also, I’m using version 7.2.3.2.

Thanks.

Hrbrgr · January 1, 2022, 8:21am

Asian Typography

Asian Layout

Languages

Can that help?

typeo · January 2, 2022, 5:41pm

Thanks @Hrbrgr,
Can I change these properties (mentioned in the given links) from the CLI?
I need the whole process to be done from the CLI.
Thanks.

Hrbrgr · January 2, 2022, 5:48pm

No idea, maybe someone else will answer, sorry - or →

have a look at: Starting LibreOffice Software With Parameters

typeo · January 4, 2022, 2:00pm

Thanks for your time and consideration.
I didn’t find anything that solves my problem.
Maybe I need to use MS Word.

KamilLanda · January 4, 2022, 2:25pm

Can you upload an example? A short PDF or ODT, or whatever your source file is.

typeo · January 4, 2022, 2:59pm

Yes, this is the file:
https://m.typeo.top/persian.pdf
and the converted file is:
https://m.typeo.top/persian.docx

The command used:

libreoffice --infilter="writer_pdf_import" --convert-to docx persian.pdf

I tested different files and the result is the same.
Thanks

KamilLanda · January 4, 2022, 6:04pm

It is a maniacal PDF. At first glance it looks OK in Foxit Reader.

I’m not sure, but I suppose LamAlif is the same in Persian as in Arabic. I don’t know whether Calibri and Times New Roman, which are used in the PDF, are reliable fonts for Arabic/Persian. And I don’t have (and don’t know) the Mitra font, which is also used in the PDF.

I tried to save the PDF as TXT in Foxit Reader. It saved only the first page. I tried to save it as TXT in SumatraPDF. It saved all pages, but with some characters reversed.

I opened the PDF in Draw. Debacle. The texts overlap.

I opened your DOCX in Writer. Terror of the user! Text boxes (boxes with a red triangle) and frames that overlap.

I did Ctrl+A, Ctrl+C in Foxit Reader and Ctrl+V as unformatted text into Writer. It joined the text from the first page with the second page.

My conclusion is that you have two unpleasant choices if you want to use Writer:

Copy the text from the PDF viewer manually (maybe paragraph by paragraph) and paste it into Writer as unformatted text. Then make the corrections manually.
Or try to open it in Word (you wrote that it works) and save it in some simple format (ideally plain TXT). Then try to open the TXT. Or press Ctrl+C in Word and do an unformatted paste into Writer.

Apparently Word put some baffling bordello into the PDF, and unfortunately I don’t see a chance of converting this PDF in any normal way.

typeo · January 4, 2022, 6:27pm

Thank you very much.
It has nothing to do with fonts.
I tried different fonts, but it still can’t convert Arabic and Persian words.
By the way, it converts English words inside the same document with no problem.

ajlittoz · January 4, 2022, 6:45pm

@KamilLanda: in Draw, PDF text overflowing into the margins (and the red triangle) are symptoms that the original font is not installed. A substitution font was used, but it doesn’t have the same metrics as the original, so character sequences don’t have the same width. Notice also that the glyphs don’t have the same height (again because of different metrics) and get clipped by the bounding rectangle of the text box.

ajlittoz · January 4, 2022, 3:54pm

Your .docx file is not a “text document”. It only contains (editable) empty paragraphs, which are necessary to create the pages, and (non-managed) graphical objects. By “non-managed” I mean they don’t form a text flow and can’t be styled.

You would have got practically the same result if you had opened your PDF directly in Draw.

Remember that PDF is a page layout format. In other words, it describes how shapes are positioned on a page. In this context, letters are nothing but shapes (glyphs). Therefore, there is no structure between letters, just various shapes put onto the paper. And in the file, there is no obligation to list them in reading order, which makes it even more difficult to rebuild the text flow.

As I understand it, the “conversion” process, considering the PDF properties, just scanned the file graphically, building graphical boxes into which it put the letters as it met them, without any consideration for the script. In the end, the letters are accumulated in the wrong order. Later, when Writer renders the text boxes, it notices that the letters belong to the Arabic script and displays them in the “right” order, but the words are already botched.
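
The effect described above can be illustrated with a toy model. This is only a sketch under two assumptions that are not confirmed anywhere in this thread: that the importer collects glyphs in their visual (left-to-right) position on the page, and that lam followed by alef is stored as a single ligature glyph. It is not LibreOffice's actual import code, and the names split_into_glyphs and visual_order are made up for the illustration.

```python
# Toy model of an importer that stores glyphs in visual (left-to-right)
# order instead of logical (reading) order. Assumption, not LibreOffice code.

LAM_ALEF = "\u0644\u0627"  # lam + alef, assumed to be drawn as one ligature glyph


def split_into_glyphs(word):
    """Split a word into glyph-sized units, keeping lam-alef together."""
    glyphs = []
    i = 0
    while i < len(word):
        if word.startswith(LAM_ALEF, i):
            glyphs.append(LAM_ALEF)
            i += 2
        else:
            glyphs.append(word[i])
            i += 1
    return glyphs


def visual_order(word):
    """Return the glyphs in the order they sit on the page, left to right."""
    return "".join(reversed(split_into_glyphs(word)))


salam = "\u0633\u0644\u0627\u0645"  # the word from the first post, in logical order
print(visual_order(salam))
```

With the ligature kept as one unit, the model gives the same reversed word as reported in the first post, which is consistent with (but does not prove) the explanation that the order is lost at glyph level.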

These text boxes are not frames (which could be controlled and formatted with styles) but simple graphical objects which can’t practically be edited/formatted.

To make things worse, spacing in the PDF is also converted into empty drawing objects, as exemplified in the first half of the first page by Shape1 and Shape3 to Shape6, with only Shape2 and Shape7 containing meaningful data. It continues like this beyond the first half.

From personal experience, the only reliable conversion from PDF to .odt is manual. You open your original PDF in a PDF viewer and, alongside it, a blank Writer document. In “text selection mode” (otherwise you’ll copy an image), you copy a block of text from the PDF viewer and paste it into Writer as unformatted text.

You’ll get as many paragraphs as there are lines. You must then reconstitute the logical paragraphs by deleting the extra paragraph marks at the end of each line, keeping only the final mark at the end of each paragraph.
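
If the pasted text is first saved as plain text, the rejoining step could be partly automated. The sketch below assumes that a blank line marks a real paragraph break, which may not hold for every PDF viewer's copy output; rejoin_paragraphs is a hypothetical helper name, not part of any tool mentioned here.

```python
def rejoin_paragraphs(raw):
    """Merge consecutive non-empty lines into one paragraph each.

    Assumption: a blank line in the pasted text marks a real paragraph break.
    """
    paragraphs = []
    current = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped:
            # Still inside the same paragraph: collect the line.
            current.append(stripped)
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


pasted = "first line of a paragraph\nsecond line\n\nnext paragraph\n"
for paragraph in rejoin_paragraphs(pasted):
    print(paragraph)
```

When the pasted text has no blank lines at all, everything ends up in one paragraph, so the result still needs a manual check.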

You must also restyle the text.

If you don’t need to edit the text, keep it as PDF to avoid all this trouble.

typeo · January 4, 2022, 4:11pm

Thanks for your accurate answer.
This PDF file was made by MS Word.
I can open it easily in MS Word, and it converts it with no problem.
I thought maybe I could do the same thing with LibreOffice.

ajlittoz · January 4, 2022, 4:44pm

Perhaps Word added tags or other markup when converting to PDF so that the process can be reversed. But these possible additions are private to Word and Draw/Writer has no clue about their existence and meaning.

anon87010807 · January 4, 2022, 6:00pm

I found that copying all text from that PDF and pasting it as unformatted text into a blank Writer document did the job perfectly, as far as I could see. So all you need is a script that opens the PDF, copies its content, pastes it into a blank Writer document as unformatted text and finally saves it as a Writer file.
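
A rough sketch of such a script follows. It swaps the copy step for Poppler's pdftotext command-line tool, which is an assumption on my part and not what was tested above, and then lets LibreOffice convert the plain text to .odt in headless mode. It has not been tried on persian.pdf, the extracted text may or may not come out in the right order, and LibreOffice may need an explicit text import filter to read the .txt as UTF-8. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir="."):
    """Build the two command lines: text extraction, then conversion to .odt."""
    pdf = Path(pdf_path)
    txt = Path(out_dir) / (pdf.stem + ".txt")
    # Assumption: Poppler's pdftotext is installed and on PATH.
    extract = ["pdftotext", "-enc", "UTF-8", str(pdf), str(txt)]
    # Assumption: the LibreOffice launcher is available as "soffice".
    convert = ["soffice", "--headless", "--convert-to", "odt",
               "--outdir", str(out_dir), str(txt)]
    return extract, convert


def pdf_to_odt(pdf_path, out_dir="."):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in build_commands(pdf_path, out_dir):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call pdf_to_odt("persian.pdf") to run them.
    for command in build_commands("persian.pdf"):
        print(" ".join(command))
```

The result is plain, unstyled text, so the restyling and paragraph repair mentioned earlier in the thread would still be needed.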