PDF to Draw to Writer

I have read the following links:

43733/pdf-import

45385/how-to-convert-a-pdf-file-to-plain-text

30279/how-do-i-convert-from-pdf-or-draw-to-writer-instead-of-draw

and various other documents. I find that there is a missing link between a Draw (.odg) file and a Writer file.

Can this missing link be joined, please?

A LibreOffice Draw file is extremely resource-hungry and nearly brings the system to a standstill. It can’t be used for regular editing of even a normal PDF file.

Version: 4.0.3.3 (Build ID: 400m0(Build:3)) in Knoppix 7.2.0

Later edit and addition on 13th Feb 2021:
Because of your collective efforts, especially Mr. Lupp’s, editing PDF files in LibreOffice Draw is now very easy. The only problem is that the PDF format places each word or line individually at coordinates relative to the page, not as a text box spanning an entire paragraph. However, this is not a particularly nagging problem and can be worked around.
Thank you all! So the users’ inputs also matter in improving a package, it appears :slight_smile:

See How to import pdf in Libreoffice 7.0 (at LibreOffice 6.4).

===Edit2 2022-02-16===
Starting with V7, a somewhat unclean way of using a variable in the code contained in the example attached to “Edit1” is no longer supported. Though I don’t use the code myself, I stumbled over the issue by accident and fixed it. Since my answer here was upvoted more than once, I assume there are users. They should therefore use only the code contained in the new attachment. It should also work in older versions of LibO (tested down to V 4.4.7).
drawTexts2WriterV7.odg (17.1 KB)
===Edit2 End===
===Edit1 2018-09-11===
I came back to this old thread accidentally. As I had meanwhile analysed the task again out of interest, and written and arranged some code for it, I now attach this demo containing that code. The focus was moved from the pdf files to the ‘Draw’ document, which may or may not have been created by opening a pdf. In fact the demo shows some graphic shapes created in Draw and containing text, some of them grouped…
The main enhancement is that the DrawingDocument, its DrawPages, the ShapeCollection(s) and the Shape(s) contained therein are now resolved recursively. The contained text is first collected in an array together with information about the original position of the particular shape, and then sorted based on that information before being output to a new TextDocument. Thus the order of the text is no longer defined by the logical order of the shapes, but basically by their visual position. Doing this even more precisely would require a few corrections to the position coordinates based on the shapes’ properties.
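The recursive resolution of groups described here could look roughly like the following. This is a minimal sketch and not the code from the attachment: `collectShapeTexts` is a hypothetical name, and the sketch assumes that group shapes report the `com.sun.star.drawing.GroupShape` service and that text-bearing shapes report `com.sun.star.drawing.Text`.

```basic
REM Sketch only: collectShapeTexts is a hypothetical name, not part of the attachment.
REM 'container' may be a DrawPage or a GroupShape; both give indexed access to shapes.
Function collectShapeTexts(container As Object) As String
	Dim i As Long, shape As Object, part As String, result As String
	result = ""
	For i = 0 To container.getCount() - 1
		shape = container.getByIndex(i)
		If shape.supportsService("com.sun.star.drawing.GroupShape") Then
			REM Descend into the group before continuing with the next sibling.
			result = result & collectShapeTexts(shape)
		ElseIf shape.supportsService("com.sun.star.drawing.Text") Then
			part = shape.getString()
			REM Shapes without any text are skipped; each text ends with a line feed.
			If Len(part) > 0 Then result = result & part & Chr(10)
		End If
	Next i
	collectShapeTexts = result
End Function
```

Calling `MsgBox collectShapeTexts(ThisComponent.DrawPages(0))` from a Draw document should then show the texts of the first page in logical (not visual) order.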
You may play with the example rearranging the shapes on the second slide and compare the results.
The included sorting algorithm was written from scratch. It isn’t optimised for the sorting itself but for reducing the number of transpositions needed. (Nor is it optimised for the specific case where no transpositions at all are needed.)
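To illustrate ordering by visual position, here is a rough sketch that is not the algorithm from the attachment. It swaps in a plain insertion sort, handles only the top-level shapes of one page (no groups), and compares the raw `Position` values without any of the corrections mentioned above. The names `sortedPageTexts` and `comesBefore` are hypothetical.

```basic
REM Sketch only: hypothetical names; plain insertion sort, top-level shapes only.
Function comesBefore(y1 As Long, x1 As Long, y2 As Long, x2 As Long) As Boolean
	REM Higher on the page wins; at equal height, further left wins.
	If y1 <> y2 Then
		comesBefore = (y1 < y2)
	Else
		comesBefore = (x1 < x2)
	End If
End Function

Function sortedPageTexts(page As Object) As String
	Dim n As Long, cnt As Long, i As Long, j As Long
	Dim shape As Object, result As String
	Dim tmpT As String, tmpY As Long, tmpX As Long
	Dim texts() As String, ys() As Long, xs() As Long
	result = ""
	n = page.getCount()
	If n > 0 Then
		ReDim texts(n - 1) : ReDim ys(n - 1) : ReDim xs(n - 1)
		REM Pass 1: collect text and top-left position of every text-bearing shape.
		cnt = 0
		For i = 0 To n - 1
			shape = page.getByIndex(i)
			If shape.supportsService("com.sun.star.drawing.Text") Then
				texts(cnt) = shape.getString()
				ys(cnt) = shape.Position.Y
				xs(cnt) = shape.Position.X
				cnt = cnt + 1
			End If
		Next i
		REM Pass 2: stable insertion sort on (Y, X).
		For i = 1 To cnt - 1
			tmpT = texts(i) : tmpY = ys(i) : tmpX = xs(i)
			j = i - 1
			Do While j >= 0
				If Not comesBefore(tmpY, tmpX, ys(j), xs(j)) Then Exit Do
				texts(j + 1) = texts(j) : ys(j + 1) = ys(j) : xs(j + 1) = xs(j)
				j = j - 1
			Loop
			texts(j + 1) = tmpT : ys(j + 1) = tmpY : xs(j + 1) = tmpX
		Next i
		For i = 0 To cnt - 1
			result = result & texts(i) & Chr(10)
		Next i
	End If
	sortedPageTexts = result
End Function
```

Comparing Y values exactly is a simplification. For an imported PDF, words on the same visual line may differ slightly in height, so a small tolerance on Y would probably be needed in practice.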
As always interested in criticism and suggestions.
(I personally never made much use of the procedure.)
The code posted below is from my original answer. It was not changed during editing.
===Edit1 End===

Even when opening a pdf the way @mikekaganski pointed to, it will be “unmanageable”, as he also said. In the very rare cases where you urgently need to import the text content of a pdf without any formatting, you may open it in LibO Draw and then apply a “macro” collecting the texts.

Since similar requests recurred now and then, I once sketched a very raw piece of code for the purpose in BASIC.
You may use it and enhance it as needed at your own risk.

REM  *****  BASIC  *****
REM Wolfgang Jäger (Lupp); 2016-09-05; Copyleft 0 
Option Explicit

REM This procedure was sketched because questions about moving the textual
REM content of pdf files opened in 'Draw' into an actual text file come up
REM now and then, and no solution had been offered yet, as far as I know.
REM 
REM Of course, this provisional code cannot replace a thorough solution
REM to the problem (if actually needed at all).
REM Specifically, NO ATTEMPT IS MADE TO RESOLVE GROUPS or to process
REM the 'Draw' objects regarding their position. The sequencing of texts
REM follows the logical order of the objects.
REM For a PDF automatically imported by 'Draw' this should work.

Sub experimentalExportTextFromDrawToWriterDoc(optional pNum as Long)
	Dim doc0 As Object, page As Object, shape As Object, shapeText As String
	Dim doc1 As Object, tText As Object,vCur  As Object, tCur      As Object
	Dim i As Long, j As Long, k As Long, m As Long, n As Long, low As Long, high As Long
	Dim location As String, newLocation As String, alert As String
	Dim unresolvedSignal As String
	unresolvedSignal = "%&@~+!!\µ~*?§" REM Arbitrary string not expected to occur anywhere else in the universe!

doc0 = ThisComponent
If NOT doc0.SupportsService("com.sun.star.drawing.DrawingDocument") Then Exit Sub
If IsMissing(pNum) Then pNum = 0
m = doc0.DrawPages().Count()

If (m<pNum) OR (pNum<0) Then
	MsgBox "No page " & pNum & " available!"
	Exit Sub
End If

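location = doc0.GetLocation
REM An unsaved document has no location, so no output file name can be derived.
If location = "" Then
	MsgBox "Please save the Draw document first. Its location is needed to name the output file."
	Exit Sub
End If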
newLocation = location+".odt"

If FileExists(newLocation) Then 
	alert = "Warning! The destination file "+Chr(13)+ newLocation+Chr(13)+ _
	"already exists. Please delete or rename it before calling this procedure again!"
	MsgBox alert
	Exit Sub
End If

doc1 = StarDesktop.LoadComponentFromUrl("private:factory/swriter", "_blank", 0, Array())
doc1.GetCurrentController().GetFrame().GetContainerWindow().SetVisible(False)
doc1.StoreAsUrl(newLocation,Array())

tText = doc1.getText()
vCur = doc1.CurrentController.getViewCursor()
tCur = tText.createTextCursorByRange(vCur.GetEnd())

low = 0 : If pNum > 0 Then low = pNum-1
If pNum=0 Then 
	high = m - 1
Else 
	high = pNum - 1
End If

For i = low To high

	tText.insertString(tCur, "------PAGE " & (i+1) & "------", False)
	tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
	tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)

	k = 0
	page = doc0.DrawPages(i)
	n = page.Count()
	For j = 0 to n - 1
		shape = page.GetByIndex(j)
		shapeText = unresolvedSignal
		On Error Resume Next
		shapeText = shape.Text.String
		On Error Goto 0
		If shapeText = unresolvedSignal Then
			k = k + 1
		Else
			tText.insertString(tCur, shapeText, False)
			tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
		End If
	Next j

	tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
	tText.insertString(tCur, "There were " & k & " unresolved objects on page " & (i+1) & ".", False)
	tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
	tText.insertControlCharacter(tCur, com.sun.star.text.ControlCharacter.PARAGRAPH_BREAK, False)
Next i
doc1.Store
doc1.Close(True)
End Sub

lupp, Sir! You are just marvellous! Other users also need to know about this code as well!
I will try it ASAP and get back to you. To enhance the code I would first need the expertise to understand it! I may not manage, but I will definitely try!
Warm regards.

@lupp would it be possible to convert this function into some kind of extension “Join selected text frames into one paragraph” for semi-automatic conversion? That would be even more useful, IMHO.

Also, https://bugs.documentfoundation.org/show_bug.cgi?id=118370

Quoting @mcepl: “…would it be possible to convert this function into some kind of extension “Join selected text frames into one paragraph” for semi-automatic conversion?”
I doubt that offering the one-paragraph thing as the only option would be a good idea, but the answer is yes. Any code running in LibO based on its API should allow conversion to an extension. Simply do it.
I personally am not an author of extensions.
See https://wiki.documentfoundation.org/Development/Extension_Development

Anyway: The code I presented in my above answer, and also the code contained in the attachment to the part inserted there later, does not handle text content of the TextFrame type in Writer. It works on shapes.
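As a starting point for the semi-automatic variant asked about above, the following sketch joins the texts of the shapes currently selected in Draw into one string. It is only an illustration under assumptions: `joinSelectedShapeTexts` is a hypothetical name, the selection is assumed to arrive as an indexed collection of shapes, and the order in which the selected shapes are returned is not guaranteed to match the reading order.

```basic
REM Sketch only: joinSelectedShapeTexts is a hypothetical name.
REM Run it from a Draw document with some text shapes selected.
Function joinSelectedShapeTexts() As String
	Dim sel As Variant, shape As Object
	Dim i As Long, part As String, result As String
	result = ""
	sel = ThisComponent.getCurrentSelection()
	REM Nothing selected: return the empty string.
	If IsNull(sel) Or IsEmpty(sel) Then
		joinSelectedShapeTexts = result
		Exit Function
	End If
	REM Assumption: selected shapes arrive as an indexed collection.
	If Not HasUnoInterfaces(sel, "com.sun.star.container.XIndexAccess") Then
		joinSelectedShapeTexts = result
		Exit Function
	End If
	For i = 0 To sel.getCount() - 1
		shape = sel.getByIndex(i)
		If shape.supportsService("com.sun.star.drawing.Text") Then
			part = Trim(shape.getString())
			If Len(part) > 0 Then
				REM Separate the pieces by a single space to form one paragraph.
				If Len(result) > 0 Then result = result & " "
				result = result & part
			End If
		End If
	Next i
	joinSelectedShapeTexts = result
End Function
```

`MsgBox joinSelectedShapeTexts()` can be used to check the result before doing anything with it.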

Alert
Concerning the code supplied with the example in ===Edit1 2018-09-11===

I didn’t use the code for a long time now. (It wasn’t made for myself anyway.)
In pursuit of a different task I came back and tried it again, with this annoying result:
The code worked for me in LibreOffice up to 6.4.5.2, but no longer in V7.0.1 or higher.

While I do agree with @JohnHa that Writer isn’t good for editing PDFs, I do have some positive experience with Draw and simple PDFs.

If, however, you need to use Writer to open PDFs, then you don’t need an intermediate step (Draw). Instead, use LibreOffice’s File > Open... dialog and select PDF - Portable Document Format (Writer) (*.pdf) in the file type drop-down list. This will import the PDF into Writer right away… and you will surely see that it’s unmanageable as well…
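The same import can presumably be triggered from a macro by passing the filter explicitly. A minimal sketch, assuming the internal name of that filter is `writer_pdf_import` (please verify this against your installation); `openPdfInWriter` is a hypothetical name.

```basic
REM Sketch only: openPdfInWriter is a hypothetical name.
REM Assumption: the Writer PDF import filter is registered as "writer_pdf_import".
Sub openPdfInWriter(pdfUrl As String)
	Dim args(0) As New com.sun.star.beans.PropertyValue
	args(0).Name = "FilterName"
	args(0).Value = "writer_pdf_import"
	REM pdfUrl must be a file URL, e.g. the result of ConvertToURL().
	StarDesktop.loadComponentFromURL(pdfUrl, "_blank", 0, args())
End Sub
```

A call would look like `openPdfInWriter(ConvertToURL("/path/to/file.pdf"))`, where the path is a placeholder.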


Oh! Then it’s unmanageable?!

Unfortunately yes. It’s usually imported as separate text boxes, and it’s very difficult to edit it unless you intend to replace one character with another… :frowning:

If you need to edit a PDF, buy Adobe Acrobat.

Anything else is a very poor workaround and you will tear your hair out with frustration. You will probably kick your poor cat to death as you struggle for hour after wasted hour.

No! There’s absolutely no need to pay for Acrobat when free software will do the same!

Dear Sir rautamiekka, could you please elaborate?

" … free software will do the same"

That is incorrect.

Only Adobe Acrobat gives full editing of a PDF. Free software PDF editors are very limited in what they can do and cannot do even basic things.