When I copy text from a PDF file, there is no spacing between the words?

I am trying to copy text from a PDF file. Each time I copy text from the PDF file, there is no spacing between the words. However, when I copy text from the internet, it works fine. For clarification, the link below gives a general idea:

Above is the text that I copied from the PDF file.

We have a similar problem with source code published in the PDF guides. For example, the bottom of p. 431 of the Getting Started Guide has sample Python code. When I copy and paste it into a text file, the blank line is lost and any line with leading spaces ends up with a single space before the code. Tabs are ignored. The code is formatted in Liberation Mono, a fixed-pitch font, with 4 spaces per indent.

import uno
def HelloWorld():
 doc = XSCRIPTCONTEXT.getDocument()
 cell = doc.Sheets[0]['A1']
 cell.setString('Hello World from Python')
 return

This is important for a language like Python because indents are part of the syntax.
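
To make that concrete, here is a minimal sketch of the loss described above, assuming every run of leading spaces collapses to a single space on paste. The helper `collapse_indent` and the sample function `sign` are hypothetical, made up only for this illustration; they are not from the guide.

```python
def collapse_indent(source: str) -> str:
    """Simulate the paste loss: any run of leading spaces becomes one space."""
    lines = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        # Lines that had leading spaces keep exactly one; others are unchanged.
        lines.append((" " + stripped) if stripped != line else line)
    return "\n".join(lines) + "\n"


# Hypothetical sample with two indent levels.
nested = (
    "def sign(x):\n"
    "    if x < 0:\n"
    "        return -1\n"
    "    return 1\n"
)

compile(nested, "<original>", "exec")  # the original compiles

try:
    compile(collapse_indent(nested), "<pasted>", "exec")
    pasted_ok = True
except IndentationError:
    # "if x < 0:" and "return -1" now sit at the same depth.
    pasted_ok = False
```

A function with a single indent level, like the HelloWorld sample, still compiles with one-space indents. As soon as blocks are nested, the pasted text stops being valid Python.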

Any thoughts on how to encode these documents so spaces are retained, or where to find an answer?


But of course, a PDF can be rendered with space characters encoded, and LibreOffice controls how the page is rendered when using Export as PDF.
.
[Seems I’ve lost the ability to comment on posts in this topic over the last 2 hours.]
.
The method of capturing the text is important. Exporting as PDF from Writer with default settings retains the leading spaces but drops blank lines.
Checked using: java -jar pdfbox-app-2.0.25.jar ExtractText test.pdf test1.txt
.
Version: 7.3.1.3 Win10 en-AU
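
For anyone repeating this check, a small sketch that counts what survived in the extracted text. The function name `whitespace_report` and the inline sample are my own assumptions; in practice you would pass in the contents of the extracted file (test1.txt in the command above) and compare against the source.

```python
def whitespace_report(text: str) -> dict:
    """Count the whitespace features that tend to get lost in PDF extraction."""
    lines = text.splitlines()
    return {
        "lines": len(lines),
        # Lines that are empty or contain only whitespace.
        "blank_lines": sum(1 for ln in lines if not ln.strip()),
        # Non-blank lines that start with a space or a tab.
        "indented_lines": sum(1 for ln in lines if ln.strip() and ln[0] in " \t"),
        "tabs": text.count("\t"),
    }


# Inline sample standing in for an extracted file.
sample = "import uno\n\ndef HelloWorld():\n    return\n"
report = whitespace_report(sample)
```

Running the same function on the source text and on the extracted text shows at a glance whether blank lines or indents were dropped.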


With PDF-XChange (https://pdf-xchange.eu/), leading and trailing spaces encoded in the PDF file can be copied and pasted into text files.

PDF is a page description format. It can render the page in any way that is convenient, as long as the display is equivalent. This means all “blank spacers” can be ignored provided the strings are positioned directly at the correct coordinates. PDF is a “dead-end” encoding, a “final-step” one where the frozen document is not supposed to be processed further. Consequently, there is no PDF solution to your problem. However, you may try extracting the code snippet from the .odt version, which is also downloadable from the documentation website.


I doubt it. A document processor like Writer considers space characters in source typing as separators between linguistic units. These separators can be expanded or shrunk to allow for justification. This can’t be done with U+0020 SPACE or U+00A0 NO-BREAK SPACE. It could be very approximately simulated with the spaces at U+20xx, but this is imperfect and consequently never done.

Writer is not a basic text editor where all characters are used as defined by the font. Formatting directives, such as style attributes, give instructions on how to interpret spaces.

So, perhaps, forcing the paragraph styles to left alignment could cause spaces to be rendered as U+0020 SPACE but I didn’t check that. U+0009 TAB will never appear in any output because it must be translated to some position defined in paragraph style properties (in a document processor, contrary to a text editor, tab stops are not necessarily evenly spaced).
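
As a rough illustration of why a tab cannot survive as a character, here is a sketch that resolves tabs against an explicit list of tab stops. It is a simplification: it counts character columns, whereas a document processor works with measured positions, and `expand_to_stops` with its parameters is a hypothetical helper, not anything from Writer.

```python
def expand_to_stops(line: str, stops: list[int], default_step: int = 8) -> str:
    """Replace each tab with spaces up to the next tab stop (in columns)."""
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            # Next defined stop strictly to the right of the current column.
            target = next((s for s in stops if s > col), None)
            if target is None:
                # Past the last defined stop: fall back to evenly spaced stops.
                target = (col // default_step + 1) * default_step
            out.append(" " * (target - col))
            col = target
        else:
            out.append(ch)
            col += 1
    return "".join(out)


# Unevenly spaced stops, as a paragraph style might define them.
aligned = expand_to_stops("a\tb\tc", stops=[3, 10])
```

The same tab character ends up as two spaces in one place and six in another, so only the resulting position is meaningful in the output.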

Anyway, this doesn’t solve the problem with any PDF you did not produce yourself. As I suggested, try the .odt version of the Getting Started Guide.

The user Bob_Niland__Error_7103_, in the discussion at the link

proposes the following plausible explanation:

The problem is that the PDF file may or may not have the spaces encoded as space characters, particularly at line ends, but also between words and perhaps even between characters. The rendering engine (PS or PDF driver) may have chosen to break “word1 word2” into two strings with two starting coordinates and no U+0020 space character (or alternative space characters) at all.
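
A minimal sketch of what that means for a viewer, assuming the page holds two positioned strings and no space character. The `Run` class, the coordinates, and the gap threshold are all invented for illustration; real viewers use their own heuristics.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    x: float      # left edge of the string on the page
    width: float  # rendered width of the string


def join_runs(runs, gap_threshold=1.0):
    """Join strings on one line, adding a space where the horizontal gap is wide enough."""
    parts = []
    prev_end = None
    for run in sorted(runs, key=lambda r: r.x):
        if prev_end is not None and run.x - prev_end >= gap_threshold:
            parts.append(" ")
        parts.append(run.text)
        prev_end = run.x + run.width
    return "".join(parts)


# "word1 word2" stored as two strings with two starting coordinates.
runs = [Run("word1", 72.0, 30.0), Run("word2", 105.0, 30.0)]

naive = "".join(r.text for r in runs)  # a viewer that ignores positions
rebuilt = join_runs(runs)              # a viewer that infers spaces from gaps
```

Whether the pasted text contains a space therefore depends on the viewer guessing it back from the geometry, not on anything stored in the file.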

As for copying from the internet, web pages are written in HTML and loaded in the browser as a text file (you can see it from the menu, under the page source item). From this text the browser derives the DOM object structure that is then used for display.

It is a completely different case from copying from a PDF. In a web page, spaces are encoded as characters, either ordinary (breaking) spaces or non-breaking spaces (&nbsp;).
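
A tiny sketch of the contrast, using a made-up HTML fragment: the spaces are part of the text itself, so they survive decoding and copying.

```python
import html
import unicodedata

# Hypothetical fragment with one ordinary space and one non-breaking space entity.
fragment = "word1 word2&nbsp;word3"

decoded = html.unescape(fragment)

# Names of the whitespace characters actually present in the decoded text.
space_names = [unicodedata.name(ch) for ch in decoded if ch.isspace()]
```

Both separators come out as real characters (SPACE and NO-BREAK SPACE), which is why pasting from a web page keeps the spacing.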

I don’t think it’s a topic for this forum, though.

There may also be an issue with the PDF viewer. In Okular on Linux with the KDE desktop, the choice of selection tool makes a difference. There are 3 of them: one for “just selection” (usually resulting in an image), one for text and one for tables. With the proper tool I haven’t yet encountered this problem.

Nevertheless, your explanation seems right.

It is possible that haziq copied directly from a browser displaying the PDF, or from a PDF reader that is not designed to distinguish the underlying content based on its characteristics.

I agree with you: the PDF viewer makes the difference.

I was able to fix this problem by converting the PDF file to Word format and then exporting that file to PDF again, which does the trick. Below is the link to the text that I copied without any issue:

Here are the steps I took:

  1. I uploaded the PDF file to https://pdf2doc.com/, which converted it to Word format.

  2. Then I opened the converted file in MS Word and exported it to PDF format again, which solved my problem.