Formatting issue after doc to html conversion

Hi all,
I'm using Drupal/PHP and LibreOffice to convert a DOC file to HTML, but the indentation in the result is wrong and the image formatting is also poor.
Reference docx: https://calibre-ebook.com/downloads/demos/demo.docx

I'm using this command:
'/usr/bin/soffice --headless --convert-to html:HTML:EmbedImages ' . $source_path . ' --outdir ' . $destination_path;

OS: NAME="Fedora Linux" VERSION="36 (Server Edition)"

LibreOffice 7.3.7.2 30(Build:2)

I was not able to upload the .html file, so I am uploading it as a .doc instead: demo.doc (102.7 KB)

Any help would be greatly appreciated.

Apparently your question is not at all related to the database frontend Base, so I took the liberty of changing the tag from base to writer.

Please edit your question (don’t use a comment at this stage, so that all details are in the same post) to mention the OS name and LO version.

Describe your procedure: what is your source (the .docx document?), and which tool performs which step?

  • .docx ==drupal==> .odt ==writer==> .html
  • .docx ==writer==> .odt ==drupal==> .html
  • or some other procedure?

Does your screenshot display the HTML result or the .odt converted file?

Since Drupal is a PHP-based content manager for websites, I assume PHP generates a call to LibreOffice in headless mode to convert “directly” from .docx to .html.
So there are three possible sources of error: the import of the foreign .docx into Writer's internal representation, the export from that representation to HTML, and finally possible interference from the CSS applied on the Drupal side when the HTML is displayed.
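
One way to tell these stages apart is to run the conversion by hand, outside Drupal, in two steps. The following is only a sketch: the helper name `convert_stage` and the `SOFFICE` variable are made up for illustration, and the filter string is simply the one quoted in the question.

```shell
# Path to the LibreOffice binary, as given in the question; override if yours differs.
SOFFICE="${SOFFICE:-/usr/bin/soffice}"

# convert_stage FORMAT INPUT OUTDIR
# Runs one headless conversion of INPUT to FORMAT, writing into OUTDIR.
# Quoting the arguments keeps paths containing spaces intact.
convert_stage() {
    "$SOFFICE" --headless --convert-to "$1" --outdir "$3" "$2"
}
```

For example, `convert_stage odt demo.docx ./stage-test` followed by `convert_stage html:HTML:EmbedImages ./stage-test/demo.odt ./stage-test` produces an intermediate .odt and then the .html. If the .odt already looks wrong in Writer, the problem is in the DOCX import; if only the .html looks wrong when opened directly in a browser, it is in the HTML export; if both look acceptable, the Drupal side is the likely culprit.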

@ali.rizvi Thanks for updating the question, but still not enough information. Attach the converted HTML for analysis. Don’t forget the .css file if relevant.

The converted document is not a full-fledged HTML one. It only contains translation of the DOCX text but lacks the “structure” or “skeleton” of a web page.

The following is missing:

<!DOCTYPE html>
<html lang=… xmlns=…>
<head lang=…>
  <title>Page title</title>
  <meta …>
  <link href="….css" rel=stylesheet type="text/css"> <!-- this is very important for rendering -->
</head>
<body>
    <!-- this is where the translated document should be inserted -->
</body>
</html>
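
To see quickly which of these pieces a given converted file lacks, a rough check along the following lines may help. The function name `missing_parts` is hypothetical, and the test is a plain case-insensitive substring search, not real HTML parsing (so `<head` would also match `<header`).

```shell
# missing_parts FILE
# Prints one line for each skeleton marker that does not occur anywhere in FILE.
missing_parts() {
    for marker in '<!DOCTYPE' '<html' '<head' '<link' '<body'; do
        # -F: fixed string, -i: ignore case, -q: quiet, only the exit status matters
        grep -qiF -- "$marker" "$1" || printf '%s\n' "$marker"
    done
}
```

Running `missing_parts converted.html` (with a file name of your choice) prints nothing when all five markers are present.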

Bad paragraph formatting comes from the absence of a CSS stylesheet: styling is not written into every HTML element (which would be equivalent to direct formatting and would dramatically increase the file size).

Bad placement of images also results from the lack of the CSS stylesheet which contains placement directives.

Check the modus operandi of your Drupal conversion. You probably forgot one step.
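
If it turns out that your conversion step really yields only the body fragment, the skeleton described above could be supplied by a small wrapper. This is a minimal sketch, not what Drupal does: `wrap_fragment` is a hypothetical helper, and the stylesheet URL passed to it is an assumption; you still need an actual .css file that carries the paragraph and image placement rules.

```shell
# wrap_fragment TITLE STYLESHEET_URL
# Reads an HTML fragment on stdin and writes a complete page to stdout.
# Note: TITLE and STYLESHEET_URL are inserted as-is, without HTML escaping.
wrap_fragment() {
    printf '<!DOCTYPE html>\n<html>\n<head>\n'
    printf '  <meta charset="utf-8">\n'
    printf '  <title>%s</title>\n' "$1"
    printf '  <link href="%s" rel="stylesheet" type="text/css">\n' "$2"
    printf '</head>\n<body>\n'
    cat    # the translated document goes here, unchanged
    printf '</body>\n</html>\n'
}
```

Usage would look like `wrap_fragment "Demo" "demo.css" < fragment.html > page.html`, where all three file names are placeholders.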