Programmatically break image links


I use unoconv to convert files from HTML to doc, and it works very well.

One issue I’m having with it is that it creates ‘linked’ images, whereas for my purposes, I would much prefer to have the links ‘broken’ so that the images are embedded in the file.

I have been perusing the UNO documentation, but it seems quite complex and I’m not sure where to fish out the specific functionality I am after - or if it’s even there.

I’m looking for suggestions on the simplest way to automate this on a headless (no GUI) system.


I imagine you will need a scripted solution for this, e.g., parse the HTML for <img> elements, wget each image, use sed to rewrite each link, save the HTML, and finally perform the conversion. I am not aware of any command-line functionality (that is, non-scripted) that will convert:

<img src="" alt="LO logo" />

… to:

<img src="logo.png" alt="LO logo" />

… as doing so also requires downloading the related image and saving it to the directory (as indicated). You may have your reasons for converting to the old binary DOC format (--convert-to doc:"MS Word 97"), but it is worth noting that this legacy format may not handle modern HTML well, and this problem will only become more pronounced over time.
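
The scripted steps described above (find each <img>, download it, rewrite the link, save the HTML) could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the names localize_images, download and IMG_SRC are made up for illustration, the images are assumed to be reachable over http/https, and a regular expression stands in for a real HTML parser, so unusual markup (for example a data-src attribute) may not be handled correctly.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse

# Matches the src attribute inside an <img> tag. A regex is a simplification
# and will not cope with every HTML variant.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2',
                     re.IGNORECASE | re.DOTALL)


def download(url, dest_path):
    """Default fetcher: save the resource at url to dest_path."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())


def localize_images(html, out_dir, fetch=download):
    """Download remote images and point each <img src> at the local copy."""
    seen = {}  # remote URL -> local file name

    def replace(match):
        prefix, quote, src = match.groups()
        if not src.lower().startswith(("http://", "https://")):
            return match.group(0)  # already local (or empty): leave untouched
        if src not in seen:
            name = os.path.basename(urlparse(src).path) or "image"
            if name in seen.values():
                # Two different URLs share a file name: make it unique.
                name = f"{len(seen)}_{name}"
            fetch(src, os.path.join(out_dir, name))
            seen[src] = name
        return f"{prefix}{quote}{seen[src]}{quote}"

    return IMG_SRC.sub(replace, html)
```

To use it, read the HTML file, call localize_images(html, directory_of_the_html_file), write the result back to disk in that same directory, and then run the conversion on the rewritten file as before. The fetch parameter exists only so the download step can be swapped out (for testing, or to call wget instead).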