How can you extract text from a pptx?

I have tried the command line libreoffice --headless --convert-to txt:Text file.pptx, but this seems to do nothing at all. From within LibreOffice I can save as HTML, but that largely makes images of the slides, whereas I want the text.

I’m not sure if it is exactly what you want, but there is a technique for getting text from Impress, although it is rightly described as “low-tech”.

Btw, I assume by “pptx” you mean a “presentation”, i.e., an Impress document in LibO. If you really do mean MS PowerPoint .pptx … you’re at the wrong site. ;-) You can also have a look at this Stack Overflow Q&A on ppt/pptx text extraction to see if there’s any help there.

Thank you. I did mean PowerPoint .pptx, which, as I am on Linux, I import into LibreOffice to view. I don’t have Microsoft Office at all.

A .pptx file is a zip archive of XML files, so under Linux you could script a solution using unzip, grep, etc. For example, this will print the raw XML of slide 1 to the screen:

$ unzip -p /path/to/my_pres.pptx ppt/slides/slide1.xml

You should be able to loop through slide numbers and grep paragraph text / list items from that.
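
A rough sketch of such a loop follows, not a polished tool. It assumes the slides are stored as ppt/slides/slide1.xml, slide2.xml, and so on with no gaps in the numbering, and that the visible text sits in <a:t> elements, which is the usual layout of a .pptx. The function names slide_text and pptx_text are made up for illustration, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
# Print the text runs (<a:t>...</a:t>) found in slide XML read from stdin,
# one run per line, with the surrounding tags stripped.
slide_text() {
    grep -o '<a:t>[^<]*</a:t>' | sed -e 's/<a:t>//' -e 's/<\/a:t>//'
}

# Loop over slide1.xml, slide2.xml, ... inside the archive given as $1.
# The loop stops at the first slide number that unzip cannot find.
pptx_text() {
    n=1
    while xml=$(unzip -p "$1" "ppt/slides/slide$n.xml" 2>/dev/null); do
        printf '== slide %d ==\n' "$n"
        printf '%s\n' "$xml" | slide_text
        n=$((n + 1))
    done
}
```

You would then call it as pptx_text /path/to/my_pres.pptx. Two caveats: XML entities such as &amp;amp; are left escaped in the output, and the file numbering is not guaranteed to match the order in which the slides are shown.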

See also a related Q&A, which includes the steps needed to use the “grep” solution.