How can you extract text from a pptx?

lesshaste · October 17, 2013, 5:18pm

I have tried the command line libreoffice --headless --convert-to txt:Text file.pptx but this seems to do nothing at all. From within LibreOffice I can save as HTML, but that mostly produces images of the slides, whereas I want the text.

dajare · October 17, 2013, 7:25pm

I’m not sure if it is exactly what you want, but there is a technique for getting text from Impress, although it is rightly described as “low-tech”.

Btw, I assume by “pptx” you mean a “presentation”, i.e., an Impress document in LibO. If you really do mean MS’s PowerPoint .pptx … you’re at the wrong site. You can also have a look at this StackOverflow Q&A on ppt/pptx text extraction to see if there’s any help there.

lesshaste · October 17, 2013, 9:25pm

Thank you. I did mean PowerPoint .pptx, which, as I am on Linux, I import into LibreOffice to view. I don’t have Microsoft Office at all.

oweng · October 19, 2013, 12:51am

Under Linux you could script a solution using unzip and grep etc. For example, this will pipe the contents of slide 1 to the screen:

$ unzip -p /path/to/my_pres.pptx ppt/slides/slide1.xml

You should be able to loop through slide numbers and grep paragraph text / list items from that.
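
As a rough illustration of that loop, here is a minimal Python sketch. It swaps unzip and grep for Python's standard zipfile module and XML parser, and it assumes the usual .pptx layout: slides stored as ppt/slides/slideN.xml, with text held in DrawingML a:t runs inside a:p paragraphs. The function name extract_pptx_text is made up for this example. Note that the slide file numbers need not match the order in which slides appear in the presentation.

```python
import re
import sys
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text lives in <a:t> runs inside <a:p> paragraphs.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches ppt/slides/slide1.xml etc., but not the _rels files next to them.
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(path):
    """Return [(slide_file_number, [paragraph, ...]), ...] sorted by number.

    Hypothetical helper for illustration; 'path' may be a filename or a
    file-like object, as accepted by zipfile.ZipFile.
    """
    slides = []
    with zipfile.ZipFile(path) as archive:
        numbered = []
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                numbered.append((int(match.group(1)), name))
        # Sort numerically so slide10 comes after slide2.
        for number, name in sorted(numbered):
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                # Join all text runs that belong to one paragraph.
                text = "".join(run.text or "" for run in para.iter(A_NS + "t"))
                if text.strip():
                    paragraphs.append(text)
            slides.append((number, paragraphs))
    return slides


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, paragraphs in extract_pptx_text(sys.argv[1]):
        print(f"--- slide {number} ---")
        for paragraph in paragraphs:
            print(paragraph)
```

Saved as, say, pptx_text.py (the filename is arbitrary), it would be run as python3 pptx_text.py /path/to/my_pres.pptx. Speaker notes and text inside embedded objects are stored elsewhere in the archive and are not covered by this sketch.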

dajare · June 2, 2014, 7:27pm

See also a related Q&A, which now includes the steps needed to use the “grep” solution.

AlexKemp · November 12, 2015, 1:20am