Improving file conversion speed

moemen · February 19, 2025, 3:02pm

Device specifications
64-bit operating system, x64-based processor
Processor: 13th Gen Intel(R) Core™ i7-13620H 2.40 GHz
RAM: 16.0 GB
Windows 11 Pro

I'm working in Node.js and I set up a route that converts PDF to DOCX. It works for small files, but with large files (around 20 MB) it takes a long time. I didn't wait to see exactly how long, but it is more than 2 minutes.

What can I do to make the conversion faster? Any advice?
In addition, I read somewhere that I could increase the memory settings, but that option has been removed in the latest version (25).

I don't know whether changing anything here would make the conversion faster.

Wanderer · February 19, 2025, 3:54pm

Maybe start by describing how you do this. Then somebody may have an idea whether it can be optimized. But pdf2docx is not easy…

moemen · February 19, 2025, 4:30pm

I'm not using pdf2docx; I'm calling LibreOffice commands directly for the conversion.

controller

  @Post('pdf-to-word')
  @FormDataRequest()
  async convertPdfToWord(@Body() { file }: UploadPdfDto) {
    const result = await this.fileConvertorService.convertPdfToDocx(
      file.buffer,
    );
    return result;
  }

service

  async convertPdfToDocx(
    pdfBuffer: Buffer,
  ): Promise<{ fileUrl: string; outputPath: string }> {
    let inputPath: string | undefined;
    let outputPath: string | undefined;

    try {
      // Create directories and get paths
      const { uploadsDir, tempDir } = await ensureDirectories();

      // Create temporary input file
      const { fileName, filePath } = await createTempFile(tempDir, pdfBuffer);
      inputPath = filePath;

      // Determine output path
      const outputFileName: string = fileName.replace('.pdf', '.docx');
      outputPath = path.join(uploadsDir, outputFileName);

      // Execute conversion
      const command: string = constructCommand(inputPath, uploadsDir);
      await executeLibreOfficeCommand(command);

      // Verify and clean up
      await verifyConvertedFile(outputPath);
      await cleanUpFile(inputPath);

      const fileUrl: string = `/uploads/${outputFileName}`;
      return { fileUrl, outputPath };
    } catch (error) {
      if (inputPath && outputPath) {
        await handleConversionError(inputPath, outputPath, error as Error);
      }
      throw error;
    }
  }

some utilities

import { HttpException, HttpStatus, Logger } from '@nestjs/common';
import { exec } from 'child_process';
import { randomUUID } from 'crypto';
import * as fs from 'fs-extra';
import * as path from 'path';
import { promisify } from 'util';

export async function ensureDirectories() {
  const uploadsDir = path.join(process.cwd(), 'uploads');
  const tempDir = path.join(process.cwd(), 'temp');

  try {
    await fs.ensureDir(uploadsDir);
    await fs.ensureDir(tempDir);
    Logger.log('Directories created successfully');
    return { uploadsDir, tempDir };
  } catch (error) {
    Logger.error('Error ensuring directories:', error);
    throw new HttpException(
      'Failed to create necessary directories',
      HttpStatus.INTERNAL_SERVER_ERROR,
    );
  }
}

export async function createTempFile(
  tempDir: string,
  fileBuffer: Buffer,
  fileExtension: string = 'pdf',
) {
  const fileName = `${randomUUID()}.${fileExtension}`;
  const filePath = path.join(tempDir, fileName);

  try {
    await fs.outputFile(filePath, fileBuffer);

    if (!(await fs.pathExists(filePath))) {
      throw new HttpException(
        `File not created at ${filePath}`,
        HttpStatus.INTERNAL_SERVER_ERROR,
      );
    }
    Logger.log(`Temporary file successfully created at ${filePath}`);
    return { fileName, filePath };
  } catch (error) {
    Logger.error(`Error creating temporary file:`, error);
    throw error;
  }
}

export function constructCommand(inputPath: string, outputDir: string) {
  const command = `"C:\\Program Files\\LibreOffice\\program\\soffice.exe" --headless --convert-to docx --infilter="writer_pdf_import" "${inputPath}" --outdir "${outputDir}"`;
  return command;
}

export async function executeLibreOfficeCommand(command: string) {
  try {
    const execAsync = promisify(exec);
    const { stdout, stderr } = await execAsync(command);
    // Interpolate the output: Nest's Logger treats a trailing string argument as the log context.
    Logger.log(`Conversion stdout: ${stdout}`);
    if (stderr) Logger.warn(`Conversion stderr: ${stderr}`);
  } catch (error) {
    throw new HttpException(
      `LibreOffice command failed: ${error}`,
      HttpStatus.INTERNAL_SERVER_ERROR,
    );
  }
}

export async function cleanUpFile(filePath: string) {
  try {
    if (await fs.pathExists(filePath)) {
      await fs.unlink(filePath);
      Logger.log(`Successfully cleaned up file: ${filePath}`);
    }
  } catch (error) {
    Logger.error(`Error cleaning up file ${filePath}:`, error);
    // Not throwing here to avoid disrupting the main flow
  }
}

export async function verifyConvertedFile(filePath: string) {
  if (!(await fs.pathExists(filePath))) {
    throw new HttpException(
      `Converted file not found at ${filePath}`,
      HttpStatus.INTERNAL_SERVER_ERROR,
    );
  }
  Logger.log(`Verified converted file exists at: ${filePath}`);
}

export async function handleConversionError(
  tempFilePath: string,
  outputFilePath: string,
  error: Error,
) {
  await cleanUpFile(tempFilePath);
  await cleanUpFile(outputFilePath);
  Logger.error('Conversion process failed:', error);
  throw error;
}

Wanderer · February 19, 2025, 6:36pm

So this is the shell-command you construct and use:

moemen · February 20, 2025, 6:55am

Yeah, but is it normal for converting a 20 MB file to take more than 10 minutes? Can I do anything?

mikekaganski · February 20, 2025, 7:07am

You can file a performance issue, attaching a sample PDF and the exact command line (please make it simple for others; if the problem is visible using a plain command line, don't mention wrappers like Node.js, which change nothing except making reproduction more difficult).

You can also jump in and try to debug and fix that yourself (though I'd suspect that this specific task is not a beginner-level one). We always welcome people who are ready to get their hands dirty in the code.

Another option is to contract someone to do this for you. This way, you also contribute to the project, because all such improvements funded by someone eventually make the project better.

But … the idea of converting a PDF to a text document itself … you complain about the speed, but did you test the end result? Are you sure that, once the speed has been improved to 0.05 s, the end result would be satisfactory? Please check using a simple PDF, and see whether a document with myriad text boxes is OK for your needs. Just to have correct expectations.

moemen · February 20, 2025, 7:27am

When I run the command above on a file of about 1.5 MB, it takes 3 s to convert.

But the problem appears when I try a large file (20 MB): the conversion seems to take far too long. So I ran it directly from the command line instead of through my Node.js app, with the same result. On top of that, I can't stop it with Ctrl+C; I have to close the window and confirm (it's normal, but I think it crashed).

You're asking whether I improved the speed to 0.05 s? No, of course I didn't. I don't know if there's a way to make LibreOffice faster, or to let it use more memory if that is the problem, or what…

I thought at first that it was related to my Node.js app, but after running the conversion directly from the command shell, I'm sure it isn't…

I don't know whether changing anything in LibreOffice's advanced options can help.
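
On the point about not being able to stop a stuck conversion: a minimal sketch, assuming the Node.js side keeps calling soffice as in the utilities posted earlier, is to launch it with execFile and a timeout so that a runaway job is killed instead of blocking the request indefinitely. The names buildConvertArgs and convertWithTimeout and the 2-minute default are hypothetical; the soffice path and flags are the ones from the command in this thread. This does not make the conversion itself any faster.

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

// Path taken from the command posted earlier in the thread.
const SOFFICE = 'C:\\Program Files\\LibreOffice\\program\\soffice.exe';

// Build the argument list. execFile passes each entry as-is (no shell),
// so paths containing spaces need no manual quoting.
export function buildConvertArgs(inputPath: string, outputDir: string): string[] {
  return [
    '--headless',
    '--convert-to', 'docx',
    '--infilter=writer_pdf_import',
    inputPath,
    '--outdir', outputDir,
  ];
}

// Run the conversion and give up after timeoutMs (hypothetical default: 2 minutes).
export async function convertWithTimeout(
  inputPath: string,
  outputDir: string,
  timeoutMs: number = 120000,
): Promise<void> {
  try {
    await execFileAsync(SOFFICE, buildConvertArgs(inputPath, outputDir), {
      timeout: timeoutMs, // Node kills the child process once this elapses
      windowsHide: true,
    });
  } catch (error) {
    // On a timeout, Node sets `killed` on the error it rejects with.
    if ((error as { killed?: boolean }).killed) {
      throw new Error(`Conversion timed out after ${timeoutMs} ms`);
    }
    throw error;
  }
}
```

Whether killing the launcher also stops every LibreOffice helper process on Windows is not verified here, so it is worth checking Task Manager after a timeout.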

mikekaganski · February 20, 2025, 7:29am

No, I'm not asking that. I'm asking whether you are certain that you checked the conversion result (using a simpler file) and were satisfied with it, i.e., whether the effort of improving the speed is worth it at all.

moemen · February 20, 2025, 7:31am

Yeah, of course, I'm satisfied with the result for simple files, no problem at all there. The only problem is when I try to upload a large file (20 MB).

In addition, the large file I'm trying to convert contains only text, nothing special. I downloaded it from the internet to test how the application handles large files.

mikekaganski · February 20, 2025, 7:32am

Then I listed your options in my answer above.

fpy · February 20, 2025, 11:40am

yep, just a regular 6000+ pages document

Coming from PDF, it's a pile of text areas.

Wondering how you rate the “problem”. It takes:
~10 s for 100 pages
20 s for 200 pages
but …
2 min for 500 pages
10 min for 1000 pages

So, to answer your question: split your PDFs and parallelize the conversion.
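
The split-and-parallelize idea can be sketched on the Node.js side roughly as follows. This is only an outline under assumptions: pageRanges and mapWithLimit are hypothetical helper names, and the actual splitting (for example with a tool such as pdftk) and the soffice call are left to a caller-supplied function, since neither is shown in this thread.

```typescript
// Split a page count into inclusive [first, last] ranges of at most chunkSize pages.
export function pageRanges(totalPages: number, chunkSize: number): Array<[number, number]> {
  if (totalPages < 0 || chunkSize < 1) {
    throw new RangeError('totalPages must be >= 0 and chunkSize >= 1');
  }
  const ranges: Array<[number, number]> = [];
  for (let first = 1; first <= totalPages; first += chunkSize) {
    ranges.push([first, Math.min(first + chunkSize - 1, totalPages)]);
  }
  return ranges;
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
export async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane keeps pulling the next unclaimed index until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Hypothetical usage, where convertPages(first, last) would extract that page
// range into its own PDF and run soffice on it:
//   await mapWithLimit(pageRanges(1000, 100), 4, ([first, last]) => convertPages(first, last));
```

Going by the timings above, and assuming every 100-page chunk costs about the same 10 s as the first one, ten chunks would take roughly 10 × 10 s = 100 s even one after another, against about 10 min for the 1000 pages in one piece. Joining the resulting DOCX parts back together is not covered here, and LibreOffice instances started in parallel may need separate user profiles.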

moemen · February 20, 2025, 12:03pm

Look, I actually don't have that kind of experience with LibreOffice itself. I mean, I'm not doing the conversion in my code; LibreOffice does it, and I can't control it, unless there's something I don't know about that would let me do what you're suggesting.

fpy · February 20, 2025, 12:32pm

pdftk-java / pdftk-java · GitLab

moemen · February 20, 2025, 12:51pm

I'm using Node.js, not Java. Also, from what I can see, you sent a Java package that does conversion? I don't want that; LibreOffice should handle it without any external package…

mariosv · February 20, 2025, 11:41pm

Bug 165347 - conversion from pdf to docx takes too much and there’s crashing

Opening such a PDF also takes a lot of time with GIMP or Word.