Batch Convert, pass folders in parallel

oerdem19 · January 23, 2024, 1:27am

Hi, I am trying to convert thousands of RTF documents to TXT by passing the folder with a wildcard. I keep getting "Error: source file could not be loaded".
When I pass a single file, it works.
Here is the code I use to pass the folder, which produces the error:
d = tempfile.mkdtemp()
subprocess.run(['soffice', '-env:UserInstallation=file://'+d, '--headless', '--convert-to', 'txt', '/home/oe/DesktopAudioFiles/Test/*.rtf', '--outdir', directory_list])

When I try it with a single file, it works:
d = tempfile.mkdtemp()
subprocess.run(['soffice', '-env:UserInstallation=file://'+d, '--headless', '--convert-to', 'txt', '/home/oe/DesktopAudioFiles/Test/1.rtf', '--outdir', directory_list])

I am trying to understand why I cannot pass wildcards as a name.

My second question: when I pass individual file names and parallelize the conversion with Python multiprocessing, it hangs partway through, with no error at all. Does anyone have a suggestion for converting RTF to TXT as fast as possible?

Thank you,

oerdem19 · January 23, 2024, 3:49am

I found a solution (with help from ChatGPT).
Here is the code that I got.
Credit goes to the author in

#!/bin/bash
shopt -s nullglob
baseDirectory="/mnt/h/AudioFiles/Test/Test4"

# Check if directory is specified.
if [ -z "$1" ]; then
    echo "No input directory specified."
    exit 1
fi

# Copy the files into numbered batch directories, at most 200 files each.
# This is done because Linux has a shell limit of 249 with LibreOffice.
# Quote "$1" so that input directories containing spaces still work.
rtfs=("$1"/*.rtf)
n=0
for ((i=0; i < ${#rtfs[@]}; i += 200)); do
    printf -v b "${baseDirectory}/balancer/%03d" $((++n))
    mkdir -p "$b" && cp "${rtfs[@]:$i:200}" "$b/"
done

# Function to convert files and log time
convert_files() {
    local d=$1
    local id=${d: -3}
    local start_time=$(date +%s)

    soffice "-env:UserInstallation=file://${baseDirectory}/environments/${id}" \
            --headless --convert-to "txt:Text (encoded):UTF8" "$d"/*.rtf \
            --outdir "$d" > "${baseDirectory}/output_${id}.log" 2>&1
    # Save soffice's exit status now; the echo below would overwrite $?.
    local status=$?

    local end_time=$(date +%s)
    local duration=$((end_time - start_time))
    echo "Conversion of batch $id took $duration seconds."

    if [ "$status" -ne 0 ]; then
        echo "Error in converting files in $d" >> "${baseDirectory}/error_log.txt"
    fi
}

# Process in batches of 10
PIDS=()
count=0
for d in "${baseDirectory}"/balancer/*; do
    ((count++))
    convert_files "$d" &
    PIDS+=($!)
    if [ $count -eq 10 ]; then
        wait "${PIDS[@]}"
        PIDS=()
        count=0
    fi
done
# Wait for any remaining processes
wait "${PIDS[@]}"

echo "Conversion done"
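
For anyone who would rather stay in Python, the batching idea from the script above can be sketched roughly as follows. This is only a sketch under assumptions, not a tested replacement: the helper names (chunked, convert_batch, convert_all) are made up for illustration, the batch size of 200 and the limit of 10 parallel jobs are copied from the script, and each batch gets its own throwaway profile directory, as the script does with its environments folders. Unlike the script, it does not copy files into balancer directories, and it starts a new batch as soon as a worker is free instead of waiting for a whole group of 10 to finish.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def convert_batch(files, out_dir):
    """Convert one batch with its own temporary LibreOffice profile.

    Returns the exit status of soffice for this batch.
    """
    with tempfile.TemporaryDirectory() as profile:
        cmd = [
            "soffice",
            # A separate profile per batch, so parallel instances do not share one.
            "-env:UserInstallation=" + Path(profile).as_uri(),
            "--headless",
            "--convert-to", "txt:Text (encoded):UTF8",
            *map(str, files),          # file names already expanded in Python
            "--outdir", str(out_dir),
        ]
        return subprocess.run(cmd, capture_output=True, text=True).returncode


def convert_all(input_dir, out_dir, batch_size=200, workers=10):
    """Convert every .rtf in input_dir, running up to `workers` batches at once."""
    files = sorted(Path(input_dir).glob("*.rtf"))
    # Threads are enough here: the real work happens in the soffice processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda batch: convert_batch(batch, out_dir),
                             chunked(files, batch_size)))
```

convert_all returns one exit status per batch, so a non-zero entry points at the batch that needs a closer look.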

mikekaganski · January 23, 2024, 4:50am

I believe that what you see is the lack of glob expansion in LibreOffice. On Linux, LibreOffice relies on the shell to expand *.foo on the command line into 1.foo 2.foo 3.foo before the program is started, so LibreOffice itself has no code to process * or ? in filenames there. When Python passes /home/oe/DesktopAudioFiles/Test/*.rtf to LibreOffice verbatim, LibreOffice tries to open a file with literally that name (which is a valid filename on Linux, by the way: only / and \0 are prohibited at the kernel level).
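
A quick way to see this for yourself, without involving LibreOffice at all, is to start a small child process from Python and let it report the argument it received. The names CHILD and arg_seen_by_child are made up for this illustration, and the path is just a placeholder.

```python
import subprocess
import sys

# A tiny child program that only prints the first argument it was given.
CHILD = "import sys; print(sys.argv[1])"


def arg_seen_by_child(arg):
    """Start a child process with `arg` (list form, no shell) and return what it saw."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, arg],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# No shell is involved, so the pattern reaches the child unchanged.
print(arg_seen_by_child("/some/dir/*.rtf"))
```

The child prints the pattern exactly as it was written, which is what soffice receives from the subprocess.run call in the question.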

It was also the case on Windows, until recently. Since the standard Windows shell does not expand wildcards, this was always a problem when using LibreOffice (and OpenOffice.org) to process whole directories (tdf#48413). So the solution was to handle wildcards explicitly on Windows, which was an OS-specific change that did not alter Linux, macOS, etc.

And indeed, the old solutions that were previously suggested to Windows users can now be applied on Linux et al.: expand the globs in your own code, and then pass the resulting file names to LibreOffice. Or, as you did in your own solution, just use a shell script (relying on its glob preprocessing) instead of a Python script.
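
Expanding the glob in your own code could look roughly like this in Python. It is a minimal sketch: build_convert_command is a made-up helper name, and the soffice options are the ones from the question. The function only builds the argument list, so you can inspect it before running anything.

```python
import glob
import os


def build_convert_command(input_dir, out_dir, profile_dir, pattern="*.rtf"):
    """Expand the wildcard in Python and return the soffice argument list."""
    files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r} in {input_dir!r}")
    return [
        "soffice",
        "-env:UserInstallation=file://" + profile_dir,
        "--headless",
        "--convert-to", "txt",
        *files,                      # one argument per matching file
        "--outdir", out_dir,
    ]


# Usage with the names from the question (d and directory_list):
#   d = tempfile.mkdtemp()
#   cmd = build_convert_command('/home/oe/DesktopAudioFiles/Test', directory_list, d)
#   subprocess.run(cmd)
```

With thousands of files the resulting command line can get very long, so it may still make sense to split the list into smaller batches, as your shell script does.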