Follow-up to sheets issue

Good Morning,

This is a follow-up to the issue mentioned in How to ensure SDBC connection is closed on a connection to an excel file? - English - Ask LibreOffice
and to associated issues like setting the connection timeout.

The following problems were encountered before I found a solution. The source problem: I have 454 Excel workbooks with 18 sheets per workbook and several hundred rows of data in each worksheet. The approach I thought would be most effective was to open an SDBC connection to each workbook, cycle through each worksheet, and insert the data into Base (embedded Firebird) for further processing. I initially wrote this in Basic, and while it did work, it was extremely slow. It might be fine for a few dozen workbooks, but not for more than, let's say, 100.

So I thought, let's move to Python. That was faster, but after about 90 workbooks an attempt to make an SDBC connection to a workbook would always fail. I thought that enabling pooling and a connection timeout (hence the earlier question about setting the timeout via code) would resolve the problem. It did not; UNO would seemingly get stuck while trying to make a connection. I even tried making UNO connections and storing them in a dictionary (bad idea, don't do that).
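For reference, here is a minimal sketch of the per-workbook pattern I mean (not my exact code). It assumes a soffice instance already listening on a socket (e.g. started with soffice --accept="socket,host=localhost,port=2002;urp;"), that the spreadsheet SDBC driver exposes each sheet as a table, and that the sheet names are known in advance; the helper names are mine.

```python
import uno

def uno_context(port=2002):
    """Bridge into a running soffice instance over a socket."""
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    return resolver.resolve(
        "uno:socket,host=localhost,port=%d;urp;StarOffice.ComponentContext" % port)

def read_workbook(ctx, xlsx_path, sheet_names):
    """Open one SDBC connection to a workbook, read every sheet, close."""
    manager = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.sdbc.DriverManager", ctx)
    conn = manager.getConnection(
        "sdbc:calc:" + uno.systemPathToFileUrl(xlsx_path))
    try:
        for name in sheet_names:
            stmt = conn.createStatement()
            rs = stmt.executeQuery('SELECT * FROM "%s"' % name)
            cols = rs.getMetaData().getColumnCount()
            while rs.next():
                row = [rs.getString(i) for i in range(1, cols + 1)]  # columns are 1-based
                # ...insert row into the target database here...
    finally:
        conn.close()  # the close that eventually stopped working for me
```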

Since I found better performance with Python, I abandoned Basic. One way I sped up my code even more was to give up on embedded Firebird and use PostgreSQL. This helped a lot in both speed and brevity: I needed much less code to accomplish the same task.

The problem of the UNO connection to the Excel workbooks was still preventing me from properly importing all the data. So I thought: instead of making a connection, importing data, closing the connection, and starting the process over again (which I knew would eventually fail), why not export all the sheets to a folder (this is where I asked the question about storing all of the sheets of an Excel file as multiple CSVs), make a single SDBC flat-file connection, cycle through all of the CSVs (8,172 in total) on that one connection, and then close it and be done? That did not work either. After going through about 120 CSV files, LO would just quit and disappear, with no crash message or warning. After a few attempts at that I finally gave up on that route.
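The flat-file attempt looked roughly like this (a sketch, not the exact code, reusing the uno_context() helper from the sketch above; the driver settings passed in the info sequence are my assumption about how to tell the flat driver the extension and delimiters when connecting to it directly, and the table name is a placeholder):

```python
import uno
from com.sun.star.beans import PropertyValue

def prop(name, value):
    p = PropertyValue()
    p.Name = name
    p.Value = value
    return p

ctx = uno_context()  # helper from the earlier sketch
manager = ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.sdbc.DriverManager", ctx)
# one connection to the whole folder; each CSV shows up as a table
conn = manager.getConnectionWithInfo(
    "sdbc:flat:" + uno.systemPathToFileUrl("/path/to/csv_folder"),
    (prop("Extension", "csv"), prop("HeaderLine", True),
     prop("FieldDelimiter", ","), prop("StringDelimiter", '"')))
stmt = conn.createStatement()
rs = stmt.executeQuery('SELECT * FROM "book001_sheet01"')  # placeholder name
# ...read rows as in the previous sketch; after ~120 files LO just vanished
conn.close()
```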

So what to do? PostgreSQL has a COPY command which speedily imports CSV data. So that meant I was done using any form of UNO connection, and I used the psycopg2 library instead.
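The import side then reduces to a few lines. A minimal sketch, assuming a staging table that already matches the CSV column layout (connection string, table, and file names are placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur:  # the connection context manager commits on success
    with open("book001_sheet01.csv", newline="") as f:
        cur.copy_expert(
            "COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```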

So the process is:

  1. export the Excel files as multiple CSVs (see the sketch after this list)
  2. use PostgreSQL COPY to import the CSVs into the database via a psycopg2 connection (as sketched above)
  3. continue with data processing
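For step 1, here is a sketch of how the per-sheet CSV export can be done over UNO: the CSV filter always exports the active sheet, so you activate each sheet in turn and store the document. This is not my exact code; it reuses the uno_context() and prop() helpers from the sketches above, and the output naming is illustrative:

```python
import os
import uno

def export_sheets_to_csv(ctx, xlsx_path, out_dir):
    """Store every sheet of one workbook as its own CSV file."""
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)
    doc = desktop.loadComponentFromURL(
        uno.systemPathToFileUrl(xlsx_path), "_blank", 0,
        (prop("Hidden", True),))
    try:
        stem = os.path.splitext(os.path.basename(xlsx_path))[0]
        for sheet in doc.Sheets:
            doc.CurrentController.setActiveSheet(sheet)  # CSV filter saves the active sheet
            target = os.path.join(out_dir, "%s_%s.csv" % (stem, sheet.Name))
            doc.storeToURL(
                uno.systemPathToFileUrl(target),
                (prop("FilterName", "Text - txt - csv (StarCalc)"),
                 prop("Overwrite", True)))
    finally:
        doc.close(False)
```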

The time-limiting factor in this solution is the conversion of the 454 workbooks to CSV. Everything else completed very quickly.

I think Pandas will do a good job of importing information from Excel files (or CSV files) into PostgreSQL.
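Something along these lines (a sketch; the connection string and names are placeholders, and reading .xlsx needs openpyxl installed):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
sheets = pd.read_excel("book001.xlsx", sheet_name=None)  # dict: name -> DataFrame
for name, df in sheets.items():
    df.to_sql("staging_table", engine, if_exists="append", index=False)
```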


There seems to be some kind of issue with Pandas and LO's bundled Python. I know there is an extension for Pandas; I will take a look at that eventually.

But for converting from Excel to CSV or PostgreSQL with Pandas, why make this more complicated by solving it inside LO? Use Python/Pandas directly, without LibreOffice…


Security policy. The version of LO at work is the PortableApps version: no admin privileges needed to install, etc.

I did a test just a few minutes ago. I have large amounts of data, so I checked PostgreSQL COPY performance against Pandas. Pandas did not come anywhere near the performance of COPY. So while Pandas is useful, it would not be appropriate for my current work. PostgreSQL COPY it is.
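For what it's worth, the comparison was along these lines (a sketch of the method, not the exact test code; names are placeholders):

```python
import time
import pandas as pd
import psycopg2
from sqlalchemy import create_engine

def time_copy(csv_path):
    """Wall time to load one CSV via PostgreSQL COPY."""
    conn = psycopg2.connect("dbname=mydb user=myuser")
    t0 = time.perf_counter()
    with conn, conn.cursor() as cur, open(csv_path, newline="") as f:
        cur.copy_expert(
            "COPY copy_test FROM STDIN WITH (FORMAT csv, HEADER true)", f)
    return time.perf_counter() - t0

def time_pandas(csv_path):
    """Wall time to load the same CSV via pandas.to_sql."""
    engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
    t0 = time.perf_counter()
    pd.read_csv(csv_path).to_sql(
        "pandas_test", engine, if_exists="replace", index=False)
    return time.perf_counter() - t0

print("COPY:   %.2fs" % time_copy("big.csv"))
print("pandas: %.2fs" % time_pandas("big.csv"))
```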