Overall strategy for importing new data in .csv files into existing embedded LO database

I published a Basic snippet which I use on a daily basis to import different flavours of csv into the same database table:
[Basic, SQL] CSV import into some database
This code snippet does not help anybody who cannot analyse a text file and write the SQL needed to convert specific text files into valid data types of a specific target table. The major part of the programming needs to be done in SQL. All details depend on the exact details of the csv and the database engine in use.

P.S. The LibreOffice version 3, 4, 5, 6 or 7 does not matter. Base did not change very much since 2005 (OpenOffice.org 2.0). Everything depends on the csv details and the database engine in use. A Base document can be connected to a wide variety of different databases.

I’m sorry @Villeroy: You are partially correct that there are complexities in the variety of formats that various organisations describe as .csv.

I still disagree strongly with your view that understanding CSV is the WHOLE of the problem. As discussed elsewhere, I had a spreadsheet where the dates worked as dates in that spreadsheet but where, when copy/pasted into Base, those columns were only recognised as text; a View was needed simply to turn those columns back into dates. That was NOT a .csv issue, because within the spreadsheet date arithmetic worked fine (+1 for tomorrow, etc.)
For many users, especially if they have LO set up for their own country’s locale AND if the organisation generating the .csv file also has a similar locale then LO successfully hides those complexities (most of the time). Such lucky users can often get away with selecting a few relevant options to tick on importing from a csv. As indeed I did for many years.

Until, of course, some overlooked detail jumped up and bit me… as happened to me in late 2022 after years of the same technique always working. When that happens it may or may not be necessary to dig into .csv complexities: but when it does, please do not close other people’s minds to the possibility that there may be other issues besides. LO is a hugely complicated collection of software, and for historical reasons Base remains less well integrated into LO than (for example) Access is within MS Office.

You are clearly our resident wizard on .csv issues: but just because you are an expert on hammers please do not assume that everything is a nail. Screws hold better if driven in with a screwdriver rather than hammered home (although hammering a screw home apparently works, carpenters tell me the result is less strong than using the correct tool)

Having said that: your expertise has contributed to this discussion: and I am especially grateful to you for your pointer to the CREATE TEXT TABLE extension available in HSQLDB. That creates a third way forward.

For an “overall strategy” the concrete csv and the database engine are the only things that matter. Without specifying the database engine we do not even know which software we are discussing.

There is no need to create any text table when you copy correctly imported data from Calc into Base. In case of embedded Firebird, any dates need to be numbers formatted as ISO dates.

  1. Import correctly into Calc. This seems to be a major hurdle for many users.
  2. Copy the correctly imported data, select the target table and paste.
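The ISO-date point in step 1 can be checked outside LO as well. A minimal Python sketch (the D/M/Y input layout is an assumption for illustration) of turning locale text dates into the ISO form that embedded Firebird accepts:

```python
from datetime import datetime

def to_iso(dmy: str) -> str:
    """Convert a D/M/Y text date such as '7/8/1999' into the ISO form '1999-08-07'."""
    return datetime.strptime(dmy, "%d/%m/%Y").date().isoformat()

print(to_iso("7/8/1999"))    # 1999-08-07
print(to_iso("25/12/2022"))  # 2022-12-25
```

The same conversion is what Calc does internally when a column is imported as a true date rather than as text.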

The strategy without Calc is a lot more complicated but in the long run easier to use.

I think that “embedded LO database” in the title is fairly explicit about which database engine we are using. Nobody is using embedded Firebird these days, and many current Base users do not even remember it. Unless you know of plans for a further change in the embedded engine, the use of HSQLDB seems a stable assumption.

Please avoid mentoring answerers here and teaching them what they should assume. Most regulars here have enough experience to finally stop assuming anything, and to start requiring explicit statements from askers about the important details - just because everyone here is too familiar with misunderstandings coming from assumptions on both sides.

Please don’t act as if others should also assume any (any) level of your experience; it is simply the asker’s duty to supply as much information as possible, and then some more when explicitly asked for clarifications.

  1. It seems that @Villeroy keeps arguing without really trying to read and comprehend the post, like “I saw mentions of CSV and database and dates, and I don’t even need to read to know where the problem lies”. Namely:

This is simply a bare-bones summary of the original post, where steps 1 to 3 explain a method to import a CSV correctly (as per @Villeroy’s #1), and steps 4 to 12 explain @Villeroy’s #2 (modulo creating the table in the process, vs. pasting into an existing table).

The very idea of @trueriver was to expand on how important it is to do steps #1 and #2 correctly; and continued disagreement from @Villeroy looks simply inappropriate, ignoring the opponent’s text in its entirety.

  1. Note some of the related issues that intend to help users discover the controls for correct CSV import: tdf#152336, tdf#152337, tdf#150652.

It seems there are four strategies I can choose among, or indeed five if I include the zero option of continuing to copy/paste from Calc.

  1. Load into Calc using the various options offered when opening a .csv file there, paying particular attention to getting the separators and date formats recognised correctly. Select all, then copy/paste into Base, creating a new table or appending to a suitable existing one.

  2. Create a suitable folder to contain the .csv file or files and then create a “database connector” file to that folder by using the options in the dialog when opening Base. That makes LO regard that folder as a read-only text database. This works best when the locale set for LO matches (or is close enough to) that of the organisation that generated the .csv file/s – then use copy/paste to move that data from the intermediate database to the destination one.

  3. Start as in option 1, but then write suitable macros in the destination database to do the transfer at the touch of a button from within forms. Thanks to @RobertG for this suggestion and the further details given in other comments he has made

  4. Use the HSQLDB non-standard SQL “CREATE TEXT TABLE …” to create a text table based on a .csv file that is placed in the same folder as the .odb file of the destination database. Then use a View to do any necessary transmogrification of the data, and the standard SQL “INSERT INTO <target table> (<columns>) SELECT * FROM <view>”. Big thanks to @Villeroy for this suggestion and the very helpful links to some macro code to do this, which you will find in his comments on my question.
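For the curious, option 4 might look roughly like the following HSQLDB sketch. Every name here (the csv file name, the table and column names, the zero-padded DD/MM/YYYY date layout) is a hypothetical illustration, and the text-table options must be adapted to the actual file:

```sql
-- Raw text table over sales.csv (semicolon-separated, first line = headers);
-- every column is read as plain text.
CREATE TEXT TABLE "RAW_SALES" ("ID" VARCHAR(20), "SOLD_ON" VARCHAR(10), "AMOUNT" VARCHAR(20));
SET TABLE "RAW_SALES" SOURCE "sales.csv;fs=\semi;ignore_first=true;encoding=UTF-8";

-- A view rearranges DD/MM/YYYY text dates into ISO order and casts the types.
CREATE VIEW "V_SALES" AS
  SELECT "ID",
         CAST(SUBSTRING("SOLD_ON", 7, 4) || '-' ||
              SUBSTRING("SOLD_ON", 4, 2) || '-' ||
              SUBSTRING("SOLD_ON", 1, 2) AS DATE) AS "SOLD_ON",
         CAST("AMOUNT" AS DECIMAL(10,2)) AS "AMOUNT"
  FROM "RAW_SALES";

-- The standard-SQL transfer into the real table.
INSERT INTO "SALES" ("ID", "SOLD_ON", "AMOUNT") SELECT * FROM "V_SALES";
```

This is a sketch, not a recipe: the SUBSTRING arithmetic assumes zero-padded dates, and the exact SET TABLE options depend on the separator, quoting and encoding of the particular file.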

Each of the options has its merits, in my opinion, and corresponding disadvantages.

  1. The spreadsheet option gives a more flexible interface to importing .csv files, if the user actually uses it instead of just taking the default options. In contrast there are far fewer options when creating a connector database in option 2; and even fewer in option 4, unless you are a real SQL wizard and can write the workarounds in an SQL view.

  2. Is by far the easiest conceptually for someone not interested in the inside details of SQL, let alone LO and the Java of HSQLDB. You can still create a view on the imported data if it needs further massaging once it arrives in the destination database, and if you create that view using the LO GUI then you can be confident that it will survive future upgrades of LO. This option leaves the door open to future migration to option 3 once you get other parts of your software working. It also works for copy/pasting data into a connected MariaDB engine and presumably any other external database (though that is beyond the scope of my question). I have tested this using LO 6.4 on Linux Mint, copy/pasting data into both the embedded HSQLDB engine and into an external MariaDB (Ver 15.1 Distrib 10.3.37-MariaDB).

  3. Is only straightforward if you are already at least partly familiar with LO Basic macros. If/when I attempt this my preference would be to write the code in Python and hook it in from a very simple Basic command that runs the Python code. Or even Perl! Although I am fluent in both Perl and Python, one look at LO Basic sends me back to the 1970s and the BASIC of that era. Perhaps that is merely prejudice on my part; but a language looking good is personally important to me when I have the choice. Your preferences may vary of course: Libre means “free as in speech” :slight_smile:

  4. Is only straightforward if you are an SQL wizard and can use that to work around the limitation that this method only recognises dates if they are in ISO format (2022-12-25 rather than either 25/12/2022 or 12/25/2022). If you write that wizardry within the Edit as SQL option of Views then you can be reasonably confident it will survive future migration. Alternatively, if you stick to standard SQL only, then CREATE VIEW will work from Tools->SQL, and by saving the SQL code elsewhere you should survive any future migration of LO. For those sufficiently knowledgeable about writing SQL and about .csv complexities, this option offers by far the best resilience against one other potential migration issue: what happens if the idiots at the data provider later change some pernickety detail of their .csv format? Long experience suggests they will make it work in MS Office and one or two other MS packages, but leave free software users drowning. @Villeroy has used this method “for years” so it is well tested, and he is also quite an active volunteer here, so I am sure he would answer new questions if you can’t make it work yourself.

A request from me: I want this question and answer to remain at the strategic level, to help other users choose a strategy. Feel free to comment here if you want to add another option, or a major advantage or disadvantage of any of the above. PLEASE do not ask detailed questions of implementation here – to keep this thread tidy, start a new question. By all means link back to these options if you do that.

Almost my final point: you may be interested to know why I am so concerned about future migration. That is because I have been a user of OO Base and later LO Base for years. I survived the migration from embedded Firebird to embedded HSQLDB, and that migration was a nightmare for me. The end result is well worth the trouble, but I do not want a similar nightmare again. As I have no control over what the lovely LO devs will do in future, I want to do everything I can to make any future transition as seamless as possible.

If you want to use the software you are writing for longer than the expected lifetime of your operating system and for longer than the expected lifetime of your current version of LO then you need to think about this.

For me the issue is that some of the accounts I prepare have to be kept for six or seven years. If the HMRC (our tax people) ask for details within that time I am legally obliged to help them understand why the summary figures I gave them are valid. I therefore want to be sure that my accounts database will still work seven years from the day I start work on a particular year’s data.

If you are doing something that you know you will not need to use past your next OS upgrade then that is not such an issue for you. Answer the following questions for yourself first: What is the long-term support promise from your OS provider? From LO for the version you are using?

And finally: please regard this post as CC BY (attribution). This means you are very welcome to copy this answer, but if you use a substantial amount of it please acknowledge me as Ask LibreOffice: @trueriver.

At present (2023-01-05) only option 0 is even hinted at in the online help for LO, and likewise in the built-in help. I am in the process of suggesting changes to that documentation which (if / when accepted) will become visible in the online help. You can track the progress of my suggestions here if you are interested.

…and I am also adding to the wiki. The page is currently (5th Jan) still a work in progress, but you can see it here (at least for now); it may move in future as I get more in tune with how the wiki is structured.

I always proceed in the same way: by creating an SQL script for the structure (tables, columns, constraints), which means doing everything by hand, after modelling the database of course. Then, if I have a lot of rows to insert, I might go through Calc and paste the copied columns into this online converter.

No need to create a CSV file beforehand; it works very well directly with the pasted columns from Calc.

And finally, importing the lines formatted as SQL statements.
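Such a converter is easy to approximate yourself. A hedged Python sketch (the table and column names are made up; it quotes every value as a string, which real code would refine per column type) that turns tab-separated rows, as pasted from Calc, into INSERT statements:

```python
def tsv_to_inserts(table: str, columns: list[str], tsv: str) -> list[str]:
    """Turn tab-separated rows (as pasted from Calc) into SQL INSERT statements.

    Every value is emitted as a quoted string literal with embedded single
    quotes doubled; adapt the quoting per column type for real use.
    """
    stmts = []
    cols = ", ".join('"%s"' % c for c in columns)
    for line in tsv.strip().splitlines():
        values = ", ".join("'" + v.replace("'", "''") + "'" for v in line.split("\t"))
        stmts.append('INSERT INTO "%s" (%s) VALUES (%s);' % (table, cols, values))
    return stmts

rows = "1\tDoe, John\t2022-12-25\n2\tO'Hara\t2023-01-05"
for stmt in tsv_to_inserts("SALES", ["ID", "NAME", "SOLD_ON"], rows):
    print(stmt)
```

The generated statements could then be run from Tools->SQL; note that this sketch does no type conversion at all, which is exactly the gap discussed below.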

The tool cannot convert strings into dates, times or decimals. Even if there were some artificial intelligence for this, how could it distinguish the 8th of November from the 11th of August when you throw 8/11/2022 at it? Some idiots write “military” date-times (special integers 20220811 or 234559) into csv. How do you prevent a part number or phone number 0049851669 being converted into an integer? Which tool can automagically distinguish whether 1,234 means one-thousand-two-hundred-thirty-four or a comma decimal? You have a well formed database with numeric IDs as VARCHAR, with dates being aware of days, years, months, weekdays and aggregations. Your database has well thought-out decimal types, integers, floating point decimals and memos. Then some idiot working for a bank throws this into your download folder:
00420815,12/31/1999,10:45 pm,"Doe, John","(1,234.98)"
There are far too many crazy flavours of csv out there. Calc can handle most of them properly. @trueriver imports localized text dates into Calc and wonders why the database does not accept “7/8/1999” as a date value although it looks exactly as stored in the csv file.
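That bank line is a good test case: it can only be read losslessly when the programmer decides every column’s type up front instead of letting a tool guess. A Python sketch (the meanings of the columns are assumptions read off the sample):

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

line = '00420815,12/31/1999,10:45 pm,"Doe, John","(1,234.98)"'

# csv handles the quoting, so the comma inside "Doe, John" is not a separator.
acct, date_s, time_s, name, amount_s = next(csv.reader(io.StringIO(line)))

record = {
    "account": acct,  # kept as text: the leading zero must survive
    "date": datetime.strptime(date_s, "%m/%d/%Y").date(),  # M/D/Y by decision, not guesswork
    "time": datetime.strptime(time_s, "%I:%M %p").time(),
    "name": name,
    # "(1,234.98)" is accountants' notation for a negative amount
    "amount": -Decimal(amount_s.strip("()").replace(",", "")),
}

print(record["account"])  # 00420815  (leading zero intact)
print(record["amount"])   # -1234.98
```

Every conversion here is a hard-coded decision; swap the country of the bank and the strptime patterns must change with it.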

Yes, I do wonder why, given that I had already told Calc at the import stage that I am using D/M/Y dates in that column. You seem to have forgotten point 2 in my post in this earlier thread, where I covered exactly that.

Public Request: Given the level of misunderstanding shown by you towards my posts in the last ten days I would prefer that you do not attempt to guess what I think, believe or wonder. Or at least if you make those guesses please do not share them in a public forum. The point you are making about this script is not improved by making a comment about me in passing, and especially so when your comment is inconsistent with what you have already read from me.

Likewise in Base there is an opportunity to choose date formats in Tools → Options → Language Settings → Language.

And turning to the script, it does not need anything of the complexity of an AI to load up a csv; it just needs an appropriate algorithm for the expected date formats, or the range of formats that can reasonably be expected. That is no different from the View on the text table in your preferred method, which likewise has to be hard coded. That could mean that a script is hard coded for one particular kind of input, or that there is some editable configuration variable in the script, or a prompt to the user.
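As a concrete illustration of “an appropriate algorithm for the expected formats”, here is a hedged Python sketch; the format list is hypothetical and is exactly the part that would be hard coded, configured, or prompted for per data source:

```python
from datetime import date, datetime

# Hypothetical list of the formats this particular feed is known to use.
# Order matters: an ambiguous value like 8/11/2022 is read with the FIRST
# matching format (here D/M/Y, i.e. 8 November), which is a decision, not a guess.
EXPECTED_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y"]

def parse_date(text: str) -> date:
    """Try each expected format in order; fail loudly instead of guessing."""
    for fmt in EXPECTED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"date {text!r} matches none of the expected formats")

print(parse_date("25/12/2022"))  # 2022-12-25
```

Failing loudly on an unexpected format (rather than silently importing text) is the whole point: the error surfaces before bad rows reach the database.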


If you had dates in that column, the import into Base would not fail.
menu View>Highlight Values [Ctrl+F8] highlights numbers, such as true dates, in blue.
=A2-A1 returns the difference between 2 valid dates in days.
=ISNUMBER(A1) returns TRUE if the cell contains a number (and a true date is a number).

I want to add to what I said in public. As Mike pointed out to you, the issue of the date format is only one of many things that can make the paste into Base fail, or leave you with results you did not intend.

For example, see the points I made about not having a PK for the paste (the behaviour mentioned may be improved in later releases: the info given is accurate and tested for LO 6.4).

The date format is only one of about a dozen points in my OP on that thread, but it is the only one you noticed.

Not only did your emphasis on just one point distract from the overall balance of the thread, but you even said that it was the most important one (or words to that effect).

Sorry, but most of the steps are important. Trying to paste into a table with an existing PK that is not already in the csv file, or one where rows in the file duplicate keys that already exist, produces error messages and can cause the upload to fail.

It is true that my workaround is simpler than it needs to be. You or I could figure out more than one way to upload data from an existing csv into an existing table. The reason I chose not to do that was that the intent was to make it as simple as possible for the relatively new user. As regards PKs, I think my recipe will work for any dataset.

You and I are both incorrect (on a pedantic level) if we assert that setting Date and D/M/Y (etc) in the csv upload will always work. Again I oversimplified there. There exist date formats which would fail – the military dates you mentioned are an excellent example. Again that was a deliberate choice on my part: those formats are sufficiently rare that I chose to leave users with that data to ask a question when and if it failed.

It was to get away from the counter-productive (in my opinion) complexification of your comments in the first thread that I started the second one. Despite clearly indicating that the new thread was for an overview of strategies rather than detail, you chose to import complexity into the new thread.

Can you now understand why I was disappointed by your posts to my second question?

If you understand that, would you be willing to try to understand the skill level needed by the questioner in the case of most questions, and the skill level being targeted by a post like mine, which was to try to share an overview of methods with an unskilled audience?

I hope you would be willing to do so. If so, for my part I will offer to give help, advice, and support to do so by pm rather than on the public forums.

It is not easy to write for someone with a smaller skill set than our own: that ability comes from my working life as a university lecturer (I am now retired). You can’t give a class of freshers (first-year students) the same lecture you would give to final-year students.

I am willing to coach you, behind the scenes, if you are willing to accept such coaching.

Even if you do not want such coaching behind the scenes, I still hope you will be willing to make the changes I suggest from your own efforts.

I hope this pm lands with you as it is intended: as an honest attempt to be helpful to you and to improve how helpful the forums are to newcomers.

Enjoy the rest of your day, whether you agree with me or not on any of the above :slight_smile:
River~~

This comment is an attempt to continue a misunderstanding from the same OP in the same earlier thread. There I had already said

It is inappropriate to try to continue an argument here that I walked away from on that earlier thread. But given that you chose to do so: exactly what part of that quote led you to think I did not already understand that? Your comment would have been inappropriate on the earlier thread.

Neither is it appropriate on this thread, where the title makes it clear that I am looking at overall strategies not the detail. I thought I had also made that clear in the OP here.

My observation is that you are, for the second time, ignoring an OP when responding to comments within a thread. I wonder how you think that will help the person who posted the OP? I wonder how that will help anyone arriving at this thread from a search on the internal search box or from a search engine.

My second personal request to you therefore is to try to keep all comments relevant to the title and the OP of whatever thread you are posting to. That helps to keep threads uncluttered.

This thread would have been better suited to the intended purpose had you not chosen to continue an argument that started elsewhere, and chosen to do so again after more than one person had commented about that.

In case it is not yet obvious, my hope was to create a concise summary from which people can choose which of the various strategies will best suit their particular needs; and yes indeed will best suit the particular data they have in their .csv.

There are four overall strategies for importing new data from .csv files into existing embedded HSQL:
1) via Calc: This applies to all databases with write access, including embedded HSQL and Firebird. It is doomed to fail if you are unfamiliar with the fundamental spreadsheet basics, which have been the same for four decades.
2) via SQL text tables: Does not apply to embedded Firebird, because FB does not know any character-separated text tables (only fixed-length text tables). It requires moderate capabilities in SQL. You write a Base query mapping the right types to each other, store it as a view, and every time the csv file changes you copy the view onto the target table.
3) via csv data source: You connect a second Base document to the csv source and copy from this Base document to the embedded database. I do not recommend this method, particularly when dates and numbers are involved. csv sources are OK for text-only tasks such as mail merge.
4) programmatically by your own freely chosen means. This does not involve any office suite.

Any strategy requires that you are able to follow the rules that are imposed by the specific flavour of csv and the rules that are imposed by the database structure. A spreadsheet involves an intermediate layer. A spreadsheet is a simplified programming language with two data types, text and numbers; Excel treats booleans as a third data type, and formulas return errors as an additional type.

Thank you very much for this post @Villeroy it provides me exactly the kind of feedback and suggestions I was hoping for :slight_smile:

As a result I have liked it :grinning:

That does not mean I agree totally with you: but I regard our remaining differences as honest professional differences, and I totally respect the points you make, including the ones on which we differ.

My main difference now is that from my perspective your version overdoes the doom and gloom stuff. I do not believe that the attempt is “doomed to fail” in the majority of cases where the user hasn’t got a clue about the character sets and so on, whereas you clearly do honestly think that.

The reason I disagree is that Calc and Base both use the same locale (set in Tools->Options->Language Settings->Languages) to decide how to interpret characters. Usually (and I agree, not always) that means that if the text fields look OK in Calc they will also look OK in Base. In addition, if the locale in LO is set to the same as the locale on the computer that generated the csv file, then there is a sporting chance that things like umlauts and cedillas will work the same in both systems.

It is not something I would bet my life on, but it is something where I feel comfortable encouraging the user to try it: if and only if it does not work then I would ask them to send in a sample of the file.

What I like most about your re-write of my OP is that it is more concise, and that even so you have given names to each of the four options. That makes discussing the options much easier than my approach of numbering them.

Thank you for this collaboration. Let’s do it again sometime…

Firebird 3.0 language reference
Page 127 - External tables
External Tables
The optional EXTERNAL [FILE] clause specifies that the table is stored outside the database in an external text file of fixed-length records. The columns of a table stored in an external file can be of any type except BLOB or ARRAY, although for most purposes, only columns of CHAR types would be useful.
All you can do with a table stored in an external file is insert new rows (INSERT) and query the data (SELECT). Updating existing data (UPDATE) and deleting rows (DELETE) are not possible.
…

What a difference a day makes!

I was pleasantly surprised – astonished even – to see a post from @Villeroy that I honestly welcomed.

See Overall strategy for importing new data in .csv files into existing embedded LO database - #25 by Villeroy and my reply one post later. When I woke up this morning I never guessed I would be writing so warmly to him before bedtime.