Sorting words that need to go into 2 columns in doc or calc but aren't separated by comma between greek and english translation

mikekaganski · April 14, 2019, 12:04pm

Finally … and the question is?

Oh, I have finally been able to find something like a question there:

but is there a suggested formula for doing this please as I have a lot of vocab from my book to sort like this and the bit I have pasted above is a small example

… but it took a hard time, and finding the question didn’t actually help me understand the issue. Apparently the software has the same problem as OP: neither OP nor the software OP uses seems to suspect that commas and other punctuation marks exist.

goonhilly · April 14, 2019, 4:22pm

I meant that I had scanned the page with an iPhone photo app that turned the photo of the text in the book into a “text file”. I pasted that into a doc/text file and then tried to get it into columns, but ended up manually putting in commas between the Greek letters and the English and separating them manually. This was taking too long, and with potentially 2500 words it is a long exercise.

goonhilly · April 14, 2019, 4:34pm

I was using the simple data formula but thought that there might be a better way to sort the jumble of Greek and English letters.
For example, from the text file created by the text scan app I get, from the snippet above, “το γραφείο desk”, i.e. the Greek word for desk is given first. I then manually inserted a comma between the Greek word and the word for desk and was then able to sort into 2 columns: the 1st with the Greek combination το γραφείο, and the 2nd, lined up with each Greek word, with the English word.
The text file created by the scan reader accurately shows the commas between some of the English words where the Greek word can have more than 1 meaning, so I then had to manually remove those commas and put in a single comma between the Greek and English words. I suppose I could have left the commas separating some of the combinations of English words and inserted a dash/hyphen etc., but it was still slow going, so I thought I need to find some formula that puts in a separator after the Greek.

Cookievore · April 14, 2019, 6:57pm

A formula will only be able to separate and process a data set with a defined structure and criteria for field, field separator and end of data set. If your data contains commas, commas are disqualified as field separator (unless you enclose your fields in, let’s say, quotation marks), particularly as the count of commas varies from data set to data set. Choose something else, e.g. a semicolon, as field separator. You’ll need something to mark the end of the data set, e.g. # in case there’s no CR/LF. I’d assume that this could be done in three steps. First the “export” (really no other way than photo and OCR process? weird …), second preparing the ASCII file, third importing the ASCII file as CSV and processing the strings with a “formula” in LO Calc: splitting them into separate columns.

But I’m still not able to extract a simple example line out of what you write. Therefore it’s still hard to boil it down to your problem. It probably takes longer to understand than to solve ,-)
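
As an illustration of the separator point above, here is a minimal Python sketch (not part of the original thread). The helper name `to_csv_line` is hypothetical. It shows that a field containing commas must be quoted when the comma is also the field separator, and that choosing a semicolon avoids the clash.

```python
import csv
import io

def to_csv_line(greek: str, english: str, delimiter: str = ",") -> str:
    """Write one Greek/English pair as a single delimited line (hypothetical helper)."""
    buf = io.StringIO()
    # csv.writer quotes a field only when it contains the delimiter (default quoting).
    csv.writer(buf, delimiter=delimiter).writerow([greek, english])
    return buf.getvalue().rstrip("\r\n")

# With a comma separator the English field has to be wrapped in quotation marks.
print(to_csv_line("δύσκολος", "difficult, fussy"))
# With a semicolon separator the commas inside the field are harmless.
print(to_csv_line("δύσκολος", "difficult, fussy", delimiter=";"))
```

Either line can then be imported as CSV, provided the import dialog is given the same separator.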

mikekaganski · April 14, 2019, 7:08pm

In the meantime, note that regular expressions allow using Unicode character properties, like “being Letter” or “being Greek script”, and even combining these properties to build complex expressions. Possibly something to look into here.

goonhilly · April 15, 2019, 9:58am

thanks for ideas

goonhilly · April 15, 2019, 9:59am

I will look at that and get my wife input as she is good at German. Thanks for updates

Lupp · April 14, 2019, 9:08pm

Based on what @mikekaganski already said.
For a start: see attachment.
Concerning more sophisticated extended solutions, bug tdf#76481 may be annoying.

(This is not helpful if the script of both languages is :Latin: . However, there was this related question in the German /de branch where one of the languages was represented in Italic style. This would allow a similar solution.)

===EDIT1 2019-04-16 13:13 UTC===
Attachment announced in my sixth comment below.
The solution therein can only work in LibO V6.2 or higher because it makes use of the newly implemented Calc function REGEX(). (Thanks to erAck.)

goonhilly · April 15, 2019, 12:11pm

I thought that the “find replace” solution was going to be best but I carefully put the function line of ([:script=Greek:])([:script=Latin:]) in the “Find” box but then I cannot read what is in the Replace - is it S1|S2 or is it a dollar $ sign?
It is very faint but I cannot get it to work as I am possibly having an issue with the lettering albeit that would be odd as I carefully used the attachment odt doc kindly provided so if you have a moment could you confirm what I am doing as I could not find the particular script term in the UCI lists.

I also had a go with the German Cattribs2Html function after I remembered to switch off macro security and got it to copy the text into the HTML column but it would not work in the TRIM columns and just got a “Value!” warning so again thought it maybe something to do with the greek letters I am typing as I use Greek Polytonic Keyboard Switch from ENG UK?

My view is that the Find & R is the best for me as it ignores other char. But some help…

Lupp · April 15, 2019, 4:20pm

$1| $2 Dollar1 Pipe(vertical line) Space Dollar2
You also missed the space behind Greek:]
You didn’t mention your version of LibO. Newer features of the (ICU) RegEx engine may only work in recent versions of LibO.
Never “switch off” macro security! You may choose medium level and permit macros from a trusted source. But you shouldn’t run macros if there isn’t even a basic understanding for them.
There was no “German function”, only a link to a forum in German language.
My hint leading you to an example containing a macro function of mine wasn’t meant to help in your specific case. It was “just for completeness”. The script language isn’t treated by that function and has no standard HTML representation afaik. The #VALUE! error you got was due to the fact that there wasn’t a single Italic (slanted) letter in the text.

goonhilly · April 15, 2019, 5:27pm

Hi I have got macro set for medium and that was just a loose expression on my part but thanks.
I had some limited results following your “space” correction highlighting my poor glasses!
My find: expression has been corrected to this:- ([:script=Greek: ])([:script=Latin:]) and in Replace $1| $2
I tried both 1 space and 2 spaces after the pipe or vertical line entry BUT each time I got this and a limited run the result is as follows;
The pipe line is going in the wrong place and I am at a loss to know what to change as my greek words are typed in Greek letter on Polytonic and I am using LO 6.2.2.2-64 on MS Windows10
άλλος other
Γενικάg| e nerally
δικό τουςt| h eirs
μικροβιολογικός microbiological
τα έπιπλα furniture
το εργαστή laboratory
το ισόγειοg| r ound floor
μοντέρνος modern
ξεχωριστόςs| e parate
όπου wherei| n which
κάθεe| a che| v ery

I get gaps and pipe going in after the first English letter or missing completely?
Reg Exp and Dia Sen both ticked can you see any

Lupp · April 15, 2019, 6:06pm

The space needs to be after the closing square bracket.
If you cannot assure that there is exactly one space between the last Greek word and the first English word, you need to use “ +” (space followed by +) instead of the single space.
If you don’t learn about the basics concerning regular expressions you might better not use them.
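
For readers who want to try the same idea outside LibreOffice, here is a rough Python sketch of the Find & Replace described above: a Greek character, one or more spaces, then a Latin letter, replaced by the two captured characters with “| ” between them. Python’s built-in re module has no script property, so the Greek class is approximated here by two Unicode blocks. That approximation and the helper name `insert_pipe` are assumptions of this sketch, not something from the thread.

```python
import re

# Assumption: Greek script is approximated by the blocks "Greek and Coptic"
# (U+0370-U+03FF) and "Greek Extended" (U+1F00-U+1FFF, used for polytonic text).
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# One Greek character, one or more spaces (" +"), then one Latin letter.
BOUNDARY = re.compile(rf"([{GREEK}]) +([A-Za-z])")

def insert_pipe(line: str) -> str:
    """Insert '| ' at every Greek-to-Latin boundary (hypothetical helper)."""
    # \1 and \2 play the role of $1 and $2 in the LibreOffice Replace box.
    return BOUNDARY.sub(r"\1| \2", line)

print(insert_pipe("το γραφείο desk"))
print(insert_pipe("κάθε   each every"))
```

Because the pattern uses “ +”, a run of several spaces between the Greek and the English collapses to the single space that follows the pipe.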

goonhilly · April 16, 2019, 5:22am

OK but where is the best page link to learn about the spacing, as it appears to me that I might be getting the spaces in the wrong order as follows ([:script=Greek: ])([:script=Latin:]) as I was advised to leave a space “in front of the square bracket” of the first part of the expression, i.e. ([:script=Greek: ])?
You refer to exactly one space and ensuring that there is one space but that means surely back to square 1 and manually going through and altering or checking spaces on each row of text ???

goonhilly · April 16, 2019, 6:18am

Hi I got up early and “cleaned my glasses”- don’t know where it went wrong but I GOT IT TO WORK and have to thank you guys for your patience! Anyway I would still like to know where I should start on learning the basics of making such entries as ([:script=Greek:] )([:script=Latin:]) -note I got the space after ].
As I got another issue but that one is for me to resolve as I can hear the sound of stampede- oh no not him again!
No seriously thanks for this support and I have now taken out annual subs to LO.

Lupp · April 16, 2019, 11:37am

I am not an expert in Regex. If I need to find something I’m not sure about, I mostly start with this link. It leads to a very “complete” guide to Regex, and it sometimes isn’t easy to find the right page.
LibreOffice makes use of a free and open third-party Regex engine by ICU. @mikekaganski already pointed to one of their root pages on Regex in his second comment on the original question, but it isn’t exactly a tutorial.
There are many tutorials about Regex on the web and you may find one suiting you better. Anyway, you need to understand that there are different “dialects” and often more than one way to invoke the same functionality. To restrict a part of a search expression to characters in Greek script, e.g., you may write [:script=Greek:]+ reminding you of the concept of character classes or (synonymously) \p{Greek} emphasizing the concept of Unicode properties.

Lupp · April 16, 2019, 11:45am

A specific Regex engine may support both ways or just one of them (and probably even none of them). In the case under discussion ICU supports both approaches.
If you want to understand Regex in depth you should be ready to spend a substantial amount of time on learning and experimenting. Have a lot of fun!
(Historically RegEx is an invention by mathematicians / theoreticians on formal languages. It was extended, enhanced, and partly specialized aiming at the usage we have to cope with here.)
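
To illustrate the “dialects” point with an engine that supports neither spelling: Python’s built-in re module rejects \p{Greek} and treats [:script=Greek:] as an ordinary character set (the third-party regex package does accept \p{Greek}). The sketch below is a stand-in that uses Unicode character names instead. Treating “name starts with GREEK” as “is in Greek script” is a heuristic assumed for this example, and both function names are hypothetical.

```python
import unicodedata

def is_greek(ch: str) -> bool:
    """Heuristic script test for a single character (hypothetical helper)."""
    # Names look like "GREEK SMALL LETTER ALPHA WITH TONOS";
    # characters without a name fall back to "".
    return unicodedata.name(ch, "").startswith("GREEK")

def is_latin(ch: str) -> bool:
    """Same heuristic for Latin letters, e.g. "LATIN SMALL LETTER D"."""
    return unicodedata.name(ch, "").startswith("LATIN")

def english_part(line: str) -> str:
    """Return the text from the first Latin letter onwards, or '' if there is none."""
    for i, ch in enumerate(line):
        if is_latin(ch):
            return line[i:]
    return ""

print(english_part("το γραφείο desk"))
```

This is slower than a regular expression and is only meant to show what the script property expresses, one character at a time.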

goonhilly · April 16, 2019, 12:18pm

Thanks for this very helpful link that I have had a look at. I realise that I cannot expect answers on a plate and, as I confirmed, I managed to get it to work, so I am grateful to all that commented. I was possibly tired, as when I retried early this morning it worked and I even managed to sort out the clump of text that my OCR scanner produces via my iPhone. In other words I can auto insert the pipe delimiter and then press enter on each new Greek word and get it into a single column that is then easy to sort with the pipe.
I had previously spent time endeavouring to find an app/program/software that would recognise the Greek letters and preserve the columns in the book of the Greek course I am learning from, but the only one I found was what I ended up with. I noted that someone commented weird OCR scan but that was what I had to work with.
It takes a photo of a neatly arranged page where there are 4 columns of text i.e 2 groups of Gr-En Gr En and produces a lump of text that is OCR’d but needs arrange

Lupp · April 16, 2019, 12:29pm

You may send a typical one of your photographs to the email account you should find in my user info. I would probably play with it to find a good way in your sense. However, I cannot spend much time on it at the moment.

goonhilly · April 16, 2019, 12:31pm

What I have done is progressed with getting on with my project using what I have learnt thus far. I end up with the pipe inserted between the Greek and English pairs and I can plod through to then produce a single column in an ODT doc to sort into 2 cell columns, the 1st with the Greek and the 2nd with the English, that is then imported into Anki flash cards.
I need to look at seeing if I can get around the manual part of pressing enter before the start of each of the Greek words to get them into a single column, so that when copied over to an “ods” doc I can sort it with the data text to columns function in “calc”
περασμένος - η - ο | last άλλωστε | besides ανδρ κός - ή - o men 's αποφεύγω 1 avoid γνωστός - ή - o well known γυναιχείος - α - ο | women 's δύσκολος - η - ο | difficult , fussy τα ρούχα | clothes

So there must be a way, once I have got the first pipe inserted, to automate the manual press of the “return key” before each Greek word and so avoid manually getting it into a single column initially?
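
One possible way to automate that last step, sketched in Python rather than in LibreOffice: treat every place where a Latin letter is followed by spaces and then a Greek letter as the start of a new entry, break the lump there, and split each entry at the pipe. The Greek class is again approximated by two Unicode blocks, and the name `split_entries` is hypothetical. Both are assumptions of this sketch, and it has not been checked against the full 2500-word list.

```python
import re

# Assumption: Greek script approximated by the "Greek and Coptic" and
# "Greek Extended" blocks, as Python's re module has no script property.
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# Spaces that sit between a Latin letter (end of the English part)
# and a Greek letter (start of the next entry).
ENTRY_BREAK = re.compile(rf"(?<=[A-Za-z]) +(?=[{GREEK}])")

def split_entries(lump: str) -> list[tuple[str, str]]:
    """Turn one OCR lump into (greek, english) pairs (hypothetical helper)."""
    pairs = []
    for line in ENTRY_BREAK.sub("\n", lump).split("\n"):
        # Everything before the first pipe is Greek, everything after is English.
        greek, _, english = line.partition("|")
        pairs.append((greek.strip(), english.strip()))
    return pairs

for greek, english in split_entries("τα ρούχα | clothes το γραφείο | desk"):
    print(greek, "->", english)
```

Entries where the pipe is missing or where the OCR produced a Latin letter inside the Greek part (as in some items of the sample above) will still come out wrong and need a manual check. In Writer, a similar Find & Replace with the Latin and Greek script classes swapped and \n in the Replace box might achieve the same, but that is untested here.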