How do you prevent a tab-delimited .txt file from splitting a captured group across rows?

gwiz · December 25, 2017, 3:10pm

Hi. Can anyone help me out? I have a regex whose output I’m looking to export as a simple tab-delimited .txt file to LibreOffice. Some of the captured groups are being split into two rows when they shouldn’t be. Is there any way to stop this from happening?

If you copy and paste the first 10 numbered entries from the substitution section into LibreOffice Calc, you’ll get a clear understanding of what I’m trying to do. See here: regex101: build, test, and debug regex

You can see that numbers 1, 3, 7 and 8 are on two lines. Is there a way to keep each number on one row only, without breaking the English sentence into two rows?

jimk · December 26, 2017, 7:58pm

This is a well-written question, and I would upvote it except that it doesn’t really seem to be about LibreOffice. At least, I do not think the best way to solve it involves LibreOffice. It would be more appropriate for https://stackoverflow.com/.

jimk · December 27, 2017, 4:24am

Apparently, this is a follow-up question from php - How to add whitespace & punctuation marks to capture first group with regex? How to stop certain tabs dividing into two columns within LibreOffice? - Stack Overflow.

jimk · December 26, 2017, 7:55pm

It looks like the regex correctly matches the string. Now, write a substitute expression to remove the newlines from the “word” group. The following Python example illustrates this.

import re
instring = """1

the dictionary also had
useful phrases
"""
# This is just a simplification of your expression.  It's not very important.
regex = re.compile(r"""
^
(?P<frequency>[0-9]+) \W+
(?P<word>.+)
""", re.VERBOSE | re.MULTILINE | re.UNICODE | re.DOTALL)
matchobj = regex.search(instring)
print(matchobj.group("word"))

# This is the important part to answer your question.
print(re.sub(r"\n", " ", matchobj.group("word")))

Output:

the dictionary also had
useful phrases

the dictionary also had useful phrases

Write the result to the tab-delimited file for importing into Calc.
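
As a minimal sketch of that last step, the cleaned text could be written out with Python's csv module using a tab as the delimiter. The helper names flatten and write_rows are hypothetical and used only for illustration, and the sketch assumes each row is just a (frequency, word) pair.

```python
import csv
import re

def flatten(text):
    """Collapse each newline (and the whitespace around it) into one space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()

def write_rows(rows, path):
    """Write (frequency, word) pairs to path, one tab-separated line per pair.

    Hypothetical helper: the two-column layout is an assumption.
    """
    # newline="" stops Python from translating the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for frequency, word in rows:
            writer.writerow([frequency, flatten(word)])
```

Calling write_rows([("1", matchobj.group("word"))], "file.txt") should then produce a file in which each entry stays on a single line, so Calc reads it as a single row when ‘Tab’ is the selected separator.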

gwiz · December 27, 2017, 12:19pm

Hi Jim. Thanks for taking the time to answer my question. Just a follow-up to what you’ve stated.

While trying your best not to facepalm: where exactly do I write the substitute expression to remove the newlines from the “word” group?

It seems I’m exporting this wrong. Here are my steps:

.1. Copying everything below ‘Substitution’ (1 el, la art the (+m, f) el diccionario tenía…), pasting it into Atom, and saving it as file.txt (cont…)

gwiz · December 27, 2017, 12:21pm

.2. Next, opening LibreOffice Calc and opening file.txt, while making sure that ‘Tab’ is selected under ‘Separator Options’ in Calc’s import options.

Am I exporting this the wrong way? If you can explain the steps as simply as you can, I’d be grateful. Thanks.

jimk · December 28, 2017, 12:49am

Compared with your steps, my answer shows what should happen at step .0. What I’m suggesting is to use two regexes, as in my Python example, where the first is applied with regex.search and the second with re.sub. This is possible in any typical programming language, such as Python, Java, C++, C#, or VB.NET. I would think PHP could do it as well. However, perhaps I don’t fully understand the question. Are you indeed writing a server-side PHP script, as the tag suggests?
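
To make the two-regex idea a bit more concrete, here is a rough sketch that applies it to several entries at once. It rests on assumptions that may not hold for the real data: every entry starts with a number at the beginning of a line, no sentence line starts with a digit, and there are only two columns. The names ENTRY and to_tsv are made up for this example.

```python
import re

# First regex: find each numbered entry (a simplified, assumed pattern).
ENTRY = re.compile(
    r"""
    ^(?P<frequency>[0-9]+) \W+   # entry number at the start of a line
    (?P<word>.+?)                # everything up to the next entry
    (?=^[0-9]+\W | \Z)           # stop before the next number or at the end
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

def to_tsv(text):
    """Return one 'frequency<TAB>word' line per entry found in text."""
    lines = []
    for match in ENTRY.finditer(text):
        # Second regex: squeeze newlines and repeated spaces into one space.
        word = re.sub(r"\s+", " ", match.group("word")).strip()
        lines.append(match.group("frequency") + "\t" + word)
    return "\n".join(lines) + "\n"

sample = "1\nthe dictionary also had\nuseful phrases\n2 of the\nhouse\n3 and so on\n"
print(to_tsv(sample), end="")
```

The printed text is what would be saved as file.txt before opening it in Calc. Note that the second regex collapses every run of whitespace, so with more than two columns the tabs between columns would need to be inserted after this cleanup rather than before it.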

AlexKemp · February 10, 2021, 3:35pm