# How to tokenize/parse/search & replace a document by font AND font style

I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts, e.g.:

• main word (font 1, bold)
• foreign equivalent transliterated (font 1, italic)
• foreign equivalent (font 2, bold)
• part of speech (font 1, italic)

Each line of the document consists of the main word followed by the other parts listed above, each separated by a space or punctuation.

I need to automate the process of walking through the whole file, line by line, and placing a delimiter between each part, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.

I have tried the standard Search & Replace feature and the AltSearch extension, but neither can complete the task. The main problem is that I am not able to write a search query that says:

Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation

Replace: term found above + "delimiter"

Any suggestions on how I can write a script for this, or whether an existing tool can solve the problem?

Thanks!

Pseudo code for desired effect:

    var $delimiter = "|"

    Go to beginning of document

    While not end of document do:
        var $currLine = get line from doc
        var $currChar = get next character which is not space or punctuation
        var $font = $currChar.font
        var $font_style = $currChar.font_style   // e.g. bold, italic, normal

        While not end of line do:
            $currChar = next character which is not space or punctuation
            if ($currChar.font != $font || $currChar.font_style != $font_style) {
                // font or style has changed
                print $delimiter
                $font = $currChar.font
                $font_style = $currChar.font_style
            }
        end While
    end While
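
To make the intent concrete, here is roughly the same logic as a small Python sketch that runs outside Writer. It assumes a line has already been turned into a list of (character, font, style) tuples; how to read that information from a real document is not shown. The names `SEPARATORS`, `styled` and `split_by_format` are made up for this illustration.

```python
import string

# Characters that never start a new part (whitespace and ASCII punctuation).
SEPARATORS = set(string.whitespace + string.punctuation)


def styled(text, font, style):
    """Hypothetical helper: tag every character of text with a font and style."""
    return [(ch, font, style) for ch in text]


def split_by_format(chars, delimiter="|"):
    """Join runs of characters sharing the same (font, style) with a delimiter.

    chars is an iterable of (character, font, style) tuples for one line.
    Separators do not trigger a split and are stripped from the part edges.
    """
    parts = []      # finished parts
    current = []    # characters of the part being built
    fmt = None      # (font, style) of the part being built
    for ch, font, style in chars:
        if ch in SEPARATORS:
            # Keep separators only when they fall inside a part.
            if current:
                current.append(ch)
            continue
        if fmt is not None and (font, style) != fmt:
            # Font or style has changed: close the current part.
            parts.append("".join(current))
            current = []
        fmt = (font, style)
        current.append(ch)
    if current:
        parts.append("".join(current))
    strip_chars = "".join(SEPARATORS)
    return delimiter.join(part.strip(strip_chars) for part in parts)


if __name__ == "__main__":
    line = (styled("cat ", "Font1", "bold")
            + styled("neko, ", "Font1", "italic")
            + styled("猫 ", "Font2", "bold")
            + styled("n.", "Font1", "italic"))
    print(split_by_format(line))  # prints: cat|neko|猫|n
```

Note that the trailing full stop of "n." is dropped, which is what "ignoring punctuation" would mean if taken literally.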


Hello,

The Python code below works for me (LO >= 5.2) with some example text similar to your description. It writes directly to the text file output.csv.

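    def write_parts_to_textfile():
        # XSCRIPTCONTEXT is provided by LibreOffice when this runs as a macro
        doc = XSCRIPTCONTEXT.getDocument()
        wtext = doc.Text
        with open('output.csv', 'w') as output:
            # iterating the text yields paragraphs,
            # iterating a paragraph yields its text portions
            for para in wtext:
                # keep only portions that still contain text after
                # stripping spaces and punctuation from both ends
                line = '|'.join(part.String.strip(' ,.:;\t')
                                for part in para
                                if part.String.strip(' ,.:;\t'))
                output.write('{}\n'.format(line))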
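
One thing to watch for: a paragraph may be split into more portions than you expect, for example where some other attribute changes inside a part. A possible refinement, only as a sketch, is to merge neighbouring portions that share font name, weight and posture before joining. It assumes each portion exposes `String`, `CharFontName`, `CharWeight` and `CharPosture`, as Writer text portions do; `join_parts`, `format_key` and `fake_portion` are hypothetical names, and the fake portions below use plain stand-in values so the sketch can be tried outside LibreOffice.

```python
from types import SimpleNamespace

STRIP_CHARS = ' ,.:;\t'


def format_key(portion):
    """Font name, weight and posture identify a 'part' in this sketch."""
    return (portion.CharFontName, portion.CharWeight, portion.CharPosture)


def join_parts(portions, delimiter='|'):
    """Merge adjacent portions with the same format, then join the parts."""
    parts = []
    key = None  # format of the part currently being built
    for portion in portions:
        text = portion.String
        if not text.strip(STRIP_CHARS):
            # Separator-only portion: keep it inside the current part, if any,
            # regardless of how the separator itself is formatted.
            if parts:
                parts[-1] += text
            continue
        new_key = format_key(portion)
        if parts and new_key == key:
            parts[-1] += text      # same format: extend the current part
        else:
            parts.append(text)     # format changed: start a new part
        key = new_key
    return delimiter.join(part.strip(STRIP_CHARS) for part in parts)


def fake_portion(text, font, weight, posture):
    """Stand-in for a Writer text portion (values are placeholders)."""
    return SimpleNamespace(String=text, CharFontName=font,
                           CharWeight=weight, CharPosture=posture)


if __name__ == '__main__':
    portions = [
        fake_portion('cat', 'Font1', 150, 'NONE'),
        fake_portion(' ', 'Font1', 100, 'NONE'),
        fake_portion('ne', 'Font1', 100, 'ITALIC'),
        fake_portion('ko', 'Font1', 100, 'ITALIC'),
        fake_portion(', ', 'Font1', 100, 'NONE'),
        fake_portion('猫', 'Font2', 150, 'NONE'),
        fake_portion(' n.', 'Font1', 100, 'ITALIC'),
    ]
    print(join_parts(portions))  # prints: cat|neko|猫|n
```

Inside the macro you would pass each paragraph to `join_parts` in place of the `'|'.join(...)` expression; I have not tested that combination on a real dictionary file.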
