How to tokenize/parse/search&replace document by font AND font style

I need to update a bilingual dictionary written in Writer. The first step is to parse every entry into its parts, e.g.:

  • main word (font 1, bold)
  • foreign equivalent transliterated (font 1, italic)
  • foreign equivalent (font 2, bold)
  • part of speech (font 1, italic)

Each line of the document consists of the parts listed above, starting with the main word, each separated by a space or punctuation.

I need to automate the process of walking through the whole file, line by line, and placing a delimiter between the parts, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, “each part” is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.

I have tried the standard Search&Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I am not able to write a search query that says:

Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation

Replace: term found above + “delimiter”

Any suggestions on how I can write a script for this, or whether an existing tool can solve the problem?


Pseudo code for desired effect:

var $delimiter = "|"

Go to beginning of document

While not end of document do:
     var $currLine = get next line from doc
     var $currChar = get first character of $currLine which is not space or punctuation
     var $font = $currChar.font
     var $font_style = $currChar.font_style (e.g. bold, italic, normal)

     While not end of line do:
          $currChar = next character which is not space or punctuation

          if ($currChar.font != $font || $currChar.font_style != $font_style) { // font or style has changed
               print $delimiter
               $font = $currChar.font
               $font_style = $currChar.font_style
          }
     end While

end While
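
The pseudocode above can be sketched in plain Python, independent of Writer. This is only an illustration under assumptions: a line is modelled as a list of (character, font, font_style) tuples, the function name delimit_runs is made up for this example, and only ASCII punctuation and whitespace count as separators. Unlike the pseudocode, it keeps separators that sit inside a run, so multi-word parts stay intact.

```python
import string

# Characters treated as separators. ASCII-only is an assumption; extend as needed.
SEPARATORS = set(string.whitespace) | set(string.punctuation)


def delimit_runs(chars, delimiter="|"):
    """Join runs of equally formatted characters with a delimiter.

    `chars` is an iterable of (character, font, font_style) tuples,
    a hypothetical stand-in for one line of the document.
    """
    parts = []             # finished runs
    current = []           # characters of the run being built
    pending = []           # separators seen since the last real character
    current_format = None  # (font, font_style) of the current run

    for char, font, style in chars:
        if char in SEPARATORS:
            if current:
                pending.append(char)  # kept only if the same run continues
            continue
        if current and (font, style) != current_format:
            # Font or style changed: close the run, drop pending separators.
            parts.append("".join(current))
            current = []
        elif current:
            # Same formatting: separators belong inside the run.
            current.extend(pending)
        pending = []
        current.append(char)
        current_format = (font, style)

    if current:
        parts.append("".join(current))
    return delimiter.join(parts)
```

Feeding it one dictionary line, character by character with its formatting, returns the parts joined by "|", ready to be written to a file that Calc can import with "|" as the field separator.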


The Python code below works for me (LO >= 5.2) on some example text similar to your description. It writes the result directly to a text file, output.csv:

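def write_parts_to_textfile():
    # Runs as a LibreOffice Python macro: XSCRIPTCONTEXT is provided by the macro host.
    doc = XSCRIPTCONTEXT.getDocument()
    wtext = doc.Text
    # A relative path resolves against the working directory of the office process.
    with open('output.csv', 'w', encoding='utf-8') as output:
        for para in wtext:  # each paragraph of the document
            # Each text portion is a run of uniformly formatted characters;
            # strip surrounding spaces/punctuation and skip portions left empty.
            line = '|'.join(part.String.strip(' ,.:;\t')
                            for part in para
                            if part.String.strip(' ,.:;\t'))
            output.write(line + '\n')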
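
One possible refinement, as a hedged sketch rather than a tested macro: Writer may split a paragraph into portions for reasons other than a font change (for example a different language attribute or a bookmark; this is an assumption worth checking on your document). The helper below merges neighbouring portions that share font name, weight and posture, so only a real font or style change produces a delimiter. CharFontName, CharWeight and CharPosture are the standard character properties; the function names format_key and split_by_format are made up for this example.

```python
STRIP_CHARS = ' ,.:;\t'


def format_key(portion):
    """Font name plus style (weight and posture) of a text portion."""
    return (portion.CharFontName, portion.CharWeight, portion.CharPosture)


def split_by_format(portions, delimiter='|'):
    """Join the portions of one paragraph, delimiting only on font/style changes.

    `portions` is any iterable of objects with String, CharFontName,
    CharWeight and CharPosture attributes (e.g. a Writer paragraph).
    """
    groups = []  # list of [format_key, raw_text]
    for portion in portions:
        raw = portion.String
        if not raw.strip(STRIP_CHARS):
            # Only spaces/punctuation: never starts a new part. Keep it with
            # the previous part in case the same formatting continues.
            if groups:
                groups[-1][1] += raw
            continue
        key = format_key(portion)
        if groups and groups[-1][0] == key:
            groups[-1][1] += raw        # same font and style: same part
        else:
            groups.append([key, raw])   # font or style changed: new part
    # Trim separators left at the edges of each part.
    return delimiter.join(text.strip(STRIP_CHARS) for _, text in groups)
```

Inside the macro this would replace the join expression, e.g. line = split_by_format(para).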