How to tokenize/parse/search&replace document by font AND font style

kaanchan · February 23, 2017, 4:11am

I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts, e.g.:

main word (font 1, bold)
foreign equivalent transliterated (font 1, italic)
foreign equivalent (font 2, bold)
part of speech (font 1, italic)

Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.

I need to automate the process of walking through the whole file, line by line, and placing a delimiter between each part, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, “each part” is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.

I have tried the standard Search&Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I cannot write a search query that says:

Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation

Replace: term found above + “delimiter”

Any suggestions on how I could write a script for this, or whether an existing tool can solve the problem?

Thanks!

Pseudocode for desired effect:

var $delimiter = "|"

Go to beginning of document

While not end of document do:
     var $currLine = get next line from doc
     var $currChar = get first character of $currLine which is not space or punctuation
     var $font = $currChar.font
     var $font_style = $currChar.font_style   // e.g. bold, italic, normal

     While not end of $currLine do:
          $currChar = next character which is not space or punctuation

          if ($currChar.font != $font || $currChar.font_style != $font_style) {   // font or style has changed
               print $delimiter

               $font = $currChar.font
               $font_style = $currChar.font_style
          }
     end While

end While

karolus · February 23, 2017, 8:28am

Hello

The Python code below works for me (LO >= 5.2) with some example text similar to your description. It writes directly into the text file output.csv:

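def write_parts_to_textfile():
    # Characters trimmed from both ends of every text portion.
    separators = ' ,.:;\t'
    doc = XSCRIPTCONTEXT.getDocument()
    wtext = doc.Text
    # UTF-8 so that non-Latin entries are written the same way on every platform.
    with open('output.csv', 'w', encoding='utf-8') as output:
        for para in wtext:
            # Each text portion of a paragraph is a run of uniformly
            # formatted characters; empty results are skipped.
            parts = (part.String.strip(separators) for part in para)
            line = '|'.join(p for p in parts if p)
            output.write('{}\n'.format(line))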
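
The macro above starts a new field at every text portion, so a part that is interrupted by a differently formatted space or comma would be split. If that happens in your file, one possible refinement is to compare font and style explicitly and merge neighbouring portions that match, which is closer to the pseudocode in the question. The following is only a sketch under assumptions: it assumes each portion exposes the character properties CharFontName, CharWeight and CharPosture and that their values can be compared with ==. The names split_parts, format_key, run and IGNORED are made up for illustration, and the stub objects stand in for real text portions so the logic can be tried outside LibreOffice.

```python
import string
from types import SimpleNamespace

# Characters that never start a new part on their own (assumed ASCII set).
IGNORED = set(string.whitespace + string.punctuation)


def format_key(portion):
    """Return the (font, weight, posture) triple that identifies a part."""
    return (portion.CharFontName, portion.CharWeight, portion.CharPosture)


def split_parts(portions, delimiter='|'):
    """Join the parts of one paragraph, merging runs with equal formatting."""
    parts = []        # collected text of each part
    last_key = None   # formatting of the part currently being built
    for portion in portions:
        text = portion.String
        if all(ch in IGNORED for ch in text):
            # Only spaces/punctuation: keep it inside the current part,
            # whatever its own formatting is, and never split on it.
            if parts:
                parts[-1] += text
            continue
        key = format_key(portion)
        if key == last_key:
            parts[-1] += text      # same font and style: same part
        else:
            parts.append(text)     # font or style changed: new part
            last_key = key
    strip_chars = ''.join(IGNORED)
    return delimiter.join(p.strip(strip_chars) for p in parts)


def run(text, font, weight=100.0, posture='NONE'):
    """Hypothetical stand-in for a Writer text portion, for testing only."""
    return SimpleNamespace(String=text, CharFontName=font,
                           CharWeight=weight, CharPosture=posture)


sample = [
    run('house', 'Font1', weight=150.0),       # main word, bold
    run(' ', 'Font1'),
    run('maison', 'Font1', posture='ITALIC'),  # transliteration, italic
    run(', ', 'Font1'),
    run('casa', 'Font2', weight=150.0),        # foreign word, font 2 bold
    run(' ', 'Font1'),
    run('noun', 'Font1', posture='ITALIC'),    # part of speech, italic
]
print(split_parts(sample))  # prints house|maison|casa|noun
```

Inside the macro, the idea would be to replace the join expression with split_parts(para). I have not tried that against a real document, so treat it as a starting point and adjust IGNORED if your entries use non-ASCII punctuation.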

AlexKemp · September 25, 2020, 12:21pm