how to tokenize/parse/search&replace document by font AND font style [closed]

asked 2017-02-23 05:11:08 +0200

kaanchan

updated 2017-02-23 05:38:17 +0200

I need to update a bilingual dictionary written in Writer by first parsing every entry into its parts, e.g.:

  • main word (font 1, bold)
  • foreign equivalent transliterated (font 1, italic)
  • foreign equivalent (font 2, bold)
  • part of speech (font 1, italic)

Each line of the document is the main word followed by the other parts listed above, each separated by a space or punctuation.

I need to automate the process of walking through the whole file, line by line, and placing a delimiter between the parts, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.

I have tried the standard Search&Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I am not able to write a search query that says:

Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation

Replace: term found above + "delimiter"

Any suggestions on how I can write a script for this, or whether an existing tool can solve the problem?


Pseudocode for the desired effect:

var $delimiter = "|"

Go to beginning of document

While not end of document do:
     var $currLine = get next line from doc
     var $currChar = get next character which is not space or punctuation
     var $font = $currChar.font
     var $font_style = $currChar.font_style (e.g. bold, italic, normal)

     While not end of line do:
          $currChar = next character which is not space or punctuation

          if ($currChar.font != $font || $currChar.font_style != $font_style) { // font or style has changed
               print $delimiter

               $font = $currChar.font
               $font_style = $currChar.font_style
          }
     end While

end While
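
The pseudocode above can be sketched in plain Python, independent of Writer. This is only an illustration under assumptions: the input is a hypothetical list of (character, font, style) tuples describing one line, and `split_by_format` is a made-up name, not a LibreOffice API. Spaces and punctuation never start a new part, and they are trimmed from the edges of each part.

```python
import string

# Characters ignored when deciding where one part ends and the next begins.
SEPARATORS = set(string.whitespace + string.punctuation)


def split_by_format(chars, delimiter="|"):
    """Join runs of identically formatted characters with a delimiter.

    chars is a hypothetical list of (character, font, style) tuples
    describing one line of the document.
    """
    parts = []             # each part is a list of characters
    current_format = None  # (font, style) of the part being built
    for char, font, style in chars:
        if char in SEPARATORS:
            # Separators never trigger a split; keep them only inside a part.
            if parts:
                parts[-1].append(char)
            continue
        if (font, style) != current_format:
            parts.append([])  # font or style changed: start a new part
            current_format = (font, style)
        parts[-1].append(char)
    strip_chars = "".join(SEPARATORS)
    cleaned = ("".join(part).strip(strip_chars) for part in parts)
    return delimiter.join(part for part in cleaned if part)


# Made-up sample line: bold word, italic transliteration, second font, italic tag.
line = (
    [(c, "Font1", "bold") for c in "water"]
    + [(" ", "Font1", "normal")]
    + [(c, "Font1", "italic") for c in "pani,"]
    + [(" ", "Font1", "normal")]
    + [(c, "Font2", "bold") for c in "XY"]
    + [(c, "Font1", "italic") for c in " n."]
)
print(split_by_format(line))  # water|pani|XY|n
```

Separators inside a run of the same format are kept, so a multi-word entry such as "ice cream" in a single font and style would stay one part.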

Closed for the following reason: the question is answered, right answer was accepted by Alex Kemp
close date 2020-09-25 14:21:37.994648

1 Answer

answered 2017-02-23 09:28:13 +0200

karolus

updated 2017-02-23 09:52:11 +0200


The Python code below works for me (LO >= 5.2) with some example text similar to your description. It writes directly into the text file output.csv.

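def write_parts_to_textfile():
    # XSCRIPTCONTEXT is provided by LibreOffice when this runs as a macro.
    doc = XSCRIPTCONTEXT.getDocument()
    wtext = doc.Text
    strip_chars = ' ,.:;\t'  # separators trimmed from each part
    # utf-8 is an added assumption so that non-Latin text can be written.
    with open('output.csv', 'w', encoding='utf-8') as output:
        for para in wtext:  # one paragraph per dictionary entry
            # Each text portion is a run of uniformly formatted characters.
            parts = (part.String.strip(strip_chars) for part in para)
            line = '|'.join(p for p in parts if p)
            # Assumed completion: the posted snippet stops before the write.
            output.write(line + '\n')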
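
Iterating over a paragraph yields text portions, and Writer starts a new portion whenever any character attribute changes, not only the font or style (a change of size or language, or a bookmark, can also split a run). A minimal sketch, assuming you want to split strictly on font name, weight and posture, could merge neighbouring portions that agree on those three properties. `CharFontName`, `CharWeight` and `CharPosture` are the standard character properties for Western text (Asian and complex scripts use separate properties such as `CharFontNameComplex`); the helper names `format_key` and `merge_portions` are made up for this illustration, and the sketch is untested against a real dictionary file.

```python
STRIP_CHARS = ' ,.:;\t'  # same separators as in the macro above


def format_key(portion):
    """Return the attributes that define a part: font, weight, posture."""
    return (portion.CharFontName, portion.CharWeight, portion.CharPosture)


def merge_portions(portions, delimiter='|'):
    """Join consecutive portions that share a format_key into one part."""
    parts = []
    previous_key = None
    for portion in portions:
        text = portion.String
        if not text.strip(STRIP_CHARS):
            # Only separators (or empty): attach to the current part,
            # but never start a new part because of them.
            if parts:
                parts[-1] += text
            continue
        key = format_key(portion)
        if parts and key == previous_key:
            parts[-1] += text   # same font and style: extend the part
        else:
            parts.append(text)  # font or style changed: new part
        previous_key = key
    cleaned = (part.strip(STRIP_CHARS) for part in parts)
    return delimiter.join(part for part in cleaned if part)
```

Inside the macro, `line = merge_portions(para)` would then take the place of the generator expression. Keys are compared with `==` rather than hashed, since the posture value is a UNO enum object.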

