How to tokenize/parse/search & replace a document by font AND font style

asked 2017-02-23 05:11:08 +0200

kaanchan

updated 2017-02-23 05:38:17 +0200

I need to update a bilingual dictionary written in Writer by first parsing each entry into its parts, e.g.:

  • main word (font 1, bold)
  • foreign equivalent transliterated (font 1, italic)
  • foreign equivalent (font 2, bold)
  • part of speech (font 1, italic)

Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.

I need to automate the process of walking through the whole file, line by line, and placing a delimiter between the parts, ignoring spaces and punctuation, so I can mass import the result into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.

I have tried the standard Search & Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I am not able to write a search query that says:

Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation

Replace: term found above + "delimiter"

Any suggestions on how I can write a script for this, or whether an existing tool can solve the problem?

Thanks!

Pseudo code for desired effect:

var $delimiter = "|"

Go to beginning of document

While not end of document do:
     var $currLine = get next line from doc
     var $currChar = get first character of $currLine which is not space or punctuation
     var $font = $currChar.font
     var $font_style = $currChar.font_style   // e.g. bold, italic, normal
     print $currChar

     While not end of $currLine do:
          $currChar = next character which is not space or punctuation

          if ($currChar.font != $font || $currChar.font_style != $font_style) {   // font or style has changed
               print $delimiter

               $font = $currChar.font
               $font_style = $currChar.font_style
          }
          print $currChar
     end While

end While
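
The pseudocode above can be tried outside Writer as plain Python. This is only a sketch: it assumes each line has already been extracted as a list of (character, font, style) tuples, which is a hypothetical in-memory model and not a Writer API. The names `split_by_format`, `styled`, and `SKIP`, as well as the sample line, are made up for illustration.

```python
import string

# Characters that never belong to a part. This set is an assumption;
# extend it to match the punctuation used in the dictionary.
SKIP = set(string.whitespace) | set(string.punctuation)


def split_by_format(chars, delimiter="|"):
    """Join runs of characters sharing (font, style) with a delimiter.

    `chars` is a list of (character, font, style) tuples for one line.
    """
    parts = []          # parts collected so far, in document order
    current_key = None  # (font, style) of the part being built
    for ch, font, style in chars:
        if ch in SKIP:
            continue  # spaces and punctuation are dropped, as in the pseudocode
        key = (font, style)
        if key != current_key:
            parts.append("")  # font or style changed: start a new part
            current_key = key
        parts[-1] += ch
    return delimiter.join(parts)


def styled(text, font, style):
    """Hypothetical helper for building sample input: tag every character."""
    return [(ch, font, style) for ch in text]


# Made-up sample entry following the layout described in the question.
line = (
    styled("house ", "Font1", "bold")
    + styled("ghar, ", "Font1", "italic")
    + styled("XYZ ", "Font2", "bold")
    + styled("n.", "Font1", "italic")
)
print(split_by_format(line))  # house|ghar|XYZ|n
```

Note that, like the pseudocode, this drops spaces inside a part as well, so a two-word entry in one format would be run together. If that matters, spaces could be kept inside a run and stripped only at its ends.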

1 Answer

answered 2017-02-23 09:28:13 +0200

karolus

updated 2017-02-23 09:52:11 +0200

Hello

The Python code below works for me (LO >= 5.2) with some example text similar to your description. It relies on Writer splitting each paragraph into text portions of uniform formatting, and it writes directly to the text file output.csv.

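def write_parts_to_textfile():
    """Write every paragraph to output.csv, one '|'-separated line each."""
    doc = XSCRIPTCONTEXT.getDocument()
    strip_chars = ' ,.:;\t'
    # utf-8 so that the foreign-script column survives on any platform
    with open('output.csv', 'w', encoding='utf-8') as output:
        for para in doc.Text:
            # the enumeration also yields tables, which have no text portions
            if not para.supportsService('com.sun.star.text.Paragraph'):
                continue
            # each portion is a run of identically formatted characters
            parts = (part.String.strip(strip_chars) for part in para)
            output.write('{}\n'.format('|'.join(p for p in parts if p)))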
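
Writer may split a paragraph into more text portions than there are font or style changes (for example at a bookmark or a change of language), so some entries could end up with extra delimiters. A possible refinement, sketched below under assumptions, merges adjacent portions whose font name, weight, and posture match before joining. The function works on plain tuples so it can be tested without an office instance; inside a macro the tuples would presumably come from each portion's `String`, `CharFontName`, `CharWeight`, and `CharPosture` properties. `merge_portions` and `STRIP_CHARS` are illustrative names and the sample data is made up.

```python
STRIP_CHARS = ' ,.:;\t'  # the same characters stripped in the code above


def merge_portions(portions, delimiter='|'):
    """Merge adjacent portions with equal format, then join with a delimiter.

    `portions` is an iterable of (text, font, weight, posture) tuples.
    """
    groups = []   # list of [format_key, raw_text], in document order
    pending = ''  # separator text seen since the last real portion
    for text, font, weight, posture in portions:
        if not text.strip(STRIP_CHARS):
            pending += text  # only spaces/punctuation: its format is ignored
            continue
        key = (font, weight, posture)
        if groups and groups[-1][0] == key:
            groups[-1][1] += pending + text  # same format: extend the part
        else:
            groups.append([key, text])       # format changed: new part
        pending = ''
    return delimiter.join(raw.strip(STRIP_CHARS) for _, raw in groups)


# Assumed sample data: 150.0 / 100.0 mirror the bold / normal font weights,
# and the posture strings stand in for the slant values of real portions.
sample = [
    ('house ', 'Font1', 150.0, 'NONE'),
    ('', 'Font1', 150.0, 'NONE'),        # e.g. an empty bookmark portion
    ('boat', 'Font1', 150.0, 'NONE'),
    (', ', 'Font1', 100.0, 'NONE'),
    ('ghar ', 'Font1', 100.0, 'ITALIC'),
    ('XYZ', 'Font2', 150.0, 'NONE'),
    (' n.', 'Font1', 100.0, 'ITALIC'),
]
print(merge_portions(sample))  # house boat|ghar|XYZ|n
```

Unlike a character-by-character walk, this keeps the space inside a multi-word part and strips separators only at the ends of each merged part.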