Anonymising a text using spaCy in Python

I’m a translator. I work on confidential documents, so I can’t use cloud-based processing without anonymising them first. I would like to use the Python built into LibreOffice Writer, together with the NER library spaCy, to detect and process proper names. I’m a beginner in Python and have only ever written simple standalone programs, although I have extensive experience in other languages (mainly Object Pascal, in the Delphi variant). I would need a few pointers to get started using Python inside LibreOffice Writer.

LibreOffice does not use the cloud unless you save to it.

I know this. But as a translator I don’t use only LibreOffice, but also translation software, some of which (like AI services and some CAT tools) operates in the cloud. This is why I need to anonymise documents before using such services. Managing the translation in LibreOffice is the reason why I wish to perform the anonymisation step within LibreOffice. Hence my request for assistance in setting up and using a Python library designed to assist in the anonymisation process.

I have not done this myself, so I can only give some general hints. The extension APSO helps with managing Python scripts. If it is not already installed, search for it on the extensions site, but read the rest before installing.
APSO was not designed to attach big Python projects to LibreOffice, so it does not help with loading modules, and it does not include pip (which is also missing from the Python that LibreOffice bundles on Windows). So while you can use Python, you may have trouble using an external library from a macro.
How to work around this:

  • If you are on Linux with a regular install of LibreOffice, it may use the Python which came with the OS. Then you have pip and only need to find out how to load the module. But even on Linux, this may be different inside Snap/Flatpak/AppImage…

  • I read about a new approach, which includes pip in APSO. Start here:
    Apache OpenOffice Community Forum - APSO : new version with pip package manager - (View topic)

  • Other extensions to access Pythonpath
    or use pip
    https://extensions.libreoffice.org/en/extensions/show/41996

  • Instead of extending the internal Python, one may consider using an external program, which communicates with LibreOffice over a port. So you have your “usual” environment for Python but can still process files in LibreOffice… (I read that some use Jupyter as a shell…)
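
For the last option, a minimal sketch of an external Python script talking to a running LibreOffice over a socket might look like the following. It assumes LibreOffice was started with an --accept option (shown in the comment), that port 2002 is free, and that the external Python can import the uno module (on Linux this usually means the distribution's python3-uno package or similar). The function names uno_url, connect and current_text are made up for this illustration.

```python
# Start LibreOffice first, listening on a local port, for example:
#   soffice --accept="socket,host=localhost,port=2002;urp;"
# Host and port are assumptions; pick whatever suits your machine.


def uno_url(host="localhost", port=2002):
    """Build the UNO connection string for a socket connection."""
    return f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext"


def connect(host="localhost", port=2002):
    """Return the Desktop object of the running LibreOffice instance."""
    import uno  # imported lazily: only available where the UNO bridge is installed

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx
    )
    remote_ctx = resolver.resolve(uno_url(host, port))
    return remote_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", remote_ctx
    )


def current_text(desktop):
    """Return the body text of the document that currently has focus (Writer only)."""
    doc = desktop.getCurrentComponent()
    return doc.getText().getString()


if __name__ == "__main__":
    print(current_text(connect()))
```

With this setup, spaCy lives in the external Python environment where pip works normally, and only the document text travels over the local socket.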

What is that anonymising thing about? Is it about replacing real person names with alias names temporarily and back to real names when work is done?

This is exactly the idea. Replace all occurrences of Mickey Mouse with Name1, all occurrences of Minnie Mouse with Name2, all occurrences of 27-12-1998 with Date1, etc. Then use AI or other cloud tools to do whatever I need to do (e.g. translate), and then replace the dummies with the original strings again. Although I use only cloud tools that guarantee confidentiality, not all my clients believe this, so I prove to them that I don’t send confidential information online. I currently do this by hand, but if I can find a way to do it fast, I’ll use it. One of the options I’m examining is the Python functionality of LibreOffice.
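
The replace-and-restore round trip described here could be sketched in plain Python roughly as follows. This is only an illustration under some assumptions: the PREFIXES table and all function names are invented for the example, the entity labels PERSON, DATE and ORG are the ones used by spaCy's English models (models for other languages may use different labels), en_core_web_sm is just one possible model, and the date pattern only covers the DD-MM-YYYY form from the example above.

```python
import re
from collections import defaultdict

# Hypothetical table: entity label -> placeholder prefix.
PREFIXES = {"PERSON": "Name", "DATE": "Date", "ORG": "Org"}

# Assumed date format: DD-MM-YYYY only.
DATE_RE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")


def spacy_spans(text, model="en_core_web_sm"):
    """Return (start, end, label) tuples from spaCy NER. Needs spaCy and the model installed."""
    import spacy  # imported lazily so the other functions work without spaCy

    nlp = spacy.load(model)
    return [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]


def date_spans(text):
    """Return (start, end, "DATE") tuples found with the regular expression."""
    return [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]


def pseudonymise(text, spans):
    """Replace every span by a placeholder; return (new_text, {placeholder: original})."""
    counters = defaultdict(int)
    by_original = {}  # original string -> placeholder, so repeats reuse the same dummy
    pieces = []
    pos = 0
    for start, end, label in sorted(spans):
        if label not in PREFIXES or start < pos:
            continue  # skip labels we do not handle and spans overlapping an earlier one
        original = text[start:end]
        if original not in by_original:
            counters[label] += 1
            by_original[original] = f"{PREFIXES[label]}{counters[label]}"
        pieces.append(text[pos:start])
        pieces.append(by_original[original])
        pos = end
    pieces.append(text[pos:])
    mapping = {placeholder: original for original, placeholder in by_original.items()}
    return "".join(pieces), mapping


def restore(text, mapping):
    """Put the original strings back in a single pass."""
    if not mapping:
        return text
    # Longest placeholders first, so Name10 is not mistaken for Name1 followed by 0.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Typical use would be masked, mapping = pseudonymise(text, spacy_spans(text) + date_spans(text)), then restore(translated, mapping) once the text comes back. Two caveats: a placeholder such as Name1 must not already occur in the document, and a translation tool may inflect or alter the placeholders, in which case the restore step will miss them.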

Names and dates can be identified by means of regular expressions or even the most basic local AI. Do the replacements in plain text and generate another plain-text listing of what has been replaced with what. This does not require any office suite so far.
A macro to do the same replacements in an office document based on the replacements file is trivial, even if you write it in StarBasic.
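
As a rough sketch of such a macro in Python rather than StarBasic: the replacements file is assumed here to be tab-separated with one "original<TAB>placeholder" pair per line, and the file path and function names are hypothetical. The document calls (createReplaceDescriptor, replaceAll) are the standard search-and-replace interface of Writer documents, and XSCRIPTCONTEXT is the global that LibreOffice provides to Python macros.

```python
import csv


def save_replacements(path, pairs):
    """Write (original, placeholder) pairs as a tab-separated listing."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerows(pairs)


def load_replacements(path):
    """Read the tab-separated listing back as a list of (original, placeholder) pairs."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [(row[0], row[1]) for row in csv.reader(handle, delimiter="\t") if len(row) >= 2]


def apply_replacements(doc, pairs):
    """Run one replace-all per pair on a Writer document; return the number of replacements."""
    total = 0
    # Longest search strings first, so "Mickey Mouse Jr" is handled before "Mickey Mouse".
    for search, replace in sorted(pairs, key=lambda pair: len(pair[0]), reverse=True):
        descriptor = doc.createReplaceDescriptor()
        descriptor.SearchString = search
        descriptor.ReplaceString = replace
        descriptor.SearchCaseSensitive = True
        total += doc.replaceAll(descriptor)
    return total


def anonymise_current_document(*args):
    """Macro entry point: apply the listing to the document in the active window."""
    doc = XSCRIPTCONTEXT.getDocument()  # only defined when run as a LibreOffice macro
    pairs = load_replacements("/path/to/replacements.tsv")  # hypothetical path
    apply_replacements(doc, pairs)
```

To restore the document afterwards, the same apply_replacements can be called with each pair swapped, i.e. [(placeholder, original) for original, placeholder in pairs].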