HILANO - Human-in-the-Loop learning approaches for distributed incremental anonymization of personal data
The core of this project is the anonymization of personal data in documents, especially medical and legal ones. Better (semi-)automatic anonymization would yield considerable cost savings for existing anonymization tasks and enable Big Data analyses and Text Mining even in sensitive areas. Anonymization under data protection law is a difficult task: not only must certain information be found precisely, but its role in the respective text and life situation must also be understood. In addition, indirect identification must be prevented to achieve maximum data protection. To solve these problems, the technology must to a certain extent understand both the text and the world from which it originates. This can only be done with human help. The core of our methodology is therefore a distributed machine learning approach with the "Human in the Loop (HiL)": anonymization decisions suggested by the technology are corrected and supplemented by humans. These changes are used to continuously improve the anonymization models. We will research different architectures for utilizing distributed signals in a centralized model, which is then re-distributed to produce anonymization suggestions.
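As a rough illustration of the HiL loop described above, the following Python sketch walks through one round: clients receive the current model, suggest spans to anonymize, a human corrects the suggestions, and the corrections flow back into a central model that is then re-distributed. It is a toy under strong assumptions and not the project's actual architecture: the "model" is just a set of lowercase sensitive terms, and all names (`CentralModel`, `suggest`, `corrections`, `redact`) as well as the example sentences are hypothetical.

```python
PUNCT = ".,;:!?"


class CentralModel:
    """Toy stand-in for the centralized model: a set of lowercase sensitive terms."""

    def __init__(self, terms):
        self.sensitive_terms = set(terms)
        self.version = 0

    def update(self, all_corrections):
        """Fold the corrections collected from all clients into the model."""
        for term, is_sensitive in all_corrections:
            if is_sensitive:
                self.sensitive_terms.add(term.lower())
            else:
                self.sensitive_terms.discard(term.lower())
        self.version += 1

    def snapshot(self):
        """Copy of the current terms, as it would be re-distributed to clients."""
        return set(self.sensitive_terms)


def suggest(text, terms):
    """Client side: propose every token that the distributed model marks as sensitive."""
    return {w.strip(PUNCT) for w in text.split() if w.strip(PUNCT).lower() in terms}


def corrections(suggested, approved):
    """Turn the human's final decision into signals for the central model."""
    missed = [(t, True) for t in sorted(approved - suggested)]   # human added these
    wrong = [(t, False) for t in sorted(suggested - approved)]   # human rejected these
    return missed + wrong


def redact(text, approved):
    """Replace approved tokens with a placeholder, keeping punctuation."""
    out = []
    for word in text.split():
        core = word.strip(PUNCT)
        out.append(word.replace(core, "[REDACTED]") if core in approved else word)
    return " ".join(out)


# One round of the loop with two hypothetical clients.
model = CentralModel({"smith", "may"})

doc_a = "Dr. Smith treated Anna Jones."
approved_a = {"Smith", "Anna", "Jones"}            # human adds two missed names
signals_a = corrections(suggest(doc_a, model.snapshot()), approved_a)

doc_b = "Jones may return in May."
approved_b = {"Jones"}                             # human rejects "may"/"May"
signals_b = corrections(suggest(doc_b, model.snapshot()), approved_b)

model.update(signals_a + signals_b)                # central aggregation
next_suggestions = suggest("Jones may call Anna.", model.snapshot())
redacted_a = redact(doc_a, approved_a)
```

In this sketch only the corrections leave the client, not the documents themselves. A real system would replace the term set with a learned sequence-labeling model and would need a more careful strategy for aggregating conflicting signals from different clients.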
The work plan covers building the distributed architecture for text and OCR image documents in medicine and in the legal context (jurisprudence and justice). Research is needed on the learning environment for the target data and on measuring the degree of anonymization; the technical challenges lie in deployment across the different domains. Furthermore, the legal aspects will be examined in a legal expert opinion, the project implementation will be accompanied by data protection law advice, and the actual legal findings will be used as a basis for legal research.
Within the project, automated approaches to anonymizing personal data will be researched in concrete application contexts. The main research challenge is turning models known from the literature into prototypes suitable for practical use. This concerns both their fit with the incremental HiL approach and their scaling to many distributed clients.
- September 2019 - May 2022
- Prof. Dr. Chris Biemann
- Dr. Abhik Jana