Dear all, thank you for raising the discussion and reflecting on the ethical issues and implications of our GermEval 2020 task on Prediction of Intellectual Ability and Personality Traits from Text. We agree that ethics in NLP is an important problem and that we as a community should be aware of the implications of applying NLP/AI research. We also agree that the rather technical description of our task does not reflect these issues; in particular, IQ tests, like all other psychometric tests, are a questionable instrument for many reasons, including the ones that were raised. We appreciate that a strong discourse and a reflected discussion on ethics, possible biases, and the consequences and purposes of this task have emerged, as those are pressing aspects one should always address when working on anything related to aptitude diagnostics and personality traits.

What we do not agree with is the categorical suggestion by some of you to pull the plug and never do this kind of research. As a matter of fact, NLP is already used in assessment centers like the one we obtained the data from, but it is used in a crude and unscientific way. We as a community should investigate how to use our NLP expertise to understand the assessment problem and improve the current assessment models, which would include both technical and ethical work. The scientific value of this line of research would avoid more harm than it causes. Refusing to do research at all because it could potentially be misused is an argument against research as a whole.

Regarding our overall aims: we are interested in how textual descriptions of images are related to IQ test outcomes and high school grades (resp. analyzed operant motives for subtask 2). We are interested in which models are most suitable and which aspects of the text are most informative. However, a correlation analysis would not make for a good shared task, so we decided to propose a ranking task (resp. classification task) based on the texts. In particular, we are not interested in automating IQ testing and using it as a selector for college applications. In fact, collecting the texts is more costly than the license fees for an IQ test.

NORDAKADEMIE, a small local college with a combined degree program and vocational job training, receives more applications than it can host, thus calling for some form of selection. Many colleges use (a reweighted version of) high school grades, which in turn have biases. NORDAKADEMIE instead uses a broad aptitude test that takes multiple hours and combines multiple measures, e.g. a math test, an English test, implicit motives, IQ values, and Kahnemann assertions. No applicant is declined on the basis of any single one of those parts, and only a small fraction receives a rejection from this part of the application process. Final acceptance decisions are made by the companies (which provide the vocational job training), and these may use less structured (and potentially more biased) selection criteria.

Regarding racial bias: unfortunately, the German education system is known to have a socio-economic bias, which leads to a vast under-representation of people with a migration background in higher education and which is absolutely worth fighting against. This, paradoxically, leads to our data being less prone to the influence of such biases.
While we, of course, do not have access to names, nationality, or other demographic data in the dataset (such as race: we would never record this in Germany, by the way, as doing so would already be seen as discriminatory at the data collection stage under EU law), take a look at the group picture of the 2019 class of NORDAKADEMIE (the source of this data) to see what we mean: https://www.nordakademie.de/news/nachwuchs-fuer-die-wirtschaft-nordakademie-verabschiedet-293-absolventinnen-und-absolventen (as brought up on Twitter).

Regarding commercial interests, a point raised by Emily on Twitter (https://twitter.com/emilymbender/status/1202408144879538176): while there is of course interest from the people who provide this data, we find it remarkable that the data is made available freely. We aim to share the data with the international scientific community, to better understand and learn from the data and to discuss interesting findings publicly, for the benefit of everyone. Note that this is the entire data that currently exists, not a sub-sample, so it likewise supports the commercial interests of competitors. The interests of us task organizers are strictly scientific. There is no funding for this task, neither from public nor from commercial sources.

As there are many aspects to the task, to the origin of the data, and to the broader picture of, e.g., the German college application system, the aptitude diagnostics performed, and actions against possible biases (e.g. racism, sexism, or social discrimination), we are currently working on a better and more reflected description of these matters, which you will find shortly on every task-related website. We aim to provide as much detail and depth as we possibly can, to honor the sensitivity of this topic.

So, how do we turn this upheaval into something that moves us as a community forward? We will take this as an opportunity to clarify both the background and the research questions of our task, revising the workshop page and the call for papers to avoid being misunderstood like this: the task is in no way meant to promote the instrument of IQ testing, nor to automate it based on text; it is in no way an attempt to label people as stupid; and it in no way presupposes that there is one inherent intellectual ability, etc. We are also considering a call for papers on the critical reception of psychometrics for this event, or a panel discussion on the topic during the workshop, to create awareness of potential issues with these metrics.

Best,
Dirk Johannßen, Chris Biemann, Steffen Remus, Timo Baumann, David Scheffer, as well as the other organizers.