GermEval 2020 Shared Task on the Classification and Regression of Cognitive and Motivational Style from Text
https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-cognitive-motive.html

[Apologies for cross-postings]

2nd Call for Participation
We invite interested parties from academia and industry to participate in this shared task. 

The validity of high school grades as a predictor of academic development is controversial. Researchers have found indications that linguistic features such as function words used in a prospective student's writing perform better in predicting academic development (Pennebaker et al., 2014) than other methods such as GPA values.

During an aptitude test, participants are asked to write freely associated texts in response to questions about images they are shown. Psychologists can identify so-called implicit motives from these texts. Implicit motives are unconscious motives that are measurable by operant methods. Psychometrics are metrics that can be used to assess psychological phenomena; operant methods are psychometrics collected by having participants write free texts (Johannßen et al., 2019). The motives identified in these texts are said to be predictors of behavior and long-term development (McClelland, 1988; Scheffer, 2004; Schultheiss, 2008).

Operant motives are unconscious intrinsic desires that can be measured by implicit or operant methods, such as the Operant Motive Test (OMT) or the Motive Index (MIX). Psychologists label the textual answers with one of four motives and corresponding levels. The identified motives allow psychologists to predict behavior and long-term development. For our task, we provide extensive amounts of textual data from both the OMT and the MIX, paired with IQ scores, high school grades and labels.

With this task, we aim to foster novel research at the intersection of NLP and the psychology of personality and emotion. The task focuses on using German psychological text data to study the connection between text and cognitive and motivational style. To this end, contestants are asked to build systems that reproduce an artificial 'rank' and that classify image descriptions which psychologists examine for implicit motives.

Tasks
This shared task consists of two subtasks, described below. Participants are free to participate in either one of them or both.

+++++ Subtask 1: 
Regression of artificially ranked cognitive and motivational style. 


The task is to predict measures of cognitive and motivational style based solely on text. For this, z-standardized high school grades and intelligence quotient (IQ) scores of college applicants are summed and globally ranked.
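The construction of the artificial rank described above can be illustrated with a minimal sketch. It assumes population z-scores, that a higher summed score receives a better (lower) rank number, and that ties are broken by input order; none of these details are specified in this call, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Return population z-scores (assumes the values are not all identical)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def artificial_rank(grades, iq_scores):
    """Sum z-standardized grades and IQ scores and rank all students.

    Assumption: rank 1 goes to the highest summed score. The real data
    may use a different sign convention or tie handling.
    """
    z_sum = [g + q for g, q in zip(z_standardize(grades), z_standardize(iq_scores))]
    # Indices of students ordered from highest to lowest summed score.
    order = sorted(range(len(z_sum)), key=lambda i: z_sum[i], reverse=True)
    ranks = [0] * len(z_sum)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks
```

For example, `artificial_rank([1.0, 2.0, 3.0], [100, 110, 130])` assigns rank 1 to the third student, whose summed score is highest under these assumptions.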

For the final results, participants of this shared task will be provided with a MIX_text only and are asked to reproduce the artificial 'ranking' of each student relative to all students in a collection (i.e. within the test set).

The data is delivered in three files, one containing participant data, the other containing sample data, which are linked by a student ID. The rank in the sample data reflects the averaged performance relative to all instances within the collection (i.e. within train / test / dev), which is to be reproduced for the task.

System performance will be measured with the Pearson rank correlation coefficient.
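As a rough illustration of the evaluation measure, the sketch below computes Pearson's r between predicted and gold rank values. This is one possible reading of "Pearson rank correlation coefficient" (for untied ranks, Pearson's r on ranks coincides with Spearman's rho); the official scorer may differ, and the function name is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if either sequence is constant.
    return covariance / (spread_x * spread_y)
```

A system that reproduces the gold ranking exactly would score 1 under this reading, and one that inverts it would score -1.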

+++++ Subtask 2: 
Classification of the Operant Motive Test (OMT). 

Operant motives are unconscious intrinsic desires that can be measured by implicit or operant methods, such as the Operant Motive Test (OMT) (Kuhl and Scheffer, 1999). During the OMT, participants are asked to write freely associated texts in response to provided questions and images. An exemplary illustration can be found in the Data area. Trained psychologists label these textual answers with one of four motives. The identified motives allow psychologists to predict behavior, long-term development, and subsequent success.

For this shared task, participants will be provided with an OMT_text and are asked to predict the motive and level of each instance. System performance will be measured with the macro-averaged F1-score.
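The macro-averaged F1-score gives every class the same weight regardless of its frequency. The sketch below is a minimal illustration, assuming each instance carries a single label string and that the average runs over all labels seen in either the gold or the predicted data; the label strings in the usage note are hypothetical placeholders, not the task's actual label set.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        # A class with no correct predictions contributes an F1 of 0.
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)
```

With gold labels `["A", "A", "M", "L"]` and predictions `["A", "M", "M", "L"]`, the per-class F1 scores are 2/3, 1 and 2/3, so the macro average is 7/9.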

Data
Since 2011, the private university of applied sciences NORDAKADEMIE has conducted an aptitude test as part of its college application procedure, in which participants state their high school performance and take an IQ test and a psychometric test called the Motive Index (MIX). The MIX measures so-called implicit or operant motives by having participants answer questions about images, such as "who is the main person and what is important for that person?" and "what is that person feeling?". Furthermore, participants answer the question of what motivated them to apply to the NORDAKADEMIE.

The data consists of a unique ID per entry, one ID per participant, the applicant's major, high school grades and IQ scores, with one textual expression attached to each entry. High school grades and IQ scores are z-standardized for privacy protection. In total there are 2,595 participants, who produced 77,850 unique MIX answers. The shortest textual answers consist of 3 words, the longest of 42, and on average there are roughly 15 words per textual answer with a standard deviation of 8 words.

The available data set has been collected and hand-labeled by researchers of the University of Trier. More than 14,600 volunteers participated in answering questions to 15 provided images. The pairwise annotator intra-class correlation was r = .85 on the Winter scale (Winter, 1994). The length of the answers ranges from 4 to 79 words with a mean length of 22 words and a standard deviation of roughly 12 words.

Submissions for the validation set via the Codalab page are accepted and published on a leaderboard from January 1st. From May 1st, we will start the final evaluation phase of the task by providing the gold labels of the validation set, which can be used as additional training data. Additionally, the test set samples will be provided, for which we accept submissions until June 1st.

Task webpage: https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-psychopred.html 

Important Dates
- 01-Dec-2019: Release of trial data and systems
- 01-Jan-2020: Release of training data (train + validation)
- 08-May-2020: Release of test data
- 01-Jun-2020: Final submission of test results
- 03-Jun-2020: Submission of description paper
- 04-11-Jun-2020: Peer reviewing: participants are expected to review other participants' system descriptions
- 12-Jun-2020: Notification of acceptance and reviewer feedback
- 18-Jun-2020: Camera-ready deadline for system description papers
- 23-Jun-2020: Workshop in Zurich, Switzerland at the KONVENS 2020 and SwissText joint conference

The shared task will be accompanied by a pre-conference workshop of the Conference on Natural Language Processing ("Konferenz zur Verarbeitung natürlicher Sprache", KONVENS) hosted on June 23, 2020, in Zurich (https://swisstext-and-konvens-2020.org/).


Workshop Proceedings
Description papers will appear in online workshop proceedings. Participants who submit a description paper will be asked to register at the workshop and present their system as a poster or in an oral presentation (depending on the number of submissions).


Organizers
The shared task is organized by Dirk Johannßen, Chris Biemann, Steffen Remus and Timo Baumann from the Language Technology group of the University of Hamburg (https://www.inf.uni-hamburg.de/en/inst/ab/lt/home.html), as well as David Scheffer from the NORDAKADEMIE Elmshorn, Nicola Baumann from the Universität Trier and Gudula Ritz from Impart GmbH (Germany).


GermEval
GermEval is a series of shared task evaluation campaigns that focus on Natural Language Processing for the German language. GermEval has been conducted four times since 2014 in co-location with KONVENS/GSCL conferences. For an overview of the currently conducted tasks, visit https://swisstext-and-konvens-2020.org/shared-tasks/.


+++++ Background Information 
The Aptitude Test, College and criticism

Since 2011, the private university of applied sciences NORDAKADEMIE has conducted an aptitude test as part of its college application procedure.

Zimmerhofer and Trost (2008, p. 32ff.) describe the developments of the German Higher Education Act. A so-called Numerus Clausus (NC) Act from 1976 and 1977 ruled that colleges in Germany with a significant number of applications have to employ a form of selection mechanism. For most colleges, the NC was the threshold for many applicants. Even though this value is more complex, it can roughly be understood as a GPA threshold. Since this second Higher Education Act, colleges are also free to employ alternative forms of selection, as long as they are scientifically sound, transparent and commonly accepted in Germany (BVerfGE 43, 291 - numerus clausus II).

Even though Hell, Trapmann and Schuler (2008, p. 46) found high school grades, with a correlation coefficient of r = 0.517, to be the most applicable measure of academic suitability, criticism emerged as well. The authors criticized that grades awarded by a single institution (i.e. a high school) do not reflect the complexity of such a widely questioned concept as intellectual ability. Schleithoff (2015, p. 6) studied the development of high school grades across German federal states with regard to grade inflation and found evidence that supports this claim. Furthermore, in most parts of Germany, the participation grade makes up 60% of the overall grade and is thus highly subjective.

Since operant motives are said to be less prone to subjectivity, the NORDAKADEMIE decided to employ an assessment center (AC) for research purposes and a closely related aptitude test for the application procedure (NORDAKADEMIE, 2018). Rather than filtering for the best applicants, the NORDAKADEMIE aims to use the test to find and protect applicants who are suspected of lacking the skills required at the college (Sommer, 2012). Thus, every part of the aptitude test is skill-oriented.

During an hour-long aptitude test, participants are asked to write freely associated texts in response to provided questions and images. The motives identified in these texts are said to be predictors of behavior and long-term development. The test contains multiple parts, e.g. a math and an English test, Kahnemann scores, intelligence quotient (IQ) scores, a visual questionnaire, knowledge questions about the chosen major, and the test of implicit motives, called the Motive Index (MIX).

The MIX measures implicit or operant motives by having participants answer questions about images like the one displayed on the Tasks tab, such as "who is the main person and what is important for that person?" and "what is that person feeling?". Furthermore, participants answer the question of what motivated them to apply to the NORDAKADEMIE.

Even though parts of this test are questionable and currently under discussion, no single part of the test leads to an application being rejected. Only when a significant number of test parts are well below a threshold may applicants not enter the second stage of the application process, which is applying at a private company due to the integrated study program the college offers. Roughly 10 percent of all applicants who take this aptitude test are rejected on the basis of the test. Every applicant has the option to decline the use of their data for research purposes and can still apply to study at the NORDAKADEMIE. All anonymized data instances stem from college applicants who consented to their data being used in this type of research setting and who have the opportunity to see any stored data or to have their personal data (e.g. gender, age, field of study) deleted at any time.

Any research performed on this aptitude test or the annually conducted assessment center (AC) at the NORDAKADEMIE is carried out under the premise of researching methods for supporting personnel decision-makers, never of creating fully automated, stand-alone filters (NORDAKADEMIE, 2019). First, since models may always be flawed and could inherit biases, such filters would be highly unethical. Second, German law prohibits the use of any decision or filter system, technical or non-technical, that cannot be fully and transparently explained. Aptitude diagnostics in Germany are highly legally regulated.

The most debated part of the aptitude test is the intelligence quotient (IQ). Intelligence in psychology is understood as the results measured by an intelligence test (and thus not the intelligence of individuals itself). Furthermore, intelligence is always a product of both genes and the environment. Even though there are hints that the IQ does not measure intellectual ability but rather cognitive and motivational style (DeYoung, 2011), it is defined and broadly understood as such.

Mainly companies in Europe employ IQ tests for selecting capable applicants. In the United Kingdom, roughly 69 percent of all companies utilize the IQ. In Germany, the estimate is 13 percent (Nachtwei & Schermuly, 2009).

Since IQ tests only measure performance in certain tasks that call for skill in specific areas (logic, language, problem-solving) rather than cognitive performance, such intelligence tests should rather be called comprehension tests. Minorities can be discriminated against by tests that are biased due to unequal environmental circumstances and measurements in non-representative groups (Rushton & Jensen, 2005). Research on the connection between implicit motives and intelligence testing could help to improve early development and guided support.

It is this bias that leads to unequal opportunities, especially in countries with a richly diverse population. Intelligence testing has a dark history. Eugenics during the great wars, e.g. the sterilization of citizens in the US (Buck v. Bell) or the practices in Germany during the Third Reich, is among the most gruesome parts of history.

But even in modern days, the IQ is misused. Recently, IQ scores have been used in the US to determine which death row inmates shall be executed and which might be spared. Since IQ scores show too large a variance, the Supreme Court has ruled against this definite threshold of 70 (Hall v. Florida). However, Sanger (2015) has researched an even more current practice of 'racial adjustment': adjusting the IQ of minorities upwards to counteract the racial bias in IQ testing, with the result that death row inmates who originally were below the 70-point threshold are executed.

There is an ethical necessity to carefully view, understand and research the way intelligence testing is conducted and how those scores are, if at all, correlated with what we understand as 'intelligence', as they might reflect mere cognitive and motivational styles. Further valuable research can be conducted to investigate connections between other personality tests, such as tests of implicit motives, and intelligence or comprehension tests. Racial biases are measurable, variances are great, and many critics state that IQ scores reflect skill or cognitive and motivational style rather than real intelligence, as it is broadly understood.

Regarding commercial interests: While of course there is interest from the people that provide this data, we find it remarkable that the data is made available freely. We aim to share the data with the international scientific community, to better understand and learn from the data and discuss interesting findings publicly, for the benefit of everyone. Note that this is the entire data that currently exists, not a sub-sample, so it likewise supports the commercial interests of competitors. Furthermore, professors at universities of applied sciences in Germany (especially private colleges) are supposed to work in private industry in their specific research field (Wikipedia, 2019). Thus, an alleged conflict of interest is a result of the educational system in Germany. The interests of the task organizers are strictly scientific. There is no funding for this task, neither from public nor from commercial sources.

References

BVerfGE. (1977). BVerfGE. 43, 291 - numerus clausus II. URL http://www.servat.unibe.ch/dfr/bv043291.html [12/08/2019]

DeYoung, C. G. (2011). Intelligence and personality. In R. J. Sternberg & S. B. Kaufman (Eds.), Cambridge handbooks in psychology. The Cambridge handbook of intelligence (p. 711-737). Cambridge University Press. https://doi.org/10.1017/CBO9780511977244.036 [12/08/2019]

Hall v. Florida. Docket number 12-10882. SCOTUSblog. 27 May 2014. Retrieved 29 May 2014.

Hell, Benedikt, Sabrina Trapmann und Heinz Schuler (2007). Eine Metaanalyse der Validität von fachspezifischen Studierfähigkeitstests im deutschsprachigen Raum. In: Jahrgang 21 (Heft 3), S. 251-270.

Johannßen, D., Biemann, C. and Scheffer, D. (2019): Reviving a psychometric measure: Classification and prediction of the Operant Motive Test. Proceedings of CLPsych 2019, Minneapolis, MN, USA.

Kuhl, Julius and Scheffer, David. 1999. Der operante Multi-Motiv-Test (OMT): Manual [The operant multi-motive-test (OMT): Manual]. Impart, Osnabrück, Germany: University of Osnabrück.

McClelland, David C. 1988. Human Motivation. Cambridge University Press.

Nachtwei, Jens & Schermuly, Carsten. (2009). Acht Mythen über Eignungstests. Harvard Business Manager. 09. 6-10.

NORDAKADEMIE. (2018). Campus Forum Nr. 66/Juni 2018. P. 8. Assessment Center an der NORDAKADEMIE. [online] Available at: https://www.nordakademie.de/sites/default/files/2019-08/CF_66_final.pdf [12/08/2019].

NORDAKADEMIE b. (2019). Datenschutzbestimmungen. [online] Available at: https://auswahltest.nordakademie.de/datenschutz [12/08/2019].

NORDAKADEMIE. (2019). Digitale Unterstützung für Personaler - Mitarbeitende finden mithilfe von Künstlicher Intelligenz. [online] Available at: https://www.nordakademie.de/news/digitale-unterstuetzung-fuer-personaler-mitarbeitende-finden-mithilfe-von-kuenstlicher [12/08/2019].

Pennebaker, James W., Chung, C. K., Frazee, J., Lavergne, G. M., and Beaver, D. I., When small words foretell academic success: The case of college admissions essays, PLOS ONE, vol. 9, no. 12, e115844, 2014, issn: 1932-6203. doi: 10.1371/journal.pone.0115844. [Online]. Available: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115844.

Rushton, J. P., & Jensen, A. R. (2005). Thirty years of research on race differences in cognitive ability. Psychology, Public Policy, and Law, 11(2), 235-294. https://doi.org/10.1037/1076-8971.11.2.235

Sanger, Robert M., IQ, Intelligence Tests, 'Ethnic Adjustments' and Atkins (November 21, 2015). American University Law Review, Vol. 65, No. 1, 2015. Available at SSRN: https://ssrn.com/abstract=2706800

Scheffer, David. 2004. Implizite Motive: Entwicklung, Struktur und Messung [Implicit Motives: Development, Structure and Measurement]. Hogrefe Verlag, Göttingen, Germany, 1st edition.

Schleithoff, Fabian (2015). Noteninflation im deutschen Schulsystem - Macht das Abitur hochschulreif. In: ORDO - Jahrbuch für die Ordnung von Wirtschaft und Gesellschaft. Bd. 66. De Gruyter Oldenbourg, S. 3-26. ISBN: 978-3-8282-0621-2.

Schultheiss, O. C. (2008). Implicit motives. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality: Theory and research (p. 603-633). The Guilford Press.

Sommer, Kristina. (2012). Erst testen, dann bewerben. [online] Available at: https://idw-online.de/de/news492748 [12/08/2019].

Wikipedia. 2019. Fachhochschule. [online] Available at: https://en.wikipedia.org/wiki/Fachhochschule [12/08/2019].

Winter, David. 1994. Manual for scoring motive imagery in running text. Dept. of Psychology, University of Michigan (unpublished).

Zimmerhofer, Alexander und Günter Trost (2008). Auswahl- und Feststellungsverfahren in Deutschland - Vergangenheit, Gegenwart und Zukunft. In: Studierendenauswahl und Studienentscheidung. 1., Aufl. Hogrefe Verlag, S. 32-42.