Acquisition and Integration of Public Data to Improve Detection of Scientific Fraud

  • E. Westerbaan

Student thesis: Master's Thesis


Situation. The output of scientists is assessed by metrics such as the number of publications and citations. To turn these metrics to their benefit, some scientists commit various forms of fraud. For fraud in publications, such as plagiarism, various detection methods are in place. However, for post-publication fraud (fraud conducted after the actual production of an article), detection methods are not structurally in place, and detection requires a lot of human effort. Detection of these types of fraud can be improved by directing this human effort to the most egregious cases. However, generic approaches to detecting these cases have failed because:
• Detection is applied to the whole population: a person can be an outlier within a subgroup (e.g. an editor targets their own journal, but other editors do not), but not within the whole population (other authors also target that journal);
• The public datasets on which this research is based are too limited.

Research. In this research we investigate the added value of enriching existing public datasets with publicly available data that has not yet been integrated, in order to direct manual effort for fraud detection in the publication process. As validation we apply group-based outlier detection.
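To illustrate why group-based detection can surface cases that whole-population detection misses, the following is a minimal sketch, not the method used in the thesis. It assumes a modified z-score based on the median absolute deviation (MAD) as the outlier test, and all names, roles and counts are invented for illustration. Each record holds a person, a role, and a hypothetical number of papers that person published in one journal.

```python
from collections import defaultdict
from statistics import median


def mad_outliers(counts, threshold=3.5):
    """Return the keys whose modified z-score (MAD-based) exceeds threshold.

    The MAD test is an assumption made for this sketch; the thesis abstract
    does not state which outlier test was used.
    """
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return set()  # no spread, so nothing can be flagged by this test
    return {
        person
        for person, v in counts.items()
        if 0.6745 * abs(v - med) / mad > threshold
    }


def group_outliers(records, threshold=3.5):
    """Apply the outlier test separately within each role (subgroup)."""
    groups = defaultdict(dict)
    for person, role, count in records:
        groups[role][person] = count
    return {role: mad_outliers(m, threshold) for role, m in groups.items()}


# Hypothetical data: (person, role, papers published in one journal).
records = [
    ("a1", "author", 9), ("a2", "author", 11), ("a3", "author", 10),
    ("a4", "author", 12), ("a5", "author", 8), ("a6", "author", 10),
    ("e1", "editor", 1), ("e2", "editor", 2), ("e3", "editor", 1),
    ("e4", "editor", 2), ("e5", "editor", 10),
]

# Whole population: editor e5 blends in with the regular authors.
whole = mad_outliers({person: count for person, _, count in records})

# Per subgroup: e5 stands out among the other editors.
per_group = group_outliers(records)

print(whole)                # set()
print(per_group["editor"])  # {'e5'}
```

In this toy data, the editor with ten papers is unremarkable next to regular authors but clearly deviates from the other editors, which is the situation the first bullet above describes.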
Main contributions. Our main contributions are:
1. An improved set-theoretic publication data model that can be used to reason about the publication model. Sources for populating this model are discussed.
2. Acquisition methods to gather the data necessary to improve existing publicly available datasets. These enriched datasets can be used for fraud detection.
3. Case studies in which we define sets gathered from public and additionally acquired data. On these sets we apply a limited analysis to identify outliers. This demonstrates that our approach works.
4. Recommendations to the research community that improve the detection of fraud in the publication process.
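As a rough illustration of what reasoning over a set-theoretic publication model might look like, the sketch below uses plain Python sets. The thesis's actual model is not given in this abstract, so the `Publication` tuple, the `editors` mapping and all data values are hypothetical. The example derives the set of (editor, journal) pairs in which an editor has authored a paper in a journal they edit.

```python
from typing import NamedTuple


class Publication(NamedTuple):
    """Hypothetical publication record used only for this sketch."""
    paper_id: str
    journal: str
    authors: frozenset


def authors_in(publications, journal):
    """Set of all authors with at least one paper in the given journal."""
    return set().union(
        *(p.authors for p in publications if p.journal == journal)
    )


def self_publishing_editors(publications, editors):
    """Pairs (editor, journal) where the editor also authored in that journal."""
    return {
        (editor, journal)
        for journal, board in editors.items()
        for editor in board & authors_in(publications, journal)
    }


# Invented example data: three papers and two editorial boards.
publications = {
    Publication("p1", "J1", frozenset({"alice", "bob"})),
    Publication("p2", "J1", frozenset({"carol"})),
    Publication("p3", "J2", frozenset({"alice"})),
}
editors = {"J1": {"alice"}, "J2": {"dave"}}

pairs = self_publishing_editors(publications, editors)
print(pairs)  # {('alice', 'J1')}
```

A set such as `pairs` is not evidence of fraud by itself. It only defines a subgroup on which an outlier analysis could then be run.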

Key points. The results of this research are:
• For this study, it is not feasible to create one integrated dataset out of multiple publicly available datasets;
• Enriching standard datasets (datasets prepared for publication process analysis) is made more difficult because the data is scattered and lacks structure;
• Applying a group-based approach to enriched data yields usable results.
Date of Award: 1 Apr 2022
Original language: English
Supervisor: Hugo Jonker (Examiner) & Arjen Hommersom (Supervisor)

Master's Degree

  • Master Software Engineering
