SADDLE, SECTION ALIGNMENT TO DETECT DOCUMENT LINKS EFFICIENTLY

  • John Stegink

Student thesis: Master's Thesis

Abstract

Finding the correct information is essential to knowledge workers for the optimal performance of their work. They use search engines to find this information in large collections of documents. When documents are found, links to similar documents (semantic links) can help them navigate through this information. Often, subject-related information is only available in knowledge bases with a relatively small set of documents, sometimes in languages other than English.
Existing research on creating semantic links focuses on small documents written in English and relies on large sets of documents to train the software. This research aims to overcome these limitations by creating software called SADDLE, which creates semantic links in document collections of fewer than 10K documents in English or Dutch.
SADDLE uses existing text embedding methods to find relations between sections in documents, focusing on performance and the use of computer resources. A neural network is trained to translate the section similarities into document similarities. Existing datasets that can be used to train and test SADDLE were found during this research. However, the datasets are generic and contain only a few documents. To be able to conduct tests on representative document sets, software was created to generate sets of Wikipedia documents about a specific subject using existing Wikipedia-based ontologies and the meta information contained in the Wikipedia hyperlinks. Unfortunately, the quality of the generated document set was too low for training and validation of SADDLE.
When using the existing document sets found during the research, the conclusion is that SADDLE does not improve the quality of semantic links compared to applying the embedding algorithm to the entire document. SADDLE is not trained optimally because the number of documents in the datasets is low. New research could potentially address this limitation by using larger datasets. Creating datasets on a specific subject could probably be improved by using metadata other than hyperlinks. Finally, software was created for a user interface to display the relations between sections of documents; this user interface could be improved to be used in knowledge bases. The sources of all created software are publicly available.
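
The section-alignment idea summarized above can be illustrated with a minimal sketch. It is not SADDLE's implementation: the section vectors are hypothetical toy values standing in for real text embeddings, the function names are invented for illustration, and a simple average of best section matches is swapped in for the trained neural network that the thesis uses to turn section similarities into a document similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def section_similarity_matrix(doc_a, doc_b):
    """Pairwise similarities: rows are sections of doc_a, columns sections of doc_b."""
    return [[cosine(sec_a, sec_b) for sec_b in doc_b] for sec_a in doc_a]


def document_similarity(matrix):
    """Assumed stand-in for the trained network: mean of each row's best match."""
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)


# Hypothetical 2-dimensional "embeddings", one vector per section.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 1.0]]

matrix = section_similarity_matrix(doc_a, doc_b)
score = document_similarity(matrix)
print(matrix)
print(round(score, 4))
```

The matrix is what a section-level user interface could display directly, while the aggregated score is the single value that would be compared against a whole-document embedding baseline.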

Date of Award: 15 Feb 2024
Original language: English
Supervisor: Gideon Maillette de Buij Wenniger (Examiner) & Clara Maathuis (Supervisor)

Master's Degree

  • Master Software Engineering
