Synchronising Distributed Scraping

  • G. Meesters

Student thesis: Master's Thesis

Abstract

Price differentiation is the commercial strategy of charging different prices for the same product or service. A given e-commerce company can offer the same items through multiple outlets, such as a website or a mobile application. It is rumored that equivalent items are priced differently across such outlets. This thesis sets out to verify these rumors.
To compare outlets, data must be collected from all of them simultaneously and on a large scale. Manual collection is possible, but the amount of data that can be gathered by hand is limited. A further problem with manual extraction is that equivalent items from different outlets are unlikely to have been extracted at exactly the same time.
In this study, a distributed and synchronized web scraping system is designed. The design accommodates an unlimited number of web bots, which take jobs from a pub/sub system and synchronize with one another. To validate the design, an experiment on price differentiation in the travel industry is conducted, with a focus on flight ticket prices.
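As a rough illustration of the idea of bots that take jobs from a pub/sub system and synchronize before scraping, the following minimal sketch runs one bot per outlet in a single process. It is not the design from the thesis: the function name `run_synchronized_scrape`, the `fetch` callback, and the use of in-memory queues and a thread barrier in place of a real message broker are all assumptions made for illustration.

```python
import queue
import threading
import time


def run_synchronized_scrape(jobs, outlets, fetch):
    """Scrape every job on every outlet, with all bots starting each job together.

    `fetch(outlet, job)` is a hypothetical callback that returns a price.
    """
    # One queue per bot stands in for a pub/sub subscription:
    # every published job is delivered to every subscriber, in order.
    subscriptions = {outlet: queue.Queue() for outlet in outlets}
    # The barrier releases only when all bots are ready for the same job.
    barrier = threading.Barrier(len(outlets))
    results = []
    results_lock = threading.Lock()

    def bot(outlet):
        while True:
            job = subscriptions[outlet].get()
            if job is None:  # sentinel: the publisher has no more jobs
                break
            barrier.wait()  # block until every bot holds this job
            started = time.monotonic()
            price = fetch(outlet, job)
            with results_lock:
                results.append((job, outlet, started, price))

    threads = [threading.Thread(target=bot, args=(o,)) for o in outlets]
    for thread in threads:
        thread.start()

    # Publish each job (and finally the sentinel) to all subscribers.
    for job in list(jobs) + [None]:
        for subscription in subscriptions.values():
            subscription.put(job)

    for thread in threads:
        thread.join()
    return results
```

Calling `run_synchronized_scrape(["AMS-JFK"], ["website", "app"], fetch)` with a suitable `fetch` yields one `(job, outlet, start_time, price)` tuple per outlet, so prices for the same item can be compared knowing the requests started at nearly the same moment. In a truly distributed setting the barrier would have to be replaced by a network-level coordination mechanism, which this sketch does not attempt to model.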
Date of Award: 27 Aug 2021
Original language: English
Supervisor: Hugo Jonker (Examiner) & Benjamin Krumnow (Co-assessor)
