Final Thesis: ETL Data Pipelines Configurations in Spark

Abstract: The JValue Open Data Service (ODS) is an ETL data pipeline that extracts data from different source systems (Extract), transforms the extracted data (Transform), and loads the data into a target database (Load). Various stream processing engines exist to cope with data of high volume, variety, and velocity. Existing ETL pipelines cannot be applied across different streaming services, and the use of various frameworks and programming languages adds complexity. Among the available streaming services, Apache Spark offers accelerated, reusable, and scalable ETL pipelines. This thesis proposes an approach to compile and configure a data pipeline so that it can run on Apache Spark.

Keywords: ETL pipeline, stream processing

PDF: Bachelor Thesis

Reference: Gizem Batmaci. ETL Data Pipelines Configurations in Spark. Bachelor Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Design and Implementation of a Version Control System for Open Data Modelling Projects

Abstract: Many modern software applications and research projects depend on access to high-quality data sources. Even though a large number of data sets are already openly available, they are often hard to (re)use due to barriers such as incomplete documentation and wrong or missing values. To address these barriers, the JValue Project has been established by the Professorship of Open Source Software at Friedrich-Alexander-Universität Erlangen-Nürnberg. The goal of the JValue Project is to “make open data easy, safe, and reliable”. In the context of the JValue Project, numerous software applications are being developed which, among other things, allow users to explicitly define the structure and further meta information of openly available data sets. However, it is currently neither possible to collaborate with other individuals on such data source configurations, nor to retrace the historical development that led to the current state of a particular configuration. As a basis for addressing these issues, a Version Control System shall be developed that makes it possible to store, retrieve, and compare revisions of files containing data source configurations and related information. This thesis presents a concept for such a system and evaluates it by implementing a prototype that demonstrates its feasibility. As a result of this thesis, other applications developed in the context of the JValue Project can now access, create, and compare revisions in order to provide advanced collaboration and versioning features to end users.

Keywords: Version control systems, open data, collaboration

PDF: Master Thesis

Reference: Martin Buchalik. Design and Implementation of a Version Control System for Open Data Modelling Projects. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.