Final Thesis: ETL Data Pipelines Configurations in Spark

Abstract: The JValue Open Data Service (ODS) is an ETL data pipeline that extracts data from different source systems (Extract), performs transformations on the extracted data (Transform), and loads the data into a target database (Load). There are different kinds of stream processing engines that cope with data of high volume, variety, and velocity. Existing ETLs cannot be applied to different streaming services, and the use of various frameworks and programming languages adds complexity. Among different streaming services, Apache Spark offers accelerated, reusable, and scalable ETLs. This thesis aims to suggest an approach to compile and configure a data pipeline and make it runnable on Apache Spark.

Keywords: ETL pipeline, stream processing

PDF: Bachelor Thesis

Reference: Gizem Batmaci. ETL Data Pipelines Configurations in Spark. Bachelor Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Design and Implementation of a Version Control System for Open Data Modelling Projects

Abstract: Many modern software applications and research projects depend on the ability to access high-quality data sources. Even though there is already a large number of openly available data sets, such data sets are often hard to (re)use due to various barriers such as incomplete documentation, wrong or missing values, and more. To address these barriers, the JValue Project has been established by the Professorship of Open Source Software at Friedrich-Alexander-Universität Erlangen-Nürnberg. The goal of the JValue Project is to “make open data easy, safe, and reliable”. In the context of the JValue Project, numerous software applications are developed which, among other things, allow users to explicitly define the structure and further meta information of openly available data sets. However, it is currently neither possible to collaborate with other individuals on such data source configurations, nor is it possible to retrace the historic development that led to the current state of a particular configuration. To build a basis to address these issues, a Version Control System shall be developed, which makes it possible to store, retrieve, and compare revisions of files containing data source configurations and related information. This thesis presents a concept of such a system, and evaluates this concept by implementing a prototype showing its feasibility. As a result of this thesis, it is now possible for other applications developed in the context of the JValue Project to access, create, and compare revisions in order to provide advanced collaboration and versioning features to end users.
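The three operations named in the abstract (store, retrieve, and compare revisions) can be illustrated with a minimal in-memory sketch. This is not the thesis's design: the class name `RevisionStore`, the revision id format, and the use of a line-based diff are all assumptions made for illustration.

```python
import difflib
import hashlib
from dataclasses import dataclass, field


@dataclass
class RevisionStore:
    """Hypothetical in-memory store for revisions of one configuration file."""
    _contents: dict = field(default_factory=dict)  # revision id -> file content
    _history: list = field(default_factory=list)   # revision ids, oldest first

    def store(self, content: str) -> str:
        """Save a new revision and return its id (position plus content hash)."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        revision_id = f"{len(self._history)}-{digest}"
        self._contents[revision_id] = content
        self._history.append(revision_id)
        return revision_id

    def retrieve(self, revision_id: str) -> str:
        """Return the stored content of a revision."""
        return self._contents[revision_id]

    def compare(self, old_id: str, new_id: str) -> list:
        """Return a unified line diff between two stored revisions."""
        old = self.retrieve(old_id).splitlines()
        new = self.retrieve(new_id).splitlines()
        return list(difflib.unified_diff(old, new, fromfile=old_id, tofile=new_id, lineterm=""))
```

A caller would store two versions of a data source configuration and pass the two returned ids to `compare` to see which lines changed.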

Keywords: Version control systems, open data, collaboration

PDF: Master Thesis

Reference: Martin Buchalik. Design and Implementation of a Version Control System for Open Data Modelling Projects. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Giving Structure to Open Data in the JValue ODS

Abstract: Nowadays, the internet provides a lot of open data for public use. This data comes in various formats and covers plenty of subjects. The main problem is the absence of a standard: every provider can decide for themselves how the data is structured.

The JValue project is dedicated to this problem and aims to be the central point where such open data is gathered and optimized. Currently, the JValue Open-Data-Service (ODS) provides extraction, transformation, and retrieval of open data, supporting numerous protocols and data formats.

However, until now there has only been a very generic interface for retrieving this open data, since the system currently ignores any data structure. In addition, any provider can alter their data structure and upload the data after the adjustment, since they are not bound by any restrictions. This can severely restrict or even break the data gathering process.

To counteract this behavior, a process shall be introduced that allows the ODS to structure this open data. Furthermore, a schema recommendation for the data should be generated, which will then be the foundation of the remaining data gathering process.

As a consequence of the introduced data schema, it is now also possible to derive fitting database tables from those schemas. These tables should be created and filled dynamically and provide the user with a complete and easily accessible interface. As an implication of the persistent structured data, the earlier mentioned problem of frequently changing data structures can now be easily solved: the schema can be used to validate the imported and transformed data. By also adding a corresponding visual state to the data configurations, the user will be able to react to changed data structures.
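The idea of recommending a schema from sample data and then validating later imports against it can be sketched as follows. This is only a simplified illustration under stated assumptions: records are flat dictionaries, the "schema" is just a mapping from field names to observed type names, and the function names `infer_schema` and `validate` are hypothetical, not taken from the ODS.

```python
def infer_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def validate(record, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, allowed in schema.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif type(record[key]).__name__ not in allowed:
            problems.append(f"unexpected type for {key}: {type(record[key]).__name__}")
    for key in record:
        if key not in schema:
            problems.append(f"unknown field: {key}")
    return problems
```

A non-empty result from `validate` is the kind of signal that could drive the visual state mentioned above, telling the user that a provider has changed its data structure.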

Keywords: data engineering, schema recommendation, open data

PDF: Master Thesis

Reference: Alexander Mahler. Giving Structure to Open Data in the JValue ODS. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2021.

Final Thesis: Implementing an Open Data ETL Processing Engine with Kafka

Abstract: The JValue project group is developing a modeling ecosystem for Extract Transform Load (ETL) processes. Part of this ecosystem is a description model for these processes. This thesis suggests a conversion process from the description model into an Apache Kafka runtime, described in a cloud-native format, like Docker Compose. The conversion is implemented as a library and done in a multi-phase approach as known from classical compilers. In the first step, the description language is converted into a runtime-independent intermediate description and afterward into a description of a concrete runtime, in this case, Kafka. The multi-phase approach minimizes the implementation work for additional runtimes and allows runtime-independent optimization and analysis. The goal for the generated runtime is to use existing Kafka components, which is only partially possible due to the complexity of the description model.
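The multi-phase approach can be pictured with a small sketch: a first function lowers a pipeline model into a runtime-independent list of steps, and a second function maps those steps onto a topic-based runtime description. Everything here is assumed for illustration, including the model layout (`{"blocks": [...]}`), the `Step` type, and the output dictionary, which is not a real Docker Compose or Kafka schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node of the runtime-independent intermediate description."""
    name: str
    kind: str        # "extract", "transform", or "load"
    reads_from: str  # name of the upstream step, "" for the first step


def to_intermediate(model):
    """Phase 1: turn the (hypothetical) description model into a linear list of steps."""
    steps, previous = [], ""
    for index, block in enumerate(model["blocks"]):
        name = f"{block['kind']}-{index}"
        steps.append(Step(name=name, kind=block["kind"], reads_from=previous))
        previous = name
    return steps


def to_kafka_description(steps):
    """Phase 2: map each step to a service that reads one topic and writes another.

    The dictionary layout is illustrative only.
    """
    services = {}
    for step in steps:
        services[step.name] = {
            "input_topic": f"{step.reads_from}-out" if step.reads_from else None,
            "output_topic": None if step.kind == "load" else f"{step.name}-out",
        }
    return {"services": services}
```

Because only the second function knows about topics, supporting another runtime in this sketch would mean writing another phase 2 function while reusing `to_intermediate` unchanged.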

Keywords: open data, compiler, Apache Kafka

PDF: Master Thesis

Reference: Fabian Arnold. Implementing an Open Data ETL Processing Engine with Kafka. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Elasticity Concept for a Microservice-based System

Abstract: Software Elasticity is the concept of adapting available resources to the current or expected workload. This concept fits modern and stateless microservice architectures, which are scalable by design. Their scalability is closely related to Software Resilience and places new demands on cloud architectures. The JValue Open Data Service (JValue ODS) is an open data platform with a focus on Extract, Transform, Load (ETL) pipelines and aims to make the usage of open data easy, reliable, and safe. For the success of the ODS, an Elastic and therefore Resilient hosting is mandatory. This thesis deploys the ODS to an on-premise Kubernetes cluster to improve the uptime guarantee, discusses different deployment strategies, elaborates horizontal microservice scaling techniques, and operates the necessary infrastructure. This thesis uses Peffers’ Design Research Process to build a concept for Elasticity in microservice-based architectures. The concept is demonstrated and evaluated in the context of the JValue ODS.
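As an illustration of horizontal scaling, the following sketch applies the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to the target metric, rounded up. Whether the thesis relies on this exact rule is an assumption; the function name, the bounds, and the omission of the autoscaler's tolerance and stabilization behavior are simplifications.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count by the metric ratio, round up, and clamp to the bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 2 replicas at 90% average CPU and a 60% target, the rule asks for 3 replicas; at 30% against the same target, 4 replicas shrink to 2.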

Keywords: Microservices, elasticity, scalability, kubernetes, devops

PDF: Master Thesis

Reference: Aron Metzig. Elasticity Concept for a Microservice-based System. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Testing Microservice Integration with Consumer-Driven Contract Tests

Abstract: Microservice systems consist of independent, distributed services that communicate with each other over network connections. Testing service integrations can be a challenge in such systems, since several services have to run at the same time and there are many potential sources of false-negative test results.

Consumer-Driven Contract Testing (CDCT) is an approach that can be used to test both sides of a service integration independently of each other. This is achieved by decoupling the two sides of the integration with the help of a contract, which acts as an intermediary. The contract is specified by the service that uses the interface of another service and expresses its expectations of that interface.
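The decoupling through a contract can be sketched in a few lines. This is a hand-written illustration of the idea, not the API of Pact or any other CDCT tool: the contract layout, the path `/datasources/1`, and the function names are hypothetical.

```python
# Contract written by the consumer: the request it sends and the response fields it relies on.
CONTRACT = {
    "request": {"method": "GET", "path": "/datasources/1"},
    "response": {"status": 200, "body_fields": {"id": int, "name": str}},
}


def verify_provider(handler, contract):
    """Provider-side check: replay the contract's request against the provider's handler."""
    request = contract["request"]
    status, body = handler(request["method"], request["path"])
    expected = contract["response"]
    if status != expected["status"]:
        return False
    return all(isinstance(body.get(name), kind)
               for name, kind in expected["body_fields"].items())


def consumer_parse(body):
    """Consumer-side code, which can be tested against a stub response built from the contract."""
    return f"{body['id']}: {body['name']}"
```

The provider test only needs the contract and the provider's own handler, and the consumer test only needs the contract and the consumer's own code, so neither side has to start the other service.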

This thesis investigates to what extent CDCT can contribute to the testing of microservice integrations by capturing the advantages, disadvantages, challenges, and guidelines associated with CDCT for microservice systems. For the initial theory building, a structured literature review was conducted first. Subsequently, action research was carried out, in which consumer-driven contract tests were developed for an open-source microservice system. Finally, after the concluding evaluation of the action research, the findings gathered in the structured literature review were compared with the experiences from the action research.

Keywords: Microservices, integration testing, consumer-driven contract tests, pact

PDF: Master Thesis

Reference: Felix Quast. Testing Microservice Integration with Consumer-Driven Contract Tests. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2022.

Final Thesis: Konzept und Implementierung zur Observability für microservicebasierte Anwendungen

Abstract: ‘Microservices’ are a well-known and popular architectural pattern today. Many world-famous tech companies have adopted it. Decoupling and splitting tasks into smaller services brings challenges as well as the resulting advantages. A central drawback in developing these services is that debugging becomes harder and that it is difficult to keep an overview of the application as a whole.

This thesis introduces software tools for monitoring and for aggregating log information. In addition, a combination of programs is selected to develop a concept and to present an exemplary implementation of these tools in an already running open-source project.

Keywords: Microservices, observability, monitoring

PDF: Bachelor Thesis

Reference: Daniel Fabrikantow. Konzept und Implementierung zur Observability für microservicebasierte Anwendungen. Bachelor Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2021.

Job / Final Thesis: Model Compilation to Streaming Backends

We are looking for someone skilled in compiler construction who would like to take on an important special topic, namely making open data usable. A task description for a final thesis follows, but we offer this in any form: student job, final thesis, or doctorate / research assistant position:

Model Compilation to Streaming Backends

The goal of the thesis is to develop a compiler that turns an ETL pipeline model (the “program”) into a configuration for an event streaming framework (the “target architecture”). To start simply, we want to compile SQL to Kafka Streams in such a way that an SQL schema definition configures a Kafka Streams instance so that the Kafka instance can load a CSV file and save it into a PostgreSQL database. If this works well, we will increase complexity: not just a source schema (the “E” in ETL) but also transformation rules (the “T”) and a target schema (the “L”); not just Kafka Streams as a target architecture, but also Spark, Flink, and others.
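A very small sketch of the starting point described above might read a single SQL schema definition and emit a configuration for a CSV source and a PostgreSQL sink. All of it is assumed for illustration: the function name `compile_schema`, the shape of the output dictionary (which does not correspond to real Kafka Streams or Kafka Connect settings), and the restriction to simple column types without parentheses or constraints containing commas.

```python
import re


def compile_schema(sql, csv_path, jdbc_url):
    """Turn one simple CREATE TABLE statement into a hypothetical pipeline configuration."""
    match = re.search(r"CREATE\s+TABLE\s+(\w+)\s*\((.*)\)", sql, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("expected a single CREATE TABLE statement")
    table, body = match.group(1), match.group(2)
    columns = []
    for definition in body.split(","):
        # Only the column name and its type are used; any constraints are ignored.
        name, sql_type = definition.split()[:2]
        columns.append({"name": name, "type": sql_type.upper()})
    return {
        "source": {"format": "csv", "path": csv_path, "columns": [c["name"] for c in columns]},
        "sink": {"format": "postgresql", "url": jdbc_url, "table": table, "columns": columns},
    }
```

For example, `compile_schema("CREATE TABLE stations (id INTEGER, name TEXT);", "stations.csv", "jdbc:postgresql://localhost/ods")` yields a source with the columns `id` and `name` and a sink that targets the table `stations`; the file name and URL are placeholders.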

This thesis is part of the JValue project. The mission of the JValue project is to make open data easy, reliable, and safe to use.

Thesis Description – Model Compilation to Streaming Backend

If you are interested, feel free to contact me directly.

Dirk Riehle

Final Thesis: Value Types in TypeScript for JValue

Abstract: Over the past years, TypeScript has increasingly been gaining popularity due to its nature of providing functionalities to ease the development of scalable and robust applications whilst syntactically being a superset of JavaScript. With the growing complexity of data-driven environments, it is essential for programming languages to cope with value types beyond their primitive data types to capture the semantics of intangible data, such as systems of measurement, thus increasing readability and solidity across the codebase. By creating a test-driven framework in TypeScript, this thesis lays out different methods to efficiently implement value types, discusses their benefits as well as drawbacks, and ensures the reliability of the framework by integrating it into an existing data-driven service.

Keywords: Value types, JValue, TypeScript

PDF: Bachelor Thesis

Reference: Mert Baran. Value Types in TypeScript for JValue. Bachelor Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2021.

Final Thesis: Hierarchical Open Data Source Import for the JValue ODS

Abstract: Open Data has become more popular in the last few years due to its value to society. Governments, institutions, companies, or individuals can make use of Open Data and add to economic growth or extract new knowledge from publicly available data. The Open Data Service (ODS) is a software system developed by the Professorship of Open Source Software that aims to simplify the consumption of Open Data and make it more reliable.

The goal of this thesis is to extend the functionality of the ODS with support for hierarchically structured data sources, in particular, File Transfer Protocol (FTP) based data sources. Due to its simplicity and reliability, FTP is an appropriate solution for providing Open Data. This thesis aims to enable the user to explore and configure FTP data sources by developing a new microservice with a proof-of-concept user interface. As a result, consuming Open Data from FTP data sources is simplified and becomes more flexible.

Keywords: Open data, FTP, JValue ODS, microservices

PDF: Master Thesis

Reference: Benjamin Fischer. Hierarchical Open Data Source Import for the JValue ODS. Master Thesis. Friedrich-Alexander-Universität Erlangen-Nürnberg: 2021.