Data Analysis & NLP

Over the years, I have executed and managed numerous data analysis projects in Python and R. You can find all my projects on GitHub.

Recently, I have been working with LLMs and agentic frameworks. Among other things, I have written a Retriever implementation for the LangChain framework to access Zotero libraries.

As Team Lead for the EPINetz Project, I managed the conceptualization, evaluation and implementation of a novel, unsupervised multi-label classification algorithm. It was deployed on the EPINetz Platform to classify millions of Tweets and news articles into policy areas. It uses large-scale text graphs, random walks, and hand-picked expert data sources for classification. The code is available here.

I also made the method available in a more convenient form, namely as the textgraph package for R. It provides workflows for both unsupervised and seeded cluster extraction from text graphs (or any graph, really), parallelisation functionality and different ways of exploring and visualising the results. It supports both static and dynamic graphs.

I have worked on the data analysis for numerous scientific papers, employing methods such as Network Analysis, Topic Modeling, Heterogeneous Information Networks, and Generalised Additive Regression Models.

As part of the EPINetz Project, I also oversaw the curation and collection of data on German politicians on Twitter. The data set has been published as open source (with an updated version available here).

I also teach workshops for university staff, researchers and students on subjects such as Remote Computing, Best Practices for Scalable Data Analysis, NLP and the use of LLMs.