Building the SOTorrent Dataset
Hi, my name is Sebastian Baltes.
In my research, I empirically analyze software developers' work habits with the goal of identifying requirements for new tools and pointing to possible tool and process improvements. For me, thoroughly analyzing and understanding the state of practice is an essential first step towards improving how software is developed. Too often, decisions in software projects are still opinion-based rather than data-informed. My long-term goal is to bridge the gap between empirical research and practice, both by studying relevant phenomena and by communicating the results back to practitioners.
Some of my research projects have already led to recommendations for researchers and practitioners, others to the development of novel tool prototypes. In a recent project, for example, we studied the impact of the COVID-19 pandemic on the productivity and wellbeing of software developers who were forced to work from home, including an assessment of support strategies that organizations can use. I'm also interested in topics around diversity and inclusion: in a recent article, we looked at employability strategies for older software developers as suggested by public online articles. Further, in 2019, I received a Google Faculty Research Award for our project on rewriting software documentation for non-native speakers.
I am a methods pluralist. Most empirical studies I conduct follow a mixed-methods design, combining qualitative and quantitative research methods. I am especially interested in interdisciplinary research involving theories and methods from the social sciences (e.g., grounded theory, social constructionism) and psychology (e.g., theories on expertise development). Moreover, with an increasing number of software companies maintaining open source projects, legal aspects of software development are gaining importance. One legal question I studied is the license status of code snippets on Stack Overflow and developers’ awareness of its implications.
To complement qualitative results derived from interviews, observational studies, or open-ended survey questions, I apply data-mining techniques to open source software projects and other data sets. I also maintain the open dataset SOTorrent, which other researchers can use to study the origin, evolution, and usage of Stack Overflow content. That dataset was selected as the official mining challenge of MSR 2019. I am also interested in information visualisation and visual analytics, exploring how interactive visualisations can support humans in analysing data. I regularly develop custom visualisations that we use in different research projects to explore data or to derive patterns. I support open science and open data practices: I try to publish data, software, analysis scripts, and paper preprints whenever possible. Moreover, I argue for an active discussion about research methodology and ethical issues in the software engineering research community.
Usage, Attribution, and Implications