Research

This page provides an overview of my research projects, including datasets I maintain, empirical studies I conducted, and tools I developed in cooperation with colleagues and students. See also my research statement.

Understanding Software Documentation Ecosystems (since 2018)

Condor Stint

Summary

Software engineering is knowledge-intensive and requires software developers to continually search for knowledge, often on community question answering platforms such as Stack Overflow. Such information sharing platforms do not exist in isolation, and part of the evidence that they exist in a broader software documentation ecosystem is the common presence of hyperlinks to other documentation resources found in forum posts. With the goal of helping to improve the information diffusion between Stack Overflow and other documentation resources, we conducted a study to answer the question of how and why documentation is referenced in Stack Overflow threads. Besides empirically studying the purpose and context of links in Stack Overflow posts, we point to potential tool support to enhance the information diffusion between Stack Overflow and other documentation resources.

Publications:

The SOTorrent Dataset (since 2017)

Summary

Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed or text surrounding a code snippet is edited for clarity. To be able to analyze how Stack Overflow posts evolve, we built SOTorrent, an open dataset based on the official Stack Overflow data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and individual text or code blocks. It connects Stack Overflow posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to Stack Overflow posts. More information about SOTorrent and the results of a first analysis using the dataset can be found in this blog post and in the corresponding research paper. The source code of the tools used to build and analyze SOTorrent is available on GitHub.

More Information:

Publications:

Towards a Theory of Software Development Expertise (since 2015)

Conceptual Theory

Summary

Software development includes diverse tasks such as implementing new features, analyzing requirements, and fixing bugs. Being an expert in those tasks requires a certain set of skills, knowledge, and experience. Several studies investigated individual aspects of software development expertise, but what is missing is a comprehensive theory. We present a first conceptual theory of software development expertise that is grounded in data from a mixed-methods survey with 335 software developers (see supplementary material) and in literature on expertise and expert performance. Our theory currently focuses on programming, but already provides valuable insights for researchers, developers, and employers. The theory describes important properties of software development expertise and which factors foster or hinder its formation, including how developers’ performance may decline over time. Moreover, our quantitative results show that developers’ expertise self-assessments are context-dependent and that experience is not necessarily related to expertise.

More Information:

Publications:

License Implications of Online Code Snippet Reuse (since 2016)

Abstract

Stack Overflow is the largest Q&A website for developers, providing a huge number of copyable code snippets. Using these snippets raises various maintenance and legal issues. The Stack Overflow license (CC BY-SA) requires attribution, that is, referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on Stack Overflow’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from Stack Overflow without proper attribution. We conducted a large-scale empirical study to analyze the usage and attribution of non-trivial code snippets from Stack Overflow answers in public GitHub projects (see supplementary material). For more information about the research design and results, please consult the blog post and the publications linked below.

More Information:

Publications:

Sampling in Software Engineering Research (since 2015)

Cone of Sampling

Summary

Representative sampling appears rare in software engineering research. Not all studies need representative samples, but a general lack of representative sampling undermines a scientific field. We therefore investigated the state of sampling in recent, high-quality software engineering research. The key findings are: (1) random sampling is rare; (2) sophisticated sampling strategies are very rare; (3) sampling, representativeness, and randomness do not appear well-understood. To address these problems, we synthesized existing knowledge of sampling into a succinct primer and proposed extensive guidelines for improving the conduct, presentation, and evaluation of sampling in software engineering research. We further argue that, while researchers should strive for more representative samples, disparaging non-probability sampling is generally capricious and particularly misguided for predominantly qualitative research. In our earlier paper, we reported on successful sampling strategies to recruit software developers for online surveys as well as potential ethical issues with some strategies that are being used.
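
To make the distinction between sampling strategies more concrete, here is a minimal sketch contrasting simple random sampling with proportional stratified sampling. The function names, the project list, and the choice of programming language as the stratum are hypothetical and serve only as illustration; they are not taken from the papers.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, seed=0):
    """Draw k items uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(population, k)


def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Take at least one item per stratum so that small strata stay visible.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical sampling frame: 80 Java projects and 20 Rust projects.
projects = [("java", i) for i in range(80)] + [("rust", i) for i in range(20)]

random_draw = simple_random_sample(projects, 10)
stratified_draw = stratified_sample(projects, key=lambda p: p[0], fraction=0.1)
```

With these inputs, the stratified draw always contains eight Java and two Rust projects, whereas the share of Rust projects in the simple random draw varies from seed to seed.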

Publications:

Sketches and Diagrams in Practice (2014-2017)

sketch3 sketch1 sketch2

Summary

Sketches and diagrams play an important role in the daily work of software developers. In our paper “Sketches and Diagrams in Practice” we present the results of our research on the usage of sketches and diagrams in software engineering practice. We focused especially on their relation to the core elements of a software project, the source code artifacts. Furthermore, we wanted to assess how helpful sketches are for understanding the related source code. We intended to find out if, how, and why sketches and diagrams are archived and are thereby available for future use. Software is created with and for a wide range of stakeholders. Since sketches are often a means for communicating between these stakeholders, we were not only interested in sketches and diagrams created by software developers, but by all software practitioners, including testers, architects, project managers, as well as researchers and consultants. In a survey with 394 software ‘practitioners’ (see supplementary material), we asked questions about the last sketch or diagram that they had created. Contrary to our expectations and previous work, the majority of sketches and diagrams contained at least some UML elements. However, most of them were informal. The most common purposes for creating sketches and diagrams were designing, explaining, and understanding, but analyzing requirements was also named often. More than half of the sketches and diagrams were created on analog media like paper or whiteboards and have been revised after creation. Most of them were used for more than a week and were archived. About half of the sketches were rated as helpful to understand the related source code artifact(s) in the future. 
Our study complements a number of existing studies on the use of sketches and diagrams in software development, which analyzed the above aspects only in parts and often focused on an academic environment, a single company, open source projects, or were limited to a small group of participants.

Publications:

Performance Debugging (2014-2015)

Summary

Performance bugs can lead to severe issues regarding computation efficiency, power consumption, and user experience. Locating these bugs is a difficult task because developers have to judge for every costly operation whether runtime is consumed necessarily or unnecessarily. We wanted to investigate how developers, when locating performance bugs, navigate through the code, understand the program, and communicate the detected issues. To this end, we performed a qualitative user study observing twelve developers trying to fix documented performance bugs in two open source projects (see supplementary material). The developers worked with a profiling and analysis tool that visually depicts runtime information in a list representation and embedded into the source code view. We identified typical navigation strategies developers used for pinpointing the bug, for instance, following method calls based on runtime consumption. The integration of visualization and code helped developers to understand the bug. Sketches visualizing data structures and algorithms turned out to be valuable for externalizing and communicating the comprehension process for complex bugs. Fixing a performance bug is a code comprehension and navigation problem. Flexible navigation features based on executed methods and a close integration of source code and performance information support the process.

Publications:

Other Projects and Tools

Influence of Continuous Integration on Commit Activity

A core goal of Continuous Integration (CI) is to make small incremental changes to software projects, which are integrated frequently into a mainline repository or branch. This paper presents an empirical study that investigates whether developers adjust their commit activity towards this goal after projects start using CI. We analyzed the commit and merge activity in 93 GitHub projects that introduced the hosted CI system Travis CI after having been developed for at least one year without CI (see supplementary material and dataset). In our analysis, we found only one non-negligible effect, an increased merge ratio, meaning that there were more merging commits in relation to all commits after the projects started using Travis CI. This effect has also been reported in related work. However, we observed the same effect in a random sample of 60 GitHub projects not using CI. Thus, it is unlikely that the effect is caused by the introduction of CI alone. We conclude that: (1) in our sample of projects, the introduction of CI did not lead to major changes in developers’ commit activity, and (2) it is important to compare the commit activity to a baseline before attributing an effect to a treatment that may not be the cause of the observed effect. The git-log-extractor and git-log-parser tools we developed for this project are available on GitHub. The corresponding paper was published at SWAN 2018 and won the best paper award.
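
As a rough illustration of the merge ratio described above, the following sketch computes the share of merge commits among all commits in a time window. It assumes that a commit counts as a merge when it has two or more parents, and the parent counts are made-up example data; this is not the implementation used in git-log-parser.

```python
def merge_ratio(parent_counts):
    """Return the fraction of commits that are merges.

    parent_counts holds the number of parents of each commit; a commit
    with two or more parents is treated as a merge (an assumption made
    for this sketch).
    """
    if not parent_counts:
        return 0.0
    merges = sum(1 for n in parent_counts if n >= 2)
    return merges / len(parent_counts)


# Hypothetical parent counts for the commits before and after introducing CI.
before_ci = [1, 1, 2, 1, 1, 1, 1, 2]
after_ci = [1, 2, 1, 2, 1, 2, 1, 1]

ratio_before = merge_ratio(before_ci)  # 2 merges out of 8 commits
ratio_after = merge_ratio(after_ci)    # 3 merges out of 8 commits
```

Computing the same two ratios for a comparison group of projects that never introduced CI is what allows an increase to be checked against a baseline.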

Constructing Urban Tourism Space Digitally

In this interdisciplinary research project, we argue that urban tourism space is (re-)produced digitally and collaboratively on online platforms and empirically analyze how two Berlin neighborhoods are digitally constructed by Airbnb hosts in their listing descriptions. More information can be found in our blog post and the corresponding publication presented at CSCW 2018.

SketchLink

Sketches and diagrams play an important role in the daily work of software developers. If they are archived, they are often detached from the source code they document, because there is no adequate tool support to assist developers in capturing, archiving, and retrieving sketches related to certain source code artifacts. We implemented SketchLink to increase the value of sketches and diagrams created during software development by supporting developers in these tasks. Our prototype implementation provides a web application that employs the camera of smartphones and tablets to capture analog sketches, but can also be used on desktop computers to upload, for instance, computer-generated diagrams. We also implemented a plugin for the IntelliJ Java IDE that embeds the links in Javadoc comments and visualizes them in situ in the source code editor as graphical icons. Besides being a useful software documentation tool, SketchLink also enables developers to navigate through their source code using the linked sketches and diagrams. More information can be found in our demo video, the supplementary material, and the corresponding publication presented at FSE 2014.

LivelySketches

Sketching is an important activity for understanding, designing, and communicating different aspects of software systems such as their requirements or architecture. Often, sketches start on paper or whiteboards, are revised, and may evolve into a digital version. Users may then print a revised sketch, change it on paper, and digitize it again. Existing tools focus on a paperless workflow, i.e., archiving analog documents, or rely on special hardware—they do not focus on integrating digital versions into the analog-focused workflow that many users follow. In this paper, we present the conceptual design and a prototype of LivelySketches, a tool that supports the “round-trip” lifecycle of sketches from analog to digital and back. The proposed workflow includes capturing both analog and digital sketches as well as relevant context information. In addition, users can link sketches to other related sketches or documents. They may access the linked artifacts and captured information using digital as well as augmented analog versions of the sketches. We further present results from a formative user study (see supplementary material) with four students and outline possible directions for future work. The corresponding paper was published at VISSOFT 2017.

VisualCues

Humans are very efficient in processing and remembering visual information. That is why metaphors and visual representations are important in education. Because of their high visual expressiveness, presentation tools like Microsoft PowerPoint are very popular for teaching in classrooms. However, representing source code with such tools is tedious and cumbersome, while alternatives like source code editors lack visual expression. Moreover, modifying prepared content, e.g., while responding to questions, is not well supported. In this paper, we introduce VisualCues, an approach with the goal of combining the flexibility of source code editors with the visual expressiveness of classical slide-based presentation tools. A key concept of VisualCues is linking visual artifacts to specific elements of source code. The main advantage is that when the underlying source code changes, the positions of linked visual artifacts change with it. We implemented a first prototype and evaluated it in two undergraduate computer science courses (see supplementary material). The corresponding paper was presented at VL/HCC 2015; see also our demo video.

RegViz

Online tool that visually augments regular expressions to simplify their understanding and debugging.