Stack Overflow (SO) is the largest Q&A website for developers, providing a large number of copyable code snippets. Using these snippets raises various maintenance and legal issues. The SO license requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We conducted a large-scale empirical study to analyze the usage and attribution of non-trivial code snippets from SO answers in public GitHub projects. Below, you’ll find supplementary material for the extended abstract published in the ICSE 2017 companion volume and for the corresponding full paper. For information about the research design and results, please see the posters and preprints listed below the supplementary material.

Supplementary Material (Paper under review)

Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects — Supplementary Material.
Sebastian Baltes.

The dataset is licensed under the Creative Commons Attribution Share-Alike 4.0 International License.

Supplementary Material (ICSE Extended Abstract)

  • Preliminary Study: We provide the survey codebook, the raw response data, and the R script used for analysis: ZIP

  • Programming Language Ranking: We provide instructions to recreate the ranking as well as the ranking itself: ZIP

  • Code Clone Analysis: We provide all scripts and data for the code clone analysis in one package: ZIP

  • Quantitative Analysis I-III: We provide all scripts and data for all quantitative analyses in one package: Folder

  • Qualitative Analysis: We provide the raw data and our coding of the Stack Overflow references in one package: ZIP

  • Other Sources: Stack Exchange data dump, GHTorrent data dump, GitHub BigQuery data set, and GHTorrent BigQuery data set.


The scripts and data we created as well as the data from the surveys are licensed under CC BY 4.0.

For data retrieved from the BigQuery GitHub data set, see the GitHub Terms of Service. All content retrieved from Stack Overflow, including content from the BigQuery Stack Overflow data set, is licensed under CC BY-SA 3.0 (see also the Stack Exchange Network Terms of Service). GHTorrent is distributed under a dual licensing scheme (see GHTorrent FAQ and CCPlus).


Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects.
Sebastian Baltes and Stephan Diehl.
Currently under review

Talks and Posters

Attribution Required: Stack Overflow Code Snippets in GitHub Projects (Poster & Extended Abstract).
Sebastian Baltes, Richard Kiefer, and Stephan Diehl.
Proceedings of the 39th International Conference on Software Engineering Companion (ICSE 2017)
Preprint arXiv Poster

Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects (Poster).
Sebastian Baltes and Stephan Diehl.
Winter School on Big Software on the Run (BSR 2016)

The documents distributed on this website have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright and the provided license. Works that are not CC-licensed may not be reposted without the explicit permission of the copyright holder.