The SOTorrent Dataset
- There won’t be any updates to the SOTorrent dataset in the foreseeable future. I’m still working on moving the SOTorrent pipeline entirely to the cloud to make it easier for others to create the dataset themselves, but I can only work on this sporadically.
- If you use this dataset in your work, please cite our MSR 2018 paper.
- If you have problems with the dataset or want to propose ideas for improvements, please create an issue here.
- If you want to use BigQuery, please follow this tutorial first.
- If you use BigQuery and get an “Access Denied” error, check the information provided in this Stack Overflow thread.
- For updates, follow me on Twitter.
Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed or text surrounding a code snippet is edited for clarity. To be able to analyze how Stack Overflow posts evolve, we built SOTorrent, an open dataset based on the official Stack Overflow data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and individual text or code blocks. It connects Stack Overflow posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to Stack Overflow posts. More information about SOTorrent and the results of a first analysis using the dataset can be found in this blog post and in the corresponding research paper. The source code of the tools used to build and analyze SOTorrent is available on GitHub.
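To make the URL aggregation described above more concrete, here is a minimal sketch of pulling links out of the Markdown content of a text block or comment. It is not the extraction logic used to build SOTorrent; the regular expression `URL_PATTERN`, the helper `extract_urls`, and the sample text are assumptions made purely for illustration.

```python
import re
from urllib.parse import urlparse

# Simplified pattern (an assumption, not SOTorrent's actual expression):
# match http(s) URLs up to whitespace or a typical closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_urls(text):
    """Return (url, domain) pairs for every link found in a text block or comment."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        urls.append((url, urlparse(url).netloc))
    return urls

# Hypothetical text block content in Markdown.
text_block = (
    "See https://docs.python.org/3/library/re.html, "
    "or [this page](https://example.com/a/b)."
)

urls = extract_urls(text_block)
for url, domain in urls:
    print(domain, url)
```

Grouping the extracted links by domain is one simple way to see which external platforms a set of posts refers to.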
SOTorrent contains all tables from the official Stack Overflow data dump (see schema documentation here). Additionally, the dataset provides the following tables:
PostVersion: Version history on the level of whole Stack Overflow posts (extracted from table PostHistory).
PostBlockVersion: Version history on the level of post blocks, which are either text (1) or code (2) blocks (extracted from table PostHistory).
PostBlockDiff: Line-based difference between post block versions (extracted from table PostBlockVersion).
TitleVersion: Version history of question titles (extracted from table PostHistory).
PostVersionUrl: URLs extracted from text block versions (see table PostBlockVersion).
CommentUrl: URLs extracted from comments (see table Comments).
PostReferenceGH: URLs extracted from GitHub repositories pointing to Stack Overflow questions, answers, or comments. Information about analyzed GitHub projects can be found here. The meaning of column Copies is described here. Contains only links with valid CommentId (see also this issue).
GHMatches: Matched source code lines used to build table PostReferenceGH.
StackSnippetVersion: Separate table with versions of executable Stack Snippets.
PostViews: Question view count for all Stack Overflow data dump releases since 2016-09-12.
PostTags: Maps tags to posts (links tables Posts and Tags from official Stack Overflow data dump).
PostBlockDiffOperation: Diff operations provided in table PostBlockDiff.
PostBlockType: Available post block types (text or code).
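As an illustration of what a line-based difference between two versions of a post block looks like, the following sketch compares two versions of a code block using Python's standard difflib module. It does not query the dataset and does not reproduce the diff tool used to build PostBlockDiff; the in-memory `versions` list, its field names, and the operation labels are hypothetical stand-ins. Only the block type ids (text = 1, code = 2) are taken from the table descriptions above.

```python
import difflib

# Block type ids as stated in the table list above: text = 1, code = 2.
TEXT_BLOCK, CODE_BLOCK = 1, 2

# Hypothetical stand-in for two versions of the same code block;
# the field names are illustrative, not the real column names.
versions = [
    {"version": 1, "type": CODE_BLOCK, "content": "for i in range(10):\n    print i"},
    {"version": 2, "type": CODE_BLOCK, "content": "for i in range(10):\n    print(i)"},
]

# Map difflib's two-character line prefixes to readable operation names.
OPERATIONS = {"  ": "equal", "- ": "delete", "+ ": "insert"}

def line_diff(old, new):
    """Return (operation, line) pairs for a line-based diff of two block contents."""
    ops = []
    for entry in difflib.ndiff(old.splitlines(), new.splitlines()):
        tag, line = entry[:2], entry[2:]
        if tag == "? ":  # ndiff hint lines carry no content of their own
            continue
        ops.append((OPERATIONS[tag], line))
    return ops

diff = line_diff(versions[0]["content"], versions[1]["content"])
for op, line in diff:
    print(f"{op:>6}: {line}")
```

In this toy example, the first line is unchanged and the second line is replaced, which shows up as one deleted line and one inserted line.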
In the following database schema (based on version 2018-12-09), the tables from the official Stack Overflow data dump are marked gray, the additional tables are marked blue. Please note that not all foreign key constraints are shown.
These are the tables from the official Stack Overflow data dump:
Badges: Badges awarded to Stack Overflow users.
Comments: Comments on Stack Overflow posts.
PostHistory: History of Stack Overflow posts (Markdown content).
PostLinks: Related Stack Overflow posts.
Posts: HTML content of most recent version of Stack Overflow posts.
Tags: Question tags.
Users: Data on Stack Overflow users.
Votes: Votes for Stack Overflow posts.
PostHistoryType: Description of edit events in table PostHistory. Table is not part of the official Stack Overflow data dump, but was derived from the corresponding documentation.
PostType: Available post types. Table is not part of the official Stack Overflow data dump, but was derived from the corresponding documentation.
VoteType: Available vote types. Table is not part of the official Stack Overflow data dump, but was derived from the corresponding documentation.
SOTorrent Dataset Versions
The changelog for each dataset version is available here. Please note that in datasets prior to version 2020-08-31, newline characters were escaped as
in the BigQuery versions of the dataset. Only the most recent releases are available on BigQuery, but all dataset versions are archived on Zenodo.
Based on Stack Overflow data dump 2020-12-08:
Based on Stack Overflow data dump 2020-09-08:
- 2020-11-16 (Zenodo)
Based on Stack Overflow data dump 2020-06-02:
- 2020-08-31 (Zenodo)
Based on Stack Overflow data dump 2020-03-02:
- 2020-03-15 (Zenodo)
Based on Stack Overflow data dump 2019-12-02:
- 2019-12-25 (Zenodo)
Based on Stack Overflow data dump 2019-09-04:
- 2019-09-23 (Zenodo)
Based on Stack Overflow data dump 2019-06-03:
- 2019-06-21 (Zenodo)
Based on Stack Overflow data dump 2019-03-04:
- 2019-03-17 (Zenodo)
Based on Stack Overflow data dump 2018-12-02:
Based on Stack Overflow data dump 2018-09-05:
MSR Mining Challenge 2019
Based on Stack Overflow data dump 2018-06-05:
MSR Mining Challenge 2019
- 2018-07-31 (Zenodo)
- 2018-06-17 (Zenodo)
Based on Stack Overflow data dump 2018-03-13:
- 2018-05-04 (Zenodo)
- 2018-03-28 (Zenodo)
Based on Stack Overflow data dump 2017-12-01:
- 2018-02-16 (Zenodo)
- 2017-12-24 (Zenodo)
The following tables are identical to the corresponding XML files in the official Stack Exchange data dump, which is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0):
GHCommits were retrieved from the Google BigQuery GitHub data set, for which the GitHub Terms of Service apply.
The following tables are based on the tables from the official Stack Exchange data dump listed above. We license them under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0):
SOTorrent: Studying the Origin, Evolution, and Usage of Stack Overflow Code Snippets.
Sebastian Baltes, Christoph Treude, and Stephan Diehl.
Proceedings of the 16th International Conference on Mining Software Repositories (MSR 2019) (to appear).
Acceptance rate: 33% (1/3).
Selected as MSR Mining Challenge 2019.
Preprint • arXiv • Slides
SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts.
Sebastian Baltes, Lorik Dumani, Christoph Treude, and Stephan Diehl.
Proceedings of the 15th International Conference on Mining Software Repositories (MSR 2018).
Acceptance rate: 33% (37/113).
Invited to an EMSE journal special issue.
Preprint • arXiv • Slides • BibTeX