The SOTorrent Dataset

Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed or the text surrounding a code snippet is edited for clarity. To analyze how Stack Overflow posts evolve, we built SOTorrent, an open dataset based on the official Stack Overflow data dump. SOTorrent provides access to the version history of Stack Overflow content at two levels: whole posts and individual text or code blocks. It also connects Stack Overflow posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to Stack Overflow posts. More information about SOTorrent and the results of a first analysis using the dataset can be found in this blog post and in the corresponding research paper.
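
To make the two levels of granularity more concrete, the following minimal Python sketch models a post's version history as a list of versions, each made up of text and code blocks, and then gathers URLs from the text blocks only. All names here (`Block`, `PostVersion`, `collect_urls`, `URL_PATTERN`) and the sample data are hypothetical and chosen for illustration; they do not reflect SOTorrent's actual schema or its URL extraction rules.

```python
import re
from dataclasses import dataclass, field

# Naive URL matcher, an assumption for this sketch only:
# it takes everything up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Block:
    kind: str     # "text" or "code" (hypothetical labels)
    content: str


@dataclass
class PostVersion:
    post_id: int
    version: int
    blocks: list = field(default_factory=list)


def collect_urls(versions):
    """Map each (post_id, version) pair to the URLs found in its text blocks."""
    urls = {}
    for v in versions:
        found = []
        for block in v.blocks:
            if block.kind == "text":  # code blocks are skipped on purpose
                found.extend(URL_PATTERN.findall(block.content))
        urls[(v.post_id, v.version)] = found
    return urls


# Made-up history of one post with two versions.
history = [
    PostVersion(1, 1, [
        Block("text", "See https://example.com/docs for details."),
        Block("code", "fetch('https://example.org/api')"),
    ]),
    PostVersion(1, 2, [
        Block("text", "See https://example.com/docs and https://example.net/faq"),
        Block("code", "fetch('https://example.org/api')"),
    ]),
]

for key, links in collect_urls(history).items():
    print(key, links)
```

Keeping blocks typed lets an analysis follow how the text and the code of a post change independently from one version to the next, which is the kind of question the block-level history is meant to support.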

Dataset versions

The changelog for each version can be found on GitHub.

2018-03-28

2018-02-16

2017-12-24