Sampling Papers Using DBLP
In our ESEM 2016 paper, we reported on methodological and ethical issues when sampling software developers. This time, I want to approach the topic of sampling from a different perspective. For a research project, we wanted to draw a random sample from all full papers published in certain software engineering journals and conference proceedings in a certain time frame. To be able to draw such a sample of papers, we needed a corresponding sampling frame. Instead of accessing each conference proceedings volume and journal issue manually, we decided to use the curated data that DBLP provides.
We first implemented an approach using DBLP’s API, but due to a bug that affected the retrieval for certain conference identifiers (e.g. ICSE 2015), we switched to a solution that uses DBLP’s HTML pages. We implemented a web scraper that takes a list of venue identifiers as input and then retrieves all papers published in those venues, including the following metadata:
- paper title
- authors
- heading (corresponds to journal issue or conference session name)
- page range
- paper length
- link to electronic edition of paper
The CSV file that the tool creates has the following structure:
venue | year | identifier | heading | title | authors | pages | length | electronic_edition |
---|---|---|---|---|---|---|---|---|
ICSE | 2014 | conf/icse/icse2014 | Perspectives on Software Engineering | Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development? | Emerson R. Murphy-Hill; Thomas Zimmermann; Nachiappan Nagappan | 1-11 | 11 | https://doi.org/10.1145/2568225.2568226 |
… | … | … | … | … | … | … | … | … |
ICSE | 2014 | conf/icse/icse2014 | Modeling | TradeMaker: automated dynamic analysis of synthesized tradespaces. | Hamid Bagheri; Chong Tang; Kevin J. Sullivan | 106-116 | 11 | https://doi.org/10.1145/2568225.2568291 |
… | … | … | … | … | … | … | … | … |
TOSEM | 2018 | journals/tosem/tosem27 | Volume 27, Number 4, November 2018 | Variability-Aware Static Analysis at Scale: An Empirical Study. | Alexander von Rhein; Jörg Liebig; Andreas Janker; Christian Kästner; Sven Apel | 18:1-18:33 | 33 | https://dl.acm.org/citation.cfm?id=3280986 |
The property heading is particularly relevant, because it allows us to later exclude papers published, for example, in NIER or tool demo tracks.
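As a rough illustration of how the heading could support this step, the following sketch flags rows whose heading hints at a non-full-paper track so that they can be checked by hand. The keyword list and the function name are assumptions made for this example; they are not part of the tool, and real session names vary between venues and years.

```python
# Hypothetical keyword list; adapt it to the venues in your sampling frame.
NON_FULL_TRACK_KEYWORDS = ("nier", "new ideas", "demo", "keynote", "poster")


def is_candidate_for_exclusion(heading: str) -> bool:
    """Return True if the heading suggests a track without full papers."""
    normalized = heading.lower()
    return any(keyword in normalized for keyword in NON_FULL_TRACK_KEYWORDS)


if __name__ == "__main__":
    for heading in ("Perspectives on Software Engineering", "Tool Demonstrations"):
        print(heading, "->", is_candidate_for_exclusion(heading))
```

A substring match like this only produces candidates; the final decision about whether a paper is a full paper still needs a manual look.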
Moreover, the property paper length, which is derived from the page range, enables us to remove, for example, keynote descriptions or extended abstracts of journal-first papers.
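To make the derivation concrete, here is a minimal sketch of how a length could be computed from page ranges such as the ones in the table above. It assumes that a prefix like `18:` is an article number that can be ignored; the function name is hypothetical, and the actual tool may handle these cases differently.

```python
from typing import Optional


def paper_length(pages: str) -> Optional[int]:
    """Derive the number of pages from a page range string.

    Handles plain ranges ("106-116"), ranges with an article-number
    prefix ("18:1-18:33"), and single pages ("5"). Returns None if the
    string cannot be interpreted.
    """
    parts = pages.strip().split("-")
    if len(parts) > 2:
        return None
    try:
        # Drop an optional "article:" prefix and keep the page number.
        first = int(parts[0].split(":")[-1])
        last = int(parts[-1].split(":")[-1])
    except ValueError:
        return None
    if last < first:
        return None
    return last - first + 1


if __name__ == "__main__":
    for pages in ("1-11", "106-116", "18:1-18:33"):
        print(pages, "->", paper_length(pages))
```

Returning `None` instead of guessing keeps unparsable page ranges visible, so those rows can be inspected manually.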
Automatically filtering papers based on their length is difficult, because some journal papers are relatively short, while some keynote papers are relatively long.
After manually removing non-full papers from the list, we have a sampling frame that we can use to draw a stratified sample, for example by randomly selecting five papers for each venue and year. We tested our tool with ICSE, FSE, TSE, and TOSEM for the time frame 2014 to 2018.
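The stratified sampling step could look roughly like the sketch below, which groups papers by venue and year and draws up to five papers from each group. It assumes each paper is a dictionary with `venue` and `year` keys, mirroring the CSV columns; the function name, the fixed seed, and the in-memory representation are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_sample(papers, per_stratum=5, seed=42):
    """Randomly select up to `per_stratum` papers for each (venue, year)."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    strata = defaultdict(list)
    for paper in papers:
        strata[(paper["venue"], paper["year"])].append(paper)

    sample = []
    for key in sorted(strata):
        group = strata[key]
        # A stratum with fewer papers than requested is taken completely.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


if __name__ == "__main__":
    frame = [
        {"venue": venue, "year": year, "title": f"{venue} {year} paper {i}"}
        for venue in ("ICSE", "TOSEM")
        for year in (2014, 2015)
        for i in range(8)
    ]
    print(len(stratified_sample(frame)))
```

Seeding the random number generator is a design choice that makes the drawn sample repeatable, which helps when the sampling procedure has to be documented.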
More Information
The dblp-retriever tool is available on GitHub.