Sampling Papers Using DBLP


In our ESEM 2016 paper, we reported on methodological and ethical issues when sampling software developers. This time, I want to approach the topic of sampling from a different perspective. For a research project, we wanted to draw a random sample from all full papers published in certain software engineering journals and conference proceedings within a given time frame. To be able to draw such a sample, we needed a corresponding sampling frame, that is, a list of all papers from which the sample can be drawn. Instead of going through each proceedings volume and journal issue manually, we decided to use the curated data that DBLP provides.

We first implemented an approach using DBLP’s API, but due to a bug that affected the retrieval for certain conference identifiers (e.g. ICSE 2015), we switched to a solution that uses DBLP’s HTML pages. We implemented a web scraper that takes a list of venue identifiers as input and then retrieves all papers published in those venues, including the following metadata:

  • paper title
  • authors
  • heading (corresponds to journal issue or conference session name)
  • page range
  • paper length
  • link to electronic edition of paper

The CSV file that the tool creates has the following structure:

| venue | year | identifier | heading | title | authors | pages | length | electronic_edition |
|---|---|---|---|---|---|---|---|---|
| ICSE | 2014 | conf/icse/icse2014 | Perspectives on Software Engineering | Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development? | Emerson R. Murphy-Hill; Thomas Zimmermann; Nachiappan Nagappan | 1-11 | 11 | https://doi.org/10.1145/2568225.2568226 |
| ICSE | 2014 | conf/icse/icse2014 | Modeling | TradeMaker: automated dynamic analysis of synthesized tradespaces. | Hamid Bagheri; Chong Tang; Kevin J. Sullivan | 106-116 | 11 | https://doi.org/10.1145/2568225.2568291 |
| TOSEM | 2018 | journals/tosem/tosem27 | Volume 27, Number 4, November 2018 | Variability-Aware Static Analysis at Scale: An Empirical Study. | Alexander von Rhein; Jörg Liebig; Andreas Janker; Christian Kästner; Sven Apel | 18:1-18:33 | 33 | https://dl.acm.org/citation.cfm?id=3280986 |
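To work with such a file programmatically, the records can be written and read with Python's standard csv module. The following is only a minimal sketch: the column names are taken from the header row above, but the function names, the comma delimiter, and the quoting of all fields are assumptions for illustration and may differ from what dblp-retriever actually does.

```python
import csv
import io

# Column names as they appear in the header row of the exported file.
COLUMNS = ["venue", "year", "identifier", "heading", "title",
           "authors", "pages", "length", "electronic_edition"]


def write_papers(rows, out):
    """Write paper records (dicts keyed by COLUMNS) to a file-like object.

    Assumption: comma-separated with every field quoted, so that titles
    containing commas survive a round trip.
    """
    writer = csv.DictWriter(out, fieldnames=COLUMNS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


def read_papers(src):
    """Read paper records back as a list of dicts (all values are strings)."""
    return list(csv.DictReader(src))


if __name__ == "__main__":
    paper = {
        "venue": "ICSE",
        "year": "2014",
        "identifier": "conf/icse/icse2014",
        "heading": "Modeling",
        "title": "TradeMaker: automated dynamic analysis of synthesized tradespaces.",
        "authors": "Hamid Bagheri; Chong Tang; Kevin J. Sullivan",
        "pages": "106-116",
        "length": "11",
        "electronic_edition": "https://doi.org/10.1145/2568225.2568291",
    }
    buffer = io.StringIO(newline="")
    write_papers([paper], buffer)
    print(buffer.getvalue())
```

Quoting every field is a deliberate choice here, because paper titles and author lists frequently contain commas and semicolons.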

The heading property is particularly relevant because it allows us to later exclude papers published in, for example, NIER or tool demo tracks. Moreover, the paper length property, which is derived from the page range, helps us identify and remove, for example, keynote descriptions or extended abstracts of journal-first papers. Filtering papers automatically based on their length alone is difficult, however, because there are relatively short journal papers but also relatively long keynote papers.
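Deriving the paper length from the page range can be sketched as follows. This is an illustrative assumption about how such a computation might look, not the tool's actual code; the function name paper_length is hypothetical. It handles plain ranges such as 106-116 as well as ranges with an article-number prefix such as 18:1-18:33.

```python
def paper_length(pages):
    """Return the number of pages for a page-range string, or None if unparsable.

    Handles plain ranges ("106-116"), ranges with an article-number
    prefix ("18:1-18:33"), and single pages ("5").
    """
    def page_number(part):
        # Keep only the page after an optional "article:" prefix.
        return int(part.split(":")[-1])

    parts = pages.strip().split("-")
    try:
        if len(parts) == 1:
            page_number(parts[0])  # raises ValueError if this is not a page number
            return 1
        if len(parts) == 2:
            first, last = page_number(parts[0]), page_number(parts[1])
            return last - first + 1 if last >= first else None
    except ValueError:
        return None
    return None


if __name__ == "__main__":
    for pages in ["1-11", "106-116", "18:1-18:33"]:
        print(pages, paper_length(pages))
```

Returning None for anything that cannot be parsed keeps such entries visible for the manual check instead of silently assigning them a length.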

After manually removing non-full papers from the list, we have a sampling frame that we can use to draw a stratified sample, for example by randomly selecting five papers for each venue and year. We tested our tool with ICSE, FSE, TSE, and TOSEM for the time frame 2014 to 2018.
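Drawing such a stratified sample from the cleaned list could look like the sketch below. It assumes the papers are available as dictionaries with venue and year keys (as in the CSV columns); the function name stratified_sample and its parameters are hypothetical, and a fixed seed is used only to make the draw reproducible.

```python
import random
from collections import defaultdict


def stratified_sample(papers, per_stratum=5, seed=None):
    """Randomly select up to `per_stratum` papers for each (venue, year) pair.

    Returns a dict mapping (venue, year) to the list of selected papers.
    """
    rng = random.Random(seed)

    # Group the sampling frame into strata.
    strata = defaultdict(list)
    for paper in papers:
        strata[(paper["venue"], paper["year"])].append(paper)

    # Iterate over the strata in sorted order so that a given seed
    # always produces the same sample.
    sample = {}
    for key in sorted(strata):
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata are taken whole
        sample[key] = rng.sample(group, k)
    return sample


if __name__ == "__main__":
    frame = [{"venue": "ICSE", "year": "2014", "title": f"Paper {i}"} for i in range(20)]
    frame += [{"venue": "TSE", "year": "2015", "title": f"Article {i}"} for i in range(3)]
    for stratum, selected in stratified_sample(frame, per_stratum=5, seed=1).items():
        print(stratum, [p["title"] for p in selected])
```

Sampling without replacement within each stratum ensures that no paper is selected twice, and recording the seed makes it possible to document exactly how the sample was drawn.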

More Information

The dblp-retriever tool is available on GitHub.
