Paper Info
- Paper Name: The Effect of Google Search on Software Security: Unobtrusive Security Interventions via Content Re-ranking
- Conference: CCS '21
- Author List: Felix Fischer, Yannick Stachelscheid, Jens Grossklags
- Link to Paper: here
- Food: Tiramisu
Summary
After analyzing the top Google search results containing StackOverflow answers (in an experiment with 192 participants), the authors found that insecure code can appear among the top search results. The authors propose a search-result re-ranking algorithm based on semi-supervised clustering that boosts secure answers above insecure ones. In a study of 218 developers comparing the re-ranking method against a control, the paper shows that developers who used the re-ranked results produced more secure code than those who did not.
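To make the re-ranking idea concrete, here is a minimal, hypothetical sketch (not the authors' actual implementation) of how results labeled secure could be boosted above insecure ones while otherwise preserving the original ordering. The `SearchResult` type and `is_secure` labels are assumptions for illustration; in the paper the labels come from the authors' classification of the linked answers.

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    url: str
    rank: int          # original search rank (1 = top)
    is_secure: bool    # hypothetical label for the linked answer's code

def rerank(results: list[SearchResult]) -> list[SearchResult]:
    """Boost secure results above insecure ones,
    keeping the original rank order within each group."""
    return sorted(results, key=lambda r: (not r.is_secure, r.rank))

# Example: an insecure top hit drops below the secure answers.
hits = [
    SearchResult("stackoverflow.com/a/111", rank=1, is_secure=False),
    SearchResult("stackoverflow.com/a/222", rank=2, is_secure=True),
    SearchResult("stackoverflow.com/a/333", rank=3, is_secure=True),
]
print([r.url for r in rerank(hits)])
# ['stackoverflow.com/a/222', 'stackoverflow.com/a/333', 'stackoverflow.com/a/111']
```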
| | RevRec | RevExp | OveMer | RevInt | WriQua | EthCon |
|---|---|---|---|---|---|---|
| Review #2A | 3/6 | 2/4 | 2/5 | 2/3 | 2/5 | 2 |
| Review #2B | 4/6 | 2/4 | 3/5 | 2/3 | 3/5 | 2 |
| Review #2C | 4/6 | 2/4 | 2/5 | 2/3 | 3/5 | 2 |
| Review #2D | 3/6 | 2/4 | 2/5 | 2/3 | 2/5 | 2 |
Preliminary Reviews
All of the reviewers admitted to having some familiarity with the topics discussed in the paper, but none qualified themselves as knowledgeable or expert in the area. Every reviewer decided the paper was worth accepting with either major or minor revisions. Though there were ethical concerns, every reviewer concluded that the authors had addressed them appropriately. Concerns arose regarding the merit of the paper, with many reviewers placing it only in the top 50% of submitted papers.
The paper clearly demonstrated a novel problem with a well-executed and replicable experiment. The authors present their contribution distinctly, with a clear positive result favoring their ranking technique over the default Google search rankings. However, most reviewers felt the proposed solution was narrow in its applicability (it covers only insecure Java code) and should be expanded to strengthen its novelty. This narrowness stems from the chosen TUM-Crypto dataset, which is three years old, language specific, and required manual labeling. There was little meaningful exploration of alternatives and their limitations, leaving the authors' claim of minimal impact insufficiently justified. The paper also lacked clarity in how false positives and false negatives were classified. Several reviewers noted that the demographic selection felt haphazard and lacked consideration for potential biases.
The Statistics
This paper applied and leveraged statistics in many places. Most of those representations were informative tables or graphs that added value to the claims. There were several instances where the paper could have used a table to relay important information more effectively than pure description (see Section 5.6). The paper also does not report false-positive or false-negative rates for the TUM-Crypto dataset it is built around. We disagreed with the authors on how p-values are used and presented in Table 2. A p-value should be used for rejecting a null hypothesis at a fixed significance level; it is not the probability that the null hypothesis is true. Hence, presenting a table full of p-values at different significance levels, mixed with other statistical measures, does not provide additional evidence; it only causes confusion.
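As a hedged illustration of this point about p-values (using made-up counts, not the paper's data or its actual tests), the snippet below runs a chi-squared test and compares the p-value against a significance level fixed in advance; the p-value only tells us whether to reject the null hypothesis at that level, not how probable the null hypothesis is.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (NOT the paper's data):
# rows = condition (re-ranked, control), columns = (secure, insecure) submissions
table = [[70, 30],
         [50, 50]]

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # significance level chosen before looking at the data
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis of no association")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")

# Note: p is the probability of observing data at least this extreme *assuming*
# the null hypothesis is true -- it is NOT the probability that the null is true.
```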
Motivation
An important point that was brought up during the discussion and agreed upon by many of the reviewers was that the paper conducted a novel study. However, its novelty made it difficult to determine whether the process followed was the correct one.
The solution proposed in the paper has several threats to validity, some of which were not mentioned in the paper's own Threats to Validity section. The approach used a three-year-old dataset that contained code examples from one particular programming language. This dataset, containing nearly 10k code samples, was generated by deep learning algorithms. The generalizability of this approach is unknown and its scalability is uncertain.
However, if we consider the objective of the paper to be highlighting an important aspect of secure coding practices, it does a good job. The paper clearly shows the impact of result ranking on security-relevant tasks. It may not solve the problem entirely, but it shows that the problem is one that needs to be addressed.
Ethics consideration and IRB
Since the authors had performed two online studies/surveys with software developers, the question of IRB approval was raised during the discussion. The authors do mention in the paper that their institution does not require IRB review for online survey studies. However, they chose to follow a method that had received approval from the Ethics Review Board of a different institution.
This discussion highlighted some important points about how research involving human subjects can be ethically performed when the institution does not have/require an IRB approval mechanism.
Bonus Round: Are We The Baddies?
When reviewing the discussion video for this blog, we noticed that we spent nearly all of our time discussing the flaws and imperfections in the paper. We did not try to appreciate the science and merits presented in it. One of our professors once told us that young reviewers tend to be more negative than senior reviewers, and we think our discussion of this paper was a typical example. Reviewing papers is about learning from them, both good and bad. In future discussions, we should not only discuss a paper's flaws but, more importantly, what its merits are, why it was accepted to a top-tier venue, and what we can learn from it to improve our own research.