mirror of
https://github.com/amyjko/cooperative-software-development
synced 2024-12-25 21:58:15 +01:00
Small revisions to verification chapter.
This commit is contained in:
parent
3b3600a48d
commit
7aaba444b7
1 changed files with 11 additions and 9 deletions
|
@ -27,7 +27,9 @@
|
|||
<h1>Verification</h1>
|
||||
<div class="lead">Andrew J. Ko</div>
|
||||
|
||||
<p>How do you know a program does what you intended? Part of this is being clear about what you intended (by writing <a href="specifications.html">specifications</a>, for example), but your intents, however clear, are not enough: you need evidence that your intents were correctly expressed computationally. To get this evidence, we do <strong>verification</strong>.</p>
|
||||
<p>How do you know a program does what you intended?</p>
|
||||
|
||||
<p>Part of this is being clear about what you intended (by writing <a href="specifications.html">specifications</a>, for example), but your intents, however clear, are not enough: you need evidence that your intents were correctly expressed computationally. To get this evidence, we do <strong>verification</strong>.</p>
|
||||
|
||||
<p>There are many ways to verify code. A reasonable first instinct is to simply run your program. After all, what better way to check whether you expressed your intents then to see with your own eyes what your program does? This is an empirical approach is called <strong>testing</strong>. Some testing is <em>manual</em>, in that a human executes a program and verifies that it does what was intended. Some testing is <em>automated</em>, in that the test is run automatically by a computer. Another way to verify code is to <strong>analyze</strong> it, using logic to verify its correct operation. As with testing, some analysis is <em>manual</em>, since humans do it. We call this manual analysis <em>inspection</em>, whereas other analysis is <em>automated</em>, since computers do it. We call this <em>program analysis</em>. This leads to a nice complementary set of verification technique along two axes:</p>
|
||||
|
||||
|
@ -63,7 +65,7 @@
|
|||
|
||||
<h2>Testing</h2>
|
||||
|
||||
<p>So how do you <em>find</em> defects in a program? Let's start with testing. Testing is generally the easiest kind of verification to do, but as a practice, it has questionable efficacy. Empirical studies of testing find that it <em>is</em> related to fewer defects in the future, but not strongly related, and it's entirely possible that it's not the testing itself that results in fewer defects, but that other activities (such as more careful implementation) result in fewer defects and testing efforts (<a href="#ahmed">Ahmed et al. 2016</a>). At the same time, modern developers don't test as much as they think they do (<a href="#beller">Beller et al. 2015</a>). Moreover, students are often not convinced of the return on investment of automated tests and often opt for laborious manual tests (even though they regret it later) (<a href="#pham">Pham et al. 2014</a>). Testing is therefore in a strange place: it's a widespread activity in industry, but it's not very systematically, and it doesn't seem to help very much when we try to measure its benefits.</p>
|
||||
<p>So how do you <em>find</em> defects in a program? Let's start with testing. Testing is generally the easiest kind of verification to do, but as a practice, it has questionable efficacy. Empirical studies of testing find that it <em>is</em> related to fewer defects in the future, but not strongly related, and it's entirely possible that it's not the testing itself that results in fewer defects, but that other activities (such as more careful implementation) result in fewer defects and testing efforts (<a href="#ahmed">Ahmed et al. 2016</a>). At the same time, modern developers don't test as much as they think they do (<a href="#beller">Beller et al. 2015</a>). Moreover, students are often not convinced of the return on investment of automated tests and often opt for laborious manual tests (even though they regret it later) (<a href="#pham">Pham et al. 2014</a>). Testing is therefore in a strange place: it's a widespread activity in industry, but it's often not executed systematically, and there is some evidence that it doesn't seem to help prevent defects from being released.</p>
|
||||
|
||||
<p>Why is this? One possibility is that <strong>no amount of testing can prove a program correct with respect to its specifications</strong>. Why? It boils down to the same limitations that exist in science: with empiricism, we can provide evidence that a program <em>does</em> have defects, but we can't provide complete evidence that a program <em>doesn't</em> have defects. This is because even simple programs can execute in a infinite number of different ways.</p>
|
||||
|
||||
|
@ -76,16 +78,16 @@ function count(input) {
|
|||
return input;
|
||||
}</pre>
|
||||
|
||||
<p>The function should always return 0, right? How many possible values of <code>input</code> do we have to try manually to verify that it always does? Well, if <code>input</code> is an integer, then there are 2<sup>32</sup> possible integer values. That's not infinite, but that's a lot. But what if <code>input</code> is a string? There are an infinite number of possible strings because they can have any sequence of characters of any length. Now we have to manually test an infinite number of possible inputs. So if we were restricting ourselves to testing, we will never know that the program is correct for all possible inputs. In this case, automatic testing doesn't even help, since there are an infinite number of tests to run.</p>
|
||||
<p>The function should always return 0, right? How many possible values of <code>input</code> do we have to try manually to verify that it always does? Well, if <code>input</code> is an integer, then there are 2<sup>32</sup> possible integer values, because JavaScript uses 32-bits to represent an integer. That's not infinite, but that's a lot. But what if <code>input</code> is a string? There are an infinite number of possible strings because they can have any sequence of characters of any length. Now we have to manually test an infinite number of possible inputs. So if we were restricting ourselves to testing, we will never know that the program is correct for all possible inputs. In this case, automatic testing doesn't even help, since there are an infinite number of tests to run.</p>
|
||||
|
||||
<p>There are some ideas in testing that can improve how well we can find defects. For example, rather than just testing the inputs you can think of, focus on all of the lines of code in your program. If you find a set of tests that can cause all of the lines of code to execute, you have one notion of <strong>test coverage</strong>. Of course, lines of code aren't enough, because an individual line can contain multiple different paths in it (e.g., <code>value ? getResult1() : getResult2()</code>). So another notion of coverage is executing all of the possible <em>control flow paths</em> through the various conditionals in your program. Executing <em>all</em> of the possible paths is hard, of course, because every conditional in your program doubles the number of possible paths (you have 200 if statements in your program? That's 2<sup>200</sup> possible paths, which is more paths than there are <a href="https://en.wikipedia.org/wiki/Observable_universe#Matter_content">atoms in the universe</a>).</p>
|
||||
<p>There are some ideas in testing that can improve how well we can find defects. For example, rather than just testing the inputs you can think of, focus on all of the lines of code in your program. If you find a set of tests that can cause all of the lines of code to execute, you have one notion of <strong>test coverage</strong>. Of course, lines of code aren't enough, because an individual line can contain multiple different paths in it (e.g., <code>value ? getResult1() : getResult2()</code>). So another notion of coverage is executing all of the possible <em>control flow paths</em> through the various conditionals in your program. Executing <em>all</em> of the possible paths is hard, of course, because every conditional in your program doubles the number of possible paths (you have 200 if statements in your program? That's up to 2<sup>200</sup> possible paths, which is more paths than there are <a href="https://en.wikipedia.org/wiki/Observable_universe#Matter_content">atoms in the universe</a>).</p>
|
||||
|
||||
<p>There are many types of testing that are common in software engineering:</p>
|
||||
|
||||
<ul>
|
||||
<li><strong>Unit tests</strong> verify that low-level functions return the correct output. For example, a program that implemented a function for finding the day of the week for a given date might also include unit tests that verify for a large number of dates that the correct day of the week is returned. They're good for ensuring widely used low-level functionality is correct.</li>
|
||||
<li><strong>Integration tests</strong> verify that when all of the functionality of a program is put together into the final product, it behaves according to specifications. Integration tests often operate at the level of user interfaces, clicking buttons, entering text, submitting forms, and verifying that the expected feedback always occurs. Integration tests are good for ensuring that important tasks that users will perform are correct.</li>
|
||||
<li><strong>Regression tests</strong> verify that previously fixed defects do not break again. For example, imagine you find a defect that causes logins to fail; you might write a test that verifies that this cause of login failure does not occur, in case someone breaks the same functionality again, even for a different reason. Regression tests are good for ensuring that you don't break things when you make changes to your application.</li>
|
||||
<li><strong>Regression tests</strong> verify that behavior that previously worked doesn't stop working. For example, imagine you find a defect that causes logins to fail; you might write a test that verifies that this cause of login failure does not occur, in case someone breaks the same functionality again, even for a different reason. Regression tests are good for ensuring that you don't break things when you make changes to your application.</li>
|
||||
</ul>
|
||||
|
||||
<p>Which tests you should write depends on what risks you want to take. Don't care about failures? Don't write any tests. If failures of a particular kind are highly consequential to your team, you should probably write tests that check for those failures. As we noted above, you can't write enough tests to catch all bugs, so deciding which tests to write and maintain is a key challenge.</p>
|
||||
|
@ -94,13 +96,13 @@ function count(input) {
|
|||
|
||||
<p>Now, you might be thinking that it's obvious that the program above is defective for some integers and strings. How did you know? You <em>analyzed</em> the program rather than executing it with specific inputs. For example, when I read (analyzed) the program, I thought:</p>
|
||||
|
||||
<p><em>"if we assume <code>input</code> is an integer, then there are only three types of values to meaningfully consider with respect to the <code>></code> in the loop condition: positive, zero, and negative. Positive numbers will always decrement to 0 and return 0. Zero will return zero. And negative numbers just get returned as is, since they're less then zero, which is wrong with respect to the specification. And in JavaScript, strings are never greater than 0 (as if that even makes sense), so the string is returned, which is wrong."</em></p>
|
||||
<p><em>"if we assume <code>input</code> is an integer, then there are only three types of values to meaningfully consider with respect to the <code>></code> in the loop condition: positive, zero, and negative. Positive numbers will always decrement to 0 and return 0. Zero will return zero. And negative numbers just get returned as is, since they're less then zero, which is wrong with respect to the specification. And in JavaScript, strings are never greater than 0 (let's not worry about whether it even makes sense to be able to compare strings and numbers), so the string is returned, which is wrong."</em></p>
|
||||
|
||||
<p>The above is basically an informal proof, using logic to divide the possible states of <code>input</code> and their effect on the program's behavior. It used <strong>symbolic execution</strong> to verify all possible paths through the function, finding the paths that result in correct and incorrect values. The strategy was an inspection because we did it manually. If we had written a program that read the program to perform this deduction automatically, we would have called it <em>program analysis</em>.</p>
|
||||
<p>The above is basically an informal proof. I used logic to divide the possible states of <code>input</code> and their effect on the program's behavior. I used <strong>symbolic execution</strong> to verify all possible paths through the function, finding the paths that result in correct and incorrect values. The strategy was an inspection because we did it manually. If we had written a <em>program</em> that read the program to perform this proof automatically, we would have called it <em>program analysis</em>.</p>
|
||||
|
||||
<p>The benefits of analysis is that it <em>can</em> demonstrate that a program is correct in all cases. This is because they can handle infinite spaces of possible inputs by mapping those infinite inputs onto a finite space of possible executions. It's not always possible to do this in practice, since many kinds of programs can execute in infinitely possible ways, but it gets us closer.</p>
|
||||
<p>The benefits of analysis is that it <em>can</em> demonstrate that a program is correct in all cases. This is because they can handle infinite spaces of possible inputs by mapping those infinite inputs onto a finite space of possible executions. It's not always possible to do this in practice, since many kinds of programs <em>can</em> execute in infinite ways, but it gets us closer to proving correctness.</p>
|
||||
|
||||
<p>One popular type of automatic program analysis tools is a <strong>static analysis</strong> tool. These tools read programs and identify potential defects using the types of formal proofs like the ones above. They typically result in a set of warnings, each one requiring inspection by a developer to verify, since some of the warnings may be false positives (something the tool thought was a defect, but wasn't). Although static analysis tools can find all kinds of defects, they aren't yet viewed by developers to be that useful because the false positives are often large in number and the way they are presented make them difficult to understand (<a href="#johnson">Johnson et al. 2013</a>).</p>
|
||||
<p>One popular type of automatic program analysis tools is a <strong>static analysis</strong> tool. These tools read programs and identify potential defects using the types of formal proofs like the ones above. They typically result in a set of warnings, each one requiring inspection by a developer to verify, since some of the warnings may be false positives (something the tool thought was a defect, but wasn't). Although static analysis tools can find many kinds of defects, they aren't yet viewed by developers to be that useful because the false positives are often large in number and the way they are presented make them difficult to understand (<a href="#johnson">Johnson et al. 2013</a>). There is one exception to this, and it's a static analysis tool you've likely used: a compiler. Compilers verify the correctness of syntax, grammar, and for statically-typed languages, the correctness of types. As I'm sure you've discovered, compiler errors aren't always the easiest to comprehend, but they do find real defects automatically. The research community is just searching for more advanced ways to check more advanced specifications of program behavior.</p>
|
||||
|
||||
<p>Not all analytical techniques rely entirely on logic. In fact, one of the most popular methods of verification in industry are <strong>code reviews</strong>, also known as <em>inspections</em>. The basic idea of an inspection is to read the program analytically, following the control and data flow inside the code to look for defects. This can be done alone, in groups, and even included as part of process of integrating changes, to verify them before they are committed to a branch. Modern code reviews, while informal, help find defects, stimulate knowledge transfer between developers, increase team awareness, and help identify alternative implementations that can improve quality (<a href="bacchelli">Bacchelli & Bird 2013</a>). One study found that measures of how much a developer knows about an architecture can increase 66% to 150% depending on the project (<a href="#rigby2">Rigby & Bird 2013</a>). That said, not all reviews are created equal: the best ones are thorough and conducted by a reviewer with strong familiarity with the code (<a href="#kononenko">Kononenko et al. 2016</a>); including reviewers that do not know each other or do not know the code can result in longer reviews, especially when run as meetings (<a href="#seaman">Seaman & Basili 1997</a>). Soliciting reviews asynchronously by allowing developers to request reviewers of their peers is generally much more scalable (<a href="#rigby">Rigby & Storey 2011</a>), but this requires developers to be careful about which reviews they invest in.</p>
|
||||
|
||||
|
|
Loading…
Reference in a new issue