The first application I ever wrote was a complete and utter failure.
I was an eager eighth grader, full of wonder and excitement about the infinite possibilities in code, with an insatiable desire to build, build, build. I'd made plenty of little games and widgets for myself, but now was my chance to create something for someone else: my friend and I were making a game and he needed a tool to create pixel art for it. We had no money for fancy Adobe licenses, and so I decided to make a tool.
In designing the app, I made every imaginable software engineering mistake. I didn't talk to him about requirements. I didn't test on his computer before sending the finished app. I certainly didn't conduct any usability tests, performance tests, or acceptance tests. The app I ended up shipping was a pure expression of what I wanted to build, not what he needed to be creative or productive. As a result, it was buggy, slow, confusing, and useless, and blinded by my joy of coding, I had no clue.
Now, ideally my "customer" would have reported any of these problems to me right away, and I would have learned some tough lessons about software engineering. But this customer was my best friend, and also a very nice guy. He wasn't about to trash all of my hard work. Instead, he suffered in silence. He struggled to install, struggled to use, and worst of all struggled to create. He produced some amazing art a few weeks after I gave him the app, but it was only after a few months of progress on our game that I learned he hadn't used my app for a single asset, preferring instead to suffer through Microsoft Paint. My app was too buggy, too slow, and too confusing to be useful. I was devastated.
Why didn't I know it was such a complete failure? *Because I wasn't looking*. I'd ignored the ultimate test suite: _my customer_. I'd learned that the only way to really know whether software requirements are right is by watching how it executes in the world through *monitoring*<turnbull16>.
Of course, this is easier said than done. That's because the (ideally) massive numbers of people executing your software is not easily observable<menzies13>. Moreover, each software quality you might want to monitor (performance, functional correctness, usability) requires entirely different methods of observation and analysis. Let's talk about some of the most important qualities to monitor and how to monitor them.
These are some of the easiest failures to detect because they are overt and unambiguous. Microsoft was one of the first organizations to do this comprehensively, building what eventually became known as Windows Error Reporting<glerum09>. It turns out that actually capturing these errors at scale and mining them for repeating, reproducible failures is quite complex, requiring classification, progressive data collection, and many statistical techniques to extract signal from noise. In fact, Microsoft has a dedicated team of data scientists and engineers whose sole job is to manage the error reporting infrastructure, monitor and triage incoming errors, and use trends in errors to make decisions about improvements to future releases and release processes. This is now standard practice in most companies and organizations, including other big software companies (Google, Apple, IBM, etc.), as well as open source projects (eg, Mozilla). In fact, many application development platforms now include this as a standard operating system feature.
Performance, like crashes, kernel panics, and hangs, is easily observable in software, but a bit trickier to characterize as good or bad. How slow is too slow? How bad is it if something is slow occasionally? You'll have to define acceptable thresholds for different use cases to be able to identify problems automatically. Some experts in industry<grabner16> still view this as an art.
It's also hard to monitor performance without actually _harming_ performance. Many tools and web services (e.g., [New Relic|https://newrelic.com/]) are getting better at reducing this overhead and offering real time data about performance problems through sampling.
Monitoring for data breaches, identity theft, and other security and privacy concerns are incredibly important parts of running a service, but also very challenging. This is partly because the tools for doing this monitoring are not yet well integrated, requiring each team to develop its own practices and monitoring infrastructure. But it's also because protecting data and identity is more than just detecting and blocking malicious payloads. It's also about recovering from ones that get through, developing reliable data streams about application network activity, monitoring for anomalies and trends in those streams, and developing practices for tracking and responding to warnings that your monitoring system might generate. Researchers are still actively inventing more scalable, usable, and deployable techniques for all of these activities.
The biggest limitation of the monitoring above is that it only reveals _what_ people are doing with your software, not _why_ they are doing it, or why it has failed. Monitoring can help you know that a problem exists, but it can't tell you why a program failed or why a persona failed to use your software successfully.
# Discovering Missing Requirements
Usability problems and missing features, unlike some of the preceding problems, are even harder to detect or observe, because the only true indicator that something is hard to use is in a user's mind. That said, there are a couple of approaches to detecting the possibility of usability problems.
One is by monitoring application usage. Assuming your users will tolerate being watched, there are many techniques: 1) automatically instrumenting applications for user interaction events, 2) mining events for problematic patterns, and 3) browsing and analyzing patterns for more subjective issues<ivory01>. Modern tools and services like make it easier to capture, store, and analyze this usage data, although they still require you to have some upfront intuition about what to monitor. More advanced, experimental techniques in research automatically analyze undo events as indicators of usability problems<akers09> this work observes that undo is often an indicator of a mistake in creative software, and mistakes are often indicators of usability problems.
All of the usage data above can tell you _what_ your users are doing, but not _why_. For this, you'll need to get explicit feedback from support tickets, support forums, product reviews, and other critiques of user experience. Some of these types of reports go directly to engineering teams, becoming part of bug reporting systems, while others end up in customer service or marketing departments. While all of this data is valuable for monitoring user experience, most companies still do a bad job of using anything but bug reports to improve user experience, overlooking the rich insights in customer service interactions<chilana11>.
Although bug reports are widely used, they have significant problems as a way to monitor: for developers to fix a problem, they need detailed steps to reproduce the problem, or stack traces or other state to help them track down the cause of a problem<bettenburg08>; these are precisely the kinds of information that are hard for users to find and submit, given that most people aren't trained to produce reliable, precise information for failure reproduction. Additionally, once the information is recorded in a bug report, even _interpreting_ the information requires social, organizational, and technical knowledge, meaning that if a problem is not addressed soon, an organization's ability to even interpret what the failure was and what caused it can decay over time<aranda09>. All of these issues can lead to intractable debugging challenges<qureshi16>.
Larger software organizations now employ data scientists to help mitigate these challenges of analyzing and maintaining monitoring data and bug reports. Most of them try to answer questions such as<begel14>:
The most mature data science roles in software engineering teams even have multiple distinct roles, including _Insight Providers_, who gather and analyze data to inform decisions, _Modeling Specialists_, who use their machine learning expertise to build predictive models, _Platform Builders_, who create the infrastructure necessary for gathering data<kim16>. Of course, smaller organizations may have individuals who take on all of these roles. Moreover, not all ways of discovering missing requirements are data science roles. Many companies, for example, have customer experience specialists and community managers, who are less interested in data about experiences and more interested in directly communicating with customers about their experiences. These relational forms of monitoring can be much more effective at revealing software quality issues that aren't as easily observed, such as issues of racial or sexual bias in software or other forms of structural injustices built into the architecture of software.
All of this effort to capture and maintain user feedback can be messy to analyze because it usually comes in the form of natural language text. Services like [AnswerDash|http://answerdash.com] (a company I co-founded) structure this data by organizing requests around frequently asked questions. AnswerDash imposes a little widget on every page in a web application, making it easy for users to submit questions and find answers to previously asked questions. This generates data about the features and use cases that are leading to the most confusion, which types of users are having this confusion, and where in an application the confusion is happening most frequently. This product was based on several years of research in my lab<chilana13>.