Notes from QCon New York 2015

I had the opportunity to speak at (and hence attend) QCon New York last month, and thought I would share some highlights and notes from the talks I was able to see. Overall, this remains a high quality conference, with multiple sessions where I had to make tough choices about which talk to attend. I was not able to attend a talk in every session due to responsibilities from the “home front” at work, and I did not always take notes (sometimes I just listened and tweeted particular points). I would definitely consider both attending and speaking at QCon again.

Chris Sanden: Automating Operational Decisions in Real-time

This talk was given by Chris Sanden, Senior Analytics Engineer at Netflix, and discussed three main areas where they apply automated analysis and machine learning in their architecture. By way of context, at this point in time, Chris said Netflix has over 700 services, with thousands of total server instances distributed across 3 AWS regions. From these instances, they collect millions of different metrics, which is plenty of data to feed to your favorite machine learning algorithm.

Automated Canary Analysis. The first area they applied machine learning of sorts was to assist with their canary deployment schemes. The idea here is that new versions of software are gradually rolled out into production, with a series of checkpoints where the new version is analyzed to make a go/no-go decision for advancing to the next checkpoint. Early on, the new version takes a small percentage of the overall service traffic, eventually having more instances deployed and taking a correspondingly greater portion of the traffic.

An interesting detail here was that they did not just run “mostly old” and “a little new” at the beginning; instead, they also designated a similarly-sized group of the old version in order to have an apples-to-apples comparison. For example, they might have 88 servers being the main production group on the old version, the canary deployment running on 6 servers with the new version, but then a third control group of 6 servers also running the old version. The canary would be compared against this control group instead of to the primary production group.

The rough scoring algorithm works as follows: select a set of metrics; compute the mean value of each metric for the canary and control groups every X minutes; calculate the ratio of the canary mean to the control mean; characterize each ratio as “high”, “low”, or “same”; and calculate the final canary score as the percentage of metrics that match or outperform the performance of the control. They then set a go/no-go threshold for the number of passing metrics in order to progress to the next stage of deployment, with a larger canary group (and a correspondingly larger control group). They found, in practice, that up to 16% of versions were assigned a ‘failing’ score, although humans still make the final decision on whether to forge ahead or roll back. At any rate, they seemed to be satisfied that this analysis was finding about the right amount of ‘things to be considered by a human’.
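
To make the scoring mechanics concrete, here is a minimal sketch of that scheme in Python. The metric names, the tolerance used to call a ratio “same”, and the pass threshold are illustrative choices of mine, not details from the talk.

    # Hypothetical sketch of the canary scoring described above; metric names,
    # tolerance, and threshold are illustrative, not Netflix's actual values.
    from statistics import mean

    def classify_ratio(ratio, tolerance=0.05):
        """Characterize a canary/control ratio as 'high', 'low', or 'same'."""
        if ratio > 1 + tolerance:
            return "high"
        if ratio < 1 - tolerance:
            return "low"
        return "same"

    def canary_score(canary_metrics, control_metrics, bad_when_high):
        """Percentage of metrics where the canary matches or outperforms the control.

        canary_metrics / control_metrics: dict of metric name -> list of samples.
        bad_when_high: metric names where a higher value is worse (e.g. latency).
        """
        passing = 0
        for name in control_metrics:
            ratio = mean(canary_metrics[name]) / mean(control_metrics[name])
            label = classify_ratio(ratio)
            outperforms = (label == "low" and name in bad_when_high) or (
                label == "high" and name not in bad_when_high
            )
            if label == "same" or outperforms:
                passing += 1
        return 100.0 * passing / len(control_metrics)

    # Example: only advance the canary if, say, 95% of metrics pass.
    score = canary_score(
        {"latency_ms": [110, 112], "requests_per_sec": [500, 510]},
        {"latency_ms": [115, 118], "requests_per_sec": [495, 505]},
        bad_when_high={"latency_ms"},
    )
    proceed = score >= 95.0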

They did note that several things made this somewhat of an art: selecting the right metrics; choosing the right frequency of analysis; choosing the right number of metrics (neither so few that you miss problems nor so many that the analysis is prohibitively expensive); and deciding whether a failing score was a false positive. They also noted some caveats: a single outlier instance in either the control or canary group can skew the analysis (because they are using means), and the analysis did not work reliably for canary groups smaller than six instances (which meant it was mostly useful for services with a larger number of deployed instances).

Server Outlier Detection. When you have a large number of server instances, outliers can be hidden in aggregated metrics. For example, a single abnormally slow service instance may not impact mean or 90th percentile latencies, but you still care about finding it, because your customers are still getting served by it. The next automated analysis application Chris described was specifically designed to find these outliers.

Their technique was to apply an “unsupervised machine learning” algorithm, DBSCAN (density-based spatial clustering of applications with noise) to group server instances into clusters based on their metric readings. Conceptually, if a point belongs to a cluster it should be near lots of other points as measured by some concept of “distance”; the servers that aren’t close enough to a cluster are marked as outliers.

In practice, their procedure looks like:

  1. collect a window of measurements from their telemetry system
  2. run DBSCAN
  3. process the results, applying custom rules (e.g. ignore servers that have already been marked as out-of-service)
  4. perform an action on outliers (e.g. terminate instance or remove from service)

Chris noted that the DBSCAN algorithm requires two input parameters as configuration, and that these need to be customized for each application. Since they want application owners to be able to use the system without in-depth knowledge of DBSCAN itself, they instead ask the application owners to estimate how many outliers they expect (based on past experience) and then auto-tune the parameters via simulated annealing so that DBSCAN finds the expected number of outliers.
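
As a rough illustration of the clustering step (not Netflix's actual code), here is a small sketch using scikit-learn's DBSCAN implementation; the metric values and the eps/min_samples parameters below are made up, whereas Netflix derives those parameters automatically as described above.

    # Illustrative only: flag outlier server instances with DBSCAN (scikit-learn).
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # Rows = server instances, columns = metrics (e.g. latency, error rate, CPU).
    metrics = np.array([
        [105.0, 0.01, 0.55],
        [ 98.0, 0.02, 0.52],
        [101.0, 0.01, 0.57],
        [240.0, 0.09, 0.91],   # a suspiciously slow, busy instance
    ])

    X = StandardScaler().fit_transform(metrics)
    labels = DBSCAN(eps=1.0, min_samples=2).fit(X).labels_

    # DBSCAN assigns the label -1 to points that belong to no cluster.
    outliers = [i for i, label in enumerate(labels) if label == -1]
    print("outlier instances:", outliers)   # -> [3]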

Anomaly Detection. Here they use machine learning to identify cases where aggregate metrics–ones that apply across all instances of a service–have become “abnormal” in some fashion. This works by training a model on historical metrics so that it learns what “normal” means, and then loading it into a real-time analytics platform to watch for anomalies. The training happens by having service owners “tag” anomalies as they occur in the dataset. Because what is “normal” for a service can drift over time as usage patterns and software versions change, they automatically evaluate each model against benchmark data nightly, retraining the model when performance (accuracy) has degraded and automatically switching to a more accurate model when one is found. They also try to capture when their users (other developers) think a model has drifted, and try to make it easy to capture and annotate new data for testing.
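
A hypothetical sketch of that nightly evaluate-and-swap loop might look like the following; the model type, accuracy threshold, and synthetic data are stand-ins of mine, not details from the talk.

    # Hypothetical sketch of a nightly "evaluate, retrain, swap" cycle.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    ACCURACY_FLOOR = 0.90  # assumed: below this, the current model is considered drifted

    def nightly_check(current_model, X_bench, y_bench, X_train, y_train):
        """Keep the current model unless a freshly trained one beats it on the benchmark."""
        current_acc = accuracy_score(y_bench, current_model.predict(X_bench))
        if current_acc >= ACCURACY_FLOOR:
            return current_model                       # still healthy; no retrain needed
        candidate = RandomForestClassifier().fit(X_train, y_train)
        candidate_acc = accuracy_score(y_bench, candidate.predict(X_bench))
        return candidate if candidate_acc > current_acc else current_model

    # Toy usage: synthetic "anomalous vs. normal" labels stand in for tagged metrics.
    X, y = make_classification(n_samples=400, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X[:200], y[:200])
    model = nightly_check(model, X[200:300], y[200:300], X[:300], y[:300])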

Chris noted that bootstrapping the data set and initial model can be done by generating synthetic data and then intentionally injecting anomalies. He also mentioned that Yahoo has released an open source anomaly benchmark dataset. In addition, there are now multiple time-series databases available under open source licenses. One gotcha they have run into is that because humans are the initial source of training data (via classifying metrics as anomalous or not), they can sometimes be inaccurate or inconsistent when tagging data.

Lessons Learned. Chris indicated that the “last mile” can be challenging, and that “machine learning is really good at partially solving just about any problem.” It’s easy to get to 80% accuracy, but there are diminishing returns on effort after that. Of course, for some domains, 80% accuracy is still good enough/useful. Finally, he suggested not letting “Skynet out into the wild without a leash”–if the machine learning system is actually going to take operational actions, you need to make sure there are safeguards in place (“Hmm, I think I will just classify all of these server instances as outliers and just terminate them…”) and to make sure that the safeguards have been well tested!

Mary Poppendieck: Design vs. Data

Mary Poppendieck (of Lean Software Development fame) asked: How do we get generative architectural designs that evolve properly? She cited examples from physical architecture (it turns out she comes from a family filled with architects), particularly Gothic cathedrals whose construction spanned decades, and in some cases centuries, and certainly spanned multiple architects and master masons. In some cases this is obvious, such as cathedrals whose towers to the left and right of the main entrance do not look even remotely similar. In other cases (and Notre Dame de Paris was identified as one), the building does have an overall consistency to it. How does this work?

Mary reviewed Christopher Alexander’s (yep, the pattern language guy) “Theory of Centers” that described fifteen properties of wholeness that good architecture should have. Mary proposed that ten of these–(1) levels of scale; (2) strong centers; (3) boundaries; (4) local symmetries; (5) alternating repetition (recursion); (6) echoes (patterns); (7) positive space; (8) good shape; (9) simplicity; and (10) not-separateness (connectedness)–had analogues for software architecture. Her hypothesis is:

Learning through ongoing experimentation is not an excuse for sloppy system design. On the contrary: strong systems grow from a design vision that helps maintain “Properties of Wholeness” while learning through careful analysis and rigorous experiments.

She suggested the Android Design Principles were a good example of this concept.

Mary then moved on to propose an architectural design language set up to allow for incremental learning and development while maintaining an overarching “wholeness”. The main principles were:

  1. Understand data and how to use it. Data must be central to an architecture, and “a picture is worth a thousand data points”. It’s important to understand the difference between analysis (examining data you already collect) vs. experimentation (specifically collecting metrics to prove or disprove a hypothesis). Everyone must be on the same team, from data scientists to engineers to architects. Mary suggested these principles echoed Alexander’s properties of shape, boundaries, and connectedness.

  2. Simplify the job of data scientists. Data pipelines must be wide and fast. Experiments need design and structure. Access to data must be provided through APIs that support learning and control. Alexander parallels: space, simplicity, levels of scale.

  3. See, Think, Gain Amazing Insights. Be conversant with the best tools and analytical models. Be explicit about assumptions. Make it easy to share the search for patterns and outliers. Test insights rigorously. Alexander parallels: patterns, symmetry, recursion.

Mary classified our uses of data into four main categories: monitoring, control, simulation, and prediction. A good data architecture will support all of these, and so it must provide the following set of capabilities:

In summary, Mary suggests we have to design the entire system, not just the code; i.e., the technical architecture must also account for its data by-products and the surrounding processes it needs to support.

Finally, Mary had some choice quotes during the Q&A period after her talk:

Additional resources: (recommended by Mary)

Kovas Boguta & David Nolen: Demand-Driven Architecture

This talk was given jointly by Kovas Boguta and David Nolen. They correctly observed that the proliferation of different clients for many APIs has put a lot of pressure on the server side of the API: the server wants to present a one-size-fits-all RESTful interface, yet the clients often need customized versions of those resources to deliver polished experiences. In particular, many clients need to present what are essentially joins across multiple resources. With N clients, you end up with N front-end teams “attacking” the service team with N different sets of demands, resulting in what they described as a “Christmas tree service.” The speakers suggested this was only going to get worse, not better, with the continued proliferation of mobile and IoT devices.

David observed that RDBMSes solved a similar problem previously: building a generalized interface (SQL) and allowing clients to issue requests that were queries specifying what data they wished to receive. Of course, we know well that exposing a SQL interface is rife with security problems, but perhaps the overall pattern can still be applied with a restricted “query language” of sorts that is easier to reason about.

The principles they proposed were:

David then showed some ClojureScript source code that used the “pull syntax” from Datomic as the query language. In this code, he was able to annotate views with the queries needed to populate them; this allows for full client flexibility while keeping maintenance tractable. David pointed out that this doesn’t mean you don’t need a backend; on the contrary, you still need to worry about security, routing, and caching implications.
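
To illustrate the idea (this is not David's ClojureScript, and Datomic's pull syntax is considerably richer), here is a tiny, hypothetical Python sketch of a view annotated with the query it needs, fulfilled by a single generic "pull" function:

    # Hypothetical illustration of demand-driven queries: the view declares the
    # attributes it needs, and one generic endpoint fulfills that declaration.
    USER_DB = {
        42: {"id": 42, "name": "Ada", "email": "ada@example.com", "orders": [101, 102]},
    }

    # The view is annotated with its query: only these attributes will be fetched.
    USER_CARD_QUERY = ["name", "email"]

    def pull(entity_id, query, db=USER_DB):
        """Return just the requested attributes for an entity (a tiny 'pull' analogue)."""
        entity = db[entity_id]
        return {attr: entity[attr] for attr in query if attr in entity}

    def render_user_card(user_id):
        data = pull(user_id, USER_CARD_QUERY)
        return f"{data['name']} <{data['email']}>"

    print(render_user_card(42))   # -> Ada <ada@example.com>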

[Jon’s commentary] Netflix tackled this same problem in a slightly different way, which was to build a scriptable API facade. Client teams were still able to customize device-specific APIs via Groovy scripts built on data/services that were exposed as libraries in their API facade application. This avoids exposing a more general query interface, which makes security analysis easier, although it does still require the client teams to implement and maintain the server sides of their APIs.

Additional material (via Kovas and David):

Jesus Rodriguez: Powering the Industrial Enterprise: The Emergence of the IOT PaaS

Jesus Rodriguez, formerly of Microsoft and now a veteran of several startups, noted that Gartner says IoT is at the peak of inflated expectations (while also noting that Gartner is responsible for a lot of IoT hype!). Jesus also noted that 70% of IoT funding rounds from 2011-2013 were related to wearables, and there was almost no investment in platforms, which he saw as open territory.

Jesus suggested that enabling enterprise-scale IoT brings several challenges: large amounts of data; connectivity; integration; event simulation; scalability; security; and real time analytics. Therefore, he thought that there was a need for a new type of platform, an IoT platform-as-a-service (IoT PaaS). He thought we would see both centralized (interactions are orchestrated by some sort of centralized hub or service) and decentralized (devices interact directly) models develop, so there was a need perhaps for multiple types of PaaS here.

In the centralized model, smart devices talk to a central hub that provides backend capabilities but also manages and controls the device topology. In the decentralized model, devices operate without a central authority. Jesus felt that in this model the smart devices would host a version of the IoT PaaS itself; in this setting I presume it would be some sort of library, framework, or co-deployed process.

For the remainder of the talk, Jesus identified several capabilities that he felt ought to be provided by an IoT PaaS, as well as providing pointers to some existing technology. Since I wasn’t familiar with a lot of the technologies he mentioned, this was an exercise in “write everything down and look it up later” for me (although I was the only person who raised a hand when he asked if anyone had heard of CoAP, which I had learned about via the appendix of Mike Amundsen’s RESTful Web APIs book).

Centralized capabilities

  1. device management service: managing smart devices in an IoT topology; device monitoring; device security; device ping (tech: consul.io; XMPP discovery extensions; IBM IoT Foundation device management API)

  2. protocol hub: provide consistent data/message exchange experiences across different devices. Unify management, discovery, and monitoring interfaces across different devices (IOTivity protocol plugin model; IOTLab protocol manager; Apigee Zetta / Apigee Link)

  3. device discovery: registering devices in an IoT topology; dynamically discovering smart devices in IoT network (UDP - multicast/broadcast, CoAP, IOTivity discovery APIs)

  4. event aggregation: execute queries over data streams; compose event queries; distribute query results to event consumers. complex event processing. (Apache Storm, AWS Kinesis; Azure Event Hubs and Stream Analytics; Siddhi (WSO2))

  5. telemetry data storage: store data streams from smart devices; store the output of the event aggregator service; optimize access to the data based on time stamps; offline query (time series: openTSDB, KairosDB, InfluxDB; offline: Couchbase; IBM Bluemix Time Series API)

  6. event simulation: replay streams of data; store data streams that simulate real world conditions; detect and troubleshoot error conditions associated with specific streams (Azure Event Hubs; Kinesis; Apache Storm; PubNub)

  7. event notifications: distribute events from a source to different devices; devices can subscribe to specific topics (PubNub; Parse Notifications; MQTT)

  8. real time data visualizations: map visualizations; integrate with big data / machine learning. (MetricsGraphicsJS; Graphite / Graphene; Cube; Plot.ly; D3JS)

Jesus thought that the adoption of a centralized IoT PaaS would be realized by having a standard set of services/interfaces but multiple implementations, ideally in a hosting-agnostic package. It would be important to be extensible and allow for third party integration support, while providing centralized management and governance. Jesus thought that CloudFoundry might be a good place to build this ecosystem (or at least could serve as a good model for how to do it).

Decentralized IoT PaaS Capabilities

  1. P2P Secure Messaging: secure, fully encrypted messaging protocol (Telehash)

  2. contract enforcement & messaging trust: express capabilities; enforce actions; maintain a trusted ledger of actions (Blockchain; Ethereum)

  3. file sharing: efficiently sending files to smart devices (firmware update); exchange files in a decentralized model; secure and trusted file exchanges (Bittorrent)

Other capabilities

As with any PaaS system, gathering historical analytics will be important. In addition, there will be a need for device cooperation (machine-to-machine, or “M2M”) which gets into agent-based artificial intelligence sorts of systems.

Jesus saw the possibility for several types of companies to bring an IoT PaaS to market:

Mitchell Hashimoto: Orchestrating Containers with Terraform and Consul

This talk was given by Mitchell Hashimoto, CEO and co-founder of HashiCorp. HashiCorp maintains several open source projects in this space (Terraform and Consul are two of them) while also selling related commercial software products. Mitchell defined orchestration as “do some set of actions, to a set of things, in a set order.” The particular goal for orchestration in the context of his talk was to safely deliver applications at scale. He noted that containers solve some problems, namely packaging (e.g. Docker Image), image storage (e.g. Docker Registry), and execution (e.g. Docker Daemon), with a sidenote that image distribution might still be an open problem.

However, containers do not solve other problems that are nonetheless important for application delivery, namely:

Even as organizations adopt containers, though, there is still a need to continue to support legacy applications; the transition from non-containers to containers isn’t going to be atomic, so orchestration needs to also include support for non-containerized systems. The time period for this transition will probably be years, and what about orchestration in a post-container world someday? Mitchell quoted Gartner and Forrester (~citation needed~) as estimating that the Fortune 500 would be completing their transition to virtualization…in 2015, over a decade after viable enterprise-grade virtualization became available. In other words, orchestration is an old problem; it’s not caused by containers. However, the higher density and lifecycle speed of containers reveals and exacerbates orchestration problems. Modern architectures also include new patterns and elements like public cloud, software-as-a-service (SaaS), and generally a growing external footprint. Orchestration problems will continue to exist for the foreseeable future.

Terraform

Terraform solves the infrastructure piece of an overall orchestration solution, providing the ability to build, combine, and launch infrastructure safely and efficiently. As a way of illustrating what problems Terraform solves, Mitchell asked:

Terraform permits you to create infrastructure with code, including servers, load balancers, databases, email providers, etc., similar to what OpenStack Heat provides. This includes support for SaaS and PaaS resources. With Terraform, a single command is used for both creating and updating infrastructure. It allows you to preview changes to infrastructure and save them as diffs; therefore, code plus diffs can be used to treat infrastructure changes like code changes: make a PR, show diffs, review, accept, and merge. Terraform has a module system that allows subdividing your infrastructure to allow teamwork without risking stability, and the configuration system allows you to reference the dynamic state of other resources at runtime, e.g. ${digitalocean_droplet.web.ipv4_address}. The configuration format is human-friendly and JSON compatible; as a text format it is version-control friendly. Since the configuration is declarative, Terraform can be idempotent and highly parallelized; the diff-based mechanism means that Terraform will only do what the plan says it will do, allowing you to examine what it will do ahead of time (“make -n”, anyone?) as a clear, human-readable diff.

Consul

Consul is “service discovery, configuration, and orchestration made easy.” It is billed as being distributed, highly-available, and datacenter-aware. In a similar fashion to his description of Terraform, Mitchell identified several questions that Consul can answer for you:

Consul offers a service lookup mechanism with both HTTP and DNS-based interfaces. For an example of the latter:

$ dig web-frontend.service.consul. +short
10.0.3.89
10.0.1.46

Consul can work for both internal and external services (external ones can be explicitly registered), and it incorporates failure detection and health checking so that DNS lookups won’t return unhealthy services or nodes; the HTTP API offers more detailed information about the overall health state of the managed catalog. Health checks are carried out by local agents and can be arbitrary shell scripts; participating nodes then gossip health information around to each other.

Consul provides key/value storage that can be used for application configuration, and watches can be set on keys (via long poll) to receive notification of changes. Consul also provides for ACLs on keys to protect sensitive information and allow for multi-tenant use. Mitchell suggested that the type of information best suited for Consul should power “the knobs you want to turn at runtime” such as port numbers or feature flags. There is an auxiliary project called consul-template that can regenerate configuration files from templates whenever underlying configuration data changes.
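
As a hedged sketch (not from the talk) of what watching a key might look like, here is a small Python client that reads a value from the KV store and long-polls for changes using Consul's blocking queries over HTTP; the agent address and key name are assumptions of mine.

    # Read a Consul KV key and long-poll for changes via blocking queries.
    import base64
    import requests

    CONSUL = "http://127.0.0.1:8500"          # assumes a local agent on the default port
    KEY = "service/web/feature_flag"          # hypothetical key name

    def read_key(index=0):
        # Passing the last-seen index turns this into a blocking (long-poll) query
        # that returns when the key changes or the wait time elapses.
        resp = requests.get(f"{CONSUL}/v1/kv/{KEY}", params={"index": index, "wait": "5m"})
        resp.raise_for_status()
        entry = resp.json()[0]
        value = base64.b64decode(entry["Value"]).decode()
        return value, int(resp.headers["X-Consul-Index"])

    value, index = read_key()
    while True:
        print("current value:", value)
        value, index = read_key(index)        # blocks until the key changes again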

Consul provides multi-datacenter support as well, although from what I understood, each datacenter is essentially its own keyspace, since values are replicated via the strongly consistent Raft protocol, which generally doesn’t run well across wide-area networks (WANs). Key lookups are local by default, but the local datacenter’s Consul masters can forward requests to other datacenters as needed, so you can still view keys and values from all datacenters within one user interface.

In addition to basic key/value lookup, Consul also supports events that can be published as notifications, as well as execs, which are conceptually a “scalable ssh for-loop” in Mitchell’s words. He said there are pros and cons to using each of events, execs, and watches, but that when used in appropriate settings they have found they can scale to thousands of Consul agents.

Camille Fournier: Consensus Systems for the Skeptical Architect

Camille Fournier, CTO of Rent the Runway (RTR), subtitled her talk “Skeptical Paxos/ZAB/Raft: A high-level guide on when to use a consensus system, and when to reconsider”. Camille rhetorically asked: if new distributed databases like Riak, Cassandra, or MongoDB don’t use a standalone consensus system like ZooKeeper (ZK), etcd, or Consul, are the latter consensus systems any good? She pointed out that the newer distributed databases are often focused on: high availability, where strong consistency is a tradeoff; fast adoption to pursue startup growth (e.g. “don’t ask me to install ZooKeeper before installing your distributed database”); and were designed from the ground up as distributed systems by distributed systems experts. She also shared that RTR does not use a standalone consensus system, largely because their business needs and technical environment either don’t require or aren’t suitable for such a system. In the remainder of the talk, Camille shared some evaluation criteria that teams and organizations can use when trying to decide if systems like ZK et al. are a good fit.

First, we should evaluate where the system would run. If the environment does not require operational support for rapid growth and rapid change, then a standalone consensus system may be overkill. Consensus systems are often used to provide distributed locks or service orchestration for large distributed systems, but in Camille’s words, “you don’t always have an orchestra; sometimes you have a one-man band.” Simpler alternatives to distributed service orchestration include load balancers or DNS; locks can be provided by databases (just use a transactional relational database, or something like DynamoDB that supports strongly consistent operations).
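
For example, the "just use a database" style of advisory lock can be as simple as inserting a uniquely keyed row; the following is a hypothetical sketch using SQLite, not something Camille showed.

    # Hypothetical advisory lock backed by a relational database's unique constraint.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT)")

    def try_acquire(name, owner):
        try:
            with conn:
                conn.execute("INSERT INTO locks (name, owner) VALUES (?, ?)", (name, owner))
            return True
        except sqlite3.IntegrityError:
            return False                     # someone else already holds the lock

    def release(name, owner):
        with conn:
            conn.execute("DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner))

    print(try_acquire("nightly-job", "worker-1"))   # True
    print(try_acquire("nightly-job", "worker-2"))   # False: already held
    release("nightly-job", "worker-1")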

Second, we should consider what primitives are needed for our application. Consensus systems provide strong consistency, and several have support for ephemeral nodes (which disappear when a client session disconnects) and notifications or watches. Different consensus systems provide these to different degrees, and later in the talk Camille summarized some of these tradeoffs, which I captured below. Perhaps her strongest point in this section is that consensus systems are not really key/value stores per se; they are designed to point to data, not to contain it. You can use them for limited pub-sub operations, but…you can also fix things with duct tape and baling wire.

Camille also provided some analysis of the similarities and differences between ZK and etcd to illustrate some of the subtleties involved in choosing one or the other. Both use a proven consensus algorithm (ZAB for ZooKeeper, Raft for etcd) to provide consistent state. With ZooKeeper, clients maintain a stateful connection to the cluster. While this can be powerful, it can be hard to do right–the ZooKeeper client state machine is complicated, and Camille recommended using the Curator library for Java instead of writing your own. The stateful session does ensure a single system image per client. etcd, by contrast, has an HTTP-based interface, which is easy to implement against and does not require complex session management. However, you must pay the overhead of the HTTP protocol; if you use temporary time-to-live nodes, you have to implement heartbeats/failure detection in your clients; and achieving a “single system image” requires more work. On the watch side, ZK watches do not guarantee that you will see the intermediate states of a watched node that undergoes multiple changes, whereas etcd watches are provided via long poll and can also show the change history within a certain timeframe.
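
For a flavor of the HTTP-based approach, a minimal etcd read-and-watch (against the v2 keys API that was current at the time) might look roughly like this; the port, key name, and values are assumptions of mine, not something Camille showed.

    # Rough illustration of reading and watching a key via etcd's v2 HTTP API.
    import requests

    ETCD = "http://127.0.0.1:2379"            # assumes a local etcd on the default port

    # Read the current value of a (hypothetical) key.
    node = requests.get(f"{ETCD}/v2/keys/config/feature_flag").json()["node"]
    print("current:", node["value"], "at index", node["modifiedIndex"])

    # Long-poll for the next change after the index we just saw; passing waitIndex
    # lets etcd replay recent history if the watcher fell behind.
    change = requests.get(
        f"{ETCD}/v2/keys/config/feature_flag",
        params={"wait": "true", "waitIndex": node["modifiedIndex"] + 1},
    ).json()
    print("changed to:", change["node"]["value"])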

Camille then wrapped up with a number of common challenges that can be faced when deploying consensus systems.

Odd numbers rule
Use 3 or 5 cluster members; there's no value in having four, as you still need 3 available to form a majority. This requires more servers to be up than with 3 cluster members while tolerating fewer failures than 5 cluster members--the worst of both worlds. (See the quorum sketch after this list.)
Clients can run amok
Camille also phrased this as "multi-tenancy is hard" and "hell is other developers." She suggested potentially not sharing the same consensus system deployment to guarantee resource isolation, doing lots of code review, and providing wrapper libraries for clients to ensure good client behavior.
Garbage collection (GC) and network burps
This is a warning about lock assumptions. Many distributed locks are advisory and are based on the concept of temporary leases that must be successfully renewed by the holder or the lock gets released. In some cases, a GC pause or network partition can exceed the lock lease timeout, which can result in two systems thinking they both hold the lock. Dealing successfully with this challenge requires validating lock requests at the protected resource in order to detect out-of-order use. Despite strong consistency, the realities of the physical world mean that ZK et al. can only provide advisory locking at best.
Look for the blood spatter pattern
Bugs will be in the features no one uses or the things that happen rarely. Camille shared a story where when she first tried to use ACLs and authentication in ZooKeeper--a documented feature--she found it didn't work at all, because no one actually used it!
Consensus Owns Your Availability
Per the CAP theorem, there can be an availability tradeoff for consensus systems during partitions, even if the individual cluster members are up. If you make your application's availability dependent on the consensus system's availability, consensus can become a single point of (distributed!) failure.
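
For reference, the quorum arithmetic behind the "odd numbers rule" above, as a quick illustrative snippet (mine, not Camille's):

    # Majority quorum is floor(n/2) + 1; fault tolerance is n - quorum.
    for n in (3, 4, 5):
        quorum = n // 2 + 1
        print(f"{n} members: need {quorum} up, tolerate {n - quorum} failure(s)")
    # 3 members: need 2 up, tolerate 1 failure(s)
    # 4 members: need 3 up, tolerate 1 failure(s)   <- more servers, same tolerance as 3
    # 5 members: need 3 up, tolerate 2 failure(s)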