Friday, November 9, 2007

QCon: eBay Architecture

Notes taken during Randy Shoup's session on eBay's architecture at QCon

Partition Everything
Segment both functionally and horizontally: partition data by modulo of a key, by ranges, etc., depending on the data. Application servers are load-balanced in pools divided by function, and the search index is kept separate from the read-write listings store.
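
The talk didn't show code, but key-modulo routing might look something like this minimal sketch (PartitionRouter and the shard list are my invented names, not eBay's):

    import java.util.List;

    // Minimal sketch of modulo-based horizontal partitioning.
    class PartitionRouter {
        private final List<String> shards; // e.g. one JDBC URL per partition

        PartitionRouter(List<String> shards) {
            this.shards = shards;
        }

        // Route a record to a shard by the modulo of its key.
        String shardFor(long itemId) {
            return shards.get((int) Math.floorMod(itemId, (long) shards.size()));
        }
    }

Range-based partitioning would swap the modulo for a lookup against range boundaries; which scheme fits depends on the data, as Shoup said.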

Avoid database transactions; seek consistency through other approaches: BASE (basically available, soft state, eventually consistent).

Absolutely no session state; transient state is maintained and referenced by (see the sketch after this list):

  • URL rewriting for small data
  • Larger data in cookies (up to 4k)
  • Largest state in a scratch database (e.g. multi-page flows, like listing an item)
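
A rough sketch of that size-based routing (my illustration; only the ~4k cookie ceiling comes from the talk, the URL budget is an invented threshold):

    // Sketch: route transient state to a store by size, per the notes.
    // The class and the URL threshold are hypothetical; eBay's actual
    // mechanism was not shown.
    class TransientStateRouter {
        static final int URL_LIMIT = 128;          // assumed budget for URL rewriting
        static final int COOKIE_LIMIT = 4 * 1024;  // ~4k cookie ceiling

        enum Store { URL, COOKIE, SCRATCH_DB }

        Store storeFor(byte[] state) {
            if (state.length <= URL_LIMIT) return Store.URL;       // small data
            if (state.length <= COOKIE_LIMIT) return Store.COOKIE; // larger data
            return Store.SCRATCH_DB; // multi-page flows, e.g. listing an item
        }
    }
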
Asynchronous Everywhere
Pushing dependencies off into asynchronous calls decouples their availability and performance from the user-facing request, and failed calls can simply be retried. The user's experience of latency improves even though end-to-end data/execution latency may actually rise, and the system can allocate more time to processing than a user would tolerate waiting for.

More interestingly, this spreads the cost of load over time: spikes matter less, because much of the processing can queue up and then catch up during off-peak cycles.
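
Something like the following captures the queue-and-catch-up idea (a minimal sketch, not eBay's actual dispatcher):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: defer work onto a queue so spikes buffer up and workers
    // drain the backlog during off-peak cycles.
    class DeferredWork {
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

        void submit(Runnable task) {
            queue.offer(task); // the request thread returns immediately
        }

        // Worker loop, sized for average rather than peak load.
        void drain() throws InterruptedException {
            while (true) {
                Runnable task = queue.take();
                try {
                    task.run();
                } catch (RuntimeException e) {
                    queue.offer(task); // naive retry; a real system would back off
                }
            }
        }
    }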

Message dispatch works through pub-sub: listing an item, for example, triggers an ITEM.NEW event which can be consumed by the summary update, user metrics, image processing, etc. They have over 100 logical consumers consuming ~300 events. As described the other day, they use at-least-once delivery in any order, rather than trying for exactly-once, ordered delivery.

Event consumers go back to the primary source of the data rather than relying on the data in the event.
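
Taken together, at-least-once unordered delivery plus going back to the primary source implies consumers like this hypothetical ITEM.NEW handler, which is idempotent and ignores the event payload (ItemStore, SearchIndex, and Item are invented stand-ins):

    interface ItemStore { Item load(long itemId); }
    interface SearchIndex { void upsert(Item current); } // upsert is idempotent

    class Item {
        final long id;
        final String title;
        Item(long id, String title) { this.id = id; this.title = title; }
    }

    class ItemNewConsumer {
        private final ItemStore store;
        private final SearchIndex index;

        ItemNewConsumer(ItemStore store, SearchIndex index) {
            this.store = store;
            this.index = index;
        }

        // Safe under duplicate and out-of-order delivery: the event only
        // says "item X changed"; the current truth is re-read from the
        // primary store and applied idempotently.
        void onItemNew(long itemId) {
            Item current = store.load(itemId);
            if (current != null) {
                index.upsert(current);
            }
        }
    }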

eBay also uses batch processing for infrequent or scheduled work, and for problems that are difficult to partition (e.g. anything needing a full table scan): generating recommendations, importing third-party data, computing sales ranks, archiving and purging deleted items. These jobs often drive further downstream processing through message dispatch events.
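
As a sketch of that batch-then-dispatch pattern (all names invented, including the event name):

    import java.util.Iterator;

    interface Dispatcher { void publish(String event, long id); }

    // A periodic job over a full-table-scan cursor that fans results
    // back out through message dispatch for downstream consumers.
    class PurgeDeletedItemsJob {
        private final Dispatcher dispatcher;

        PurgeDeletedItemsJob(Dispatcher dispatcher) {
            this.dispatcher = dispatcher;
        }

        void run(Iterator<Long> deletedItemIds) {
            while (deletedItemIds.hasNext()) {
                long id = deletedItemIds.next();
                // ... archive/purge work would happen here ...
                dispatcher.publish("ITEM.PURGED", id); // hypothetical event name
            }
        }
    }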

Automate Everything
Machines scale better and more cheaply than humans, and they adapt faster to a changing environment.

One approach here is adaptive configuration. Define SLAs for logical event consumers (e.g. 99% of events processed within 15 seconds) and let each consumer dynamically adjust to meet its SLA by tweaking event polling batch size, polling frequency, and number of threads. The same loop minimizes cost by adapting to a changing environment: add more instances to a consumer pool, and each instance can ramp down its polling frequency.
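
A guess at what that feedback loop might look like (the adjustment rule and starting values are mine; the talk described only the goal):

    // Sketch: a consumer tunes itself against an SLA such as
    // "99% of events processed within 15 seconds".
    class AdaptivePoller {
        private static final long SLA_MILLIS = 15_000;
        private int batchSize = 100;          // events per poll (assumed start)
        private long pollIntervalMs = 1_000;  // assumed start

        // Called periodically with the observed 99th-percentile latency.
        void adjust(long p99LatencyMs) {
            if (p99LatencyMs > SLA_MILLIS) {
                // Behind SLA: poll more often and take bigger batches.
                pollIntervalMs = Math.max(100, pollIntervalMs / 2);
                batchSize = Math.min(1_000, batchSize * 2);
            } else if (p99LatencyMs < SLA_MILLIS / 2) {
                // Comfortably ahead (e.g. more instances joined the pool):
                // ramp down polling to minimize cost.
                pollIntervalMs = Math.min(10_000, pollIntervalMs * 2);
            }
        }
    }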

eBay also employs machine learning: collect user behavior, aggregate it and make decisions from it, then redeploy the metadata results of the learning. This can be used to choose the pages/modules/inventory that provide the best experience for a given user and context. The system needs to be perturbed to try alternatives, or it gets stuck on local maxima.
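
The perturbation idea maps onto something like epsilon-greedy selection (the talk didn't name an algorithm; this is one standard realization, with an assumed 5% exploration rate):

    import java.util.List;
    import java.util.Random;

    // Mostly serve the best-known module, but occasionally try an
    // alternative so the learner can escape a local maximum.
    class ModuleChooser {
        private final Random random = new Random();
        private final double epsilon = 0.05; // exploration rate (assumed)

        String choose(List<String> modules, String bestKnown) {
            if (random.nextDouble() < epsilon) {
                return modules.get(random.nextInt(modules.size())); // explore
            }
            return bestKnown; // exploit the current learned metadata
        }
    }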

Remember that Everything Fails
To be as available as possible: assume everything can fail and every resource will become unavailable at some point; detect failure and recover from it as rapidly as possible; and keep doing as much as possible even when failure is detected.

eBay logs all activity (requests, exceptions, application-generated information), especially around databases and other resources, onto a messaging bus (1.5TB of log messages per day!). Listeners on the bus automate failure detection and notification, and current behavior can be compared against data-warehoused history for root-cause detection (did we roll out new code? which database partitions are affected? etc.).
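
A bus listener for automated detection might be as simple as this sketch (threshold and types invented):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Counts exceptions per resource as they stream past on the bus and
    // alerts when a resource crosses a threshold. Counter reset (e.g.
    // once per minute by a scheduler) is omitted.
    class ErrorRateListener {
        private static final int THRESHOLD = 50; // errors/minute, assumed
        private final Map<String, AtomicInteger> errors = new ConcurrentHashMap<>();

        void onException(String resource) {
            int count = errors
                .computeIfAbsent(resource, r -> new AtomicInteger())
                .incrementAndGet();
            if (count == THRESHOLD) {
                System.err.println("ALERT: error spike on " + resource);
            }
        }
    }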

Make sure that every change to the site can be rolled back: in each two-week period, eBay rolls out 100,000 lines of code. Many changes involve dependencies between pools, so rollout plans contain the explicit transitive set of dependencies. Automated tools execute a staged rollout with checkpoints, and the same tooling enables immediate rollback if necessary, including full rollback of dependent pools.
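
Ordering a rollout by an explicit transitive dependency set is essentially a topological sort. A sketch (data model invented, cycles assumed absent):

    import java.util.*;

    class RolloutPlanner {
        // deps maps each pool to the pools it depends on.
        List<String> order(Map<String, List<String>> deps) {
            List<String> ordered = new ArrayList<>();
            Set<String> visited = new HashSet<>();
            for (String pool : deps.keySet()) {
                visit(pool, deps, visited, ordered);
            }
            return ordered; // deploy in this order; roll back in reverse
        }

        private void visit(String pool, Map<String, List<String>> deps,
                           Set<String> visited, List<String> ordered) {
            if (!visited.add(pool)) return; // already handled
            for (String dep : deps.getOrDefault(pool, Collections.<String>emptyList())) {
                visit(dep, deps, visited, ordered);
            }
            ordered.add(pool); // dependencies first, dependents after
        }
    }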

Audience question: "How do you test this?"
Randy: "We test it every two weeks, with our blood, sweat, effort."
Dan: "When the rollout plan is risky, there will be an explicit test of the rollout, not just the features."

Also, all features have an on/off state driven by central configuration, so features can be turned off for operational or business reasons. This decouples code deployment from feature deployment: applications check for the availability of features in the same way that they check for the availability of resources.
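
Reduced to its core, the check is just a central-config lookup (a sketch; names are mine):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class FeatureConfig {
        private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

        // Refreshed from central configuration; flipping a flag turns a
        // deployed feature on or off without a code rollout.
        void update(String feature, boolean enabled) {
            flags.put(feature, enabled);
        }

        boolean isAvailable(String feature) {
            return flags.getOrDefault(feature, Boolean.FALSE); // off until enabled
        }
    }

Application code then guards each feature with isAvailable(...), exactly as it would guard its use of a resource.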

They often don't go back and remove the on/off capability, which does eventually result in configuration and code bloat. eBay allocates "a pretty decent percentage" of time to head-room work ("We're going to bump our head up against that wall unless we fix this"), such as developer productivity and refactoring, and some of that clean-up can be addressed there.

In failure detection, it's much easier to detect 'failed' than 'slow'. Once a server or resource is considered failed, it is "marked down": alerts are sent and requests are no longer routed to it. If the resource isn't critical, its functionality is suspended; if it is critical, the work is retried against an alternate resource or deferred via a guaranteed asynchronous message. Explicit mark-up allows the resource to be restored and brought online in a controlled way: it can be marked up for different parts of the infrastructure at a time, which is very important when the failure was load-induced. If the entire system were told at once that the resource was back online, it might immediately return to the failure state.
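
This is essentially a circuit breaker with an explicit, gradual close. One way to model the partial mark-up (my sketch, with a fraction-of-traffic knob standing in for "different parts of the infrastructure"):

    import java.util.Random;

    // Marked-down resources receive no requests; mark-up is explicit and
    // can be partial, so a recovering resource is not flooded back into
    // a load-induced failure.
    class ResourceState {
        private volatile double admitFraction = 1.0; // 0.0 = marked down
        private final Random random = new Random();

        void markDown() { admitFraction = 0.0; }

        // Explicit, staged mark-up: admit only a fraction of traffic first.
        void markUp(double fraction) { admitFraction = fraction; }

        boolean allowRequest() {
            return random.nextDouble() < admitFraction;
        }
    }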
