Application Availability:

Introduction
Measuring Availability
Availability Defined: User Relevance and Measurement Utility
Measurement and its Discontents
What to Measure
Classifying Service-Level Metrics
SLIMTAX Definitions
SLIMTAX Features
Moving Up and Down the Hierarchy
Applying SLIMTAX to an Application Stack
SLIMTAX: Measurement versus Management
Test Frequency, Timing and Performance in SLIMTAX
Metrics and Tools
Defining a Service-level Indicator
Defining a Synthetic Transaction
Conclusion
Acknowledgements
About the Author

As application availability becomes increasingly important, companies need guidance about what is an acceptable level of availability in their organization. Even companies that manage to achieve the highest levels of availability do not always reap the greatest benefit from their investment. To maximize availability economically and effectively, it is essential to utilize proven methods for measuring and monitoring it.

Application availability is widely sought after as a requirement for applications delivered over networks. While users know what they want - continuous application access with predictable performance - it's often difficult to establish concrete measures that show whether the service providers charged with delivering the application over a network can meet user requirements. Sifting out relevant, actionable data about availability is often as complicated as making decisions and acting based on that information. In order to structure discussion around definitions and appropriateness of measures, this paper sets forth:
As business becomes increasingly dependent on technology and information, availability is a universal concern for every business, in every industry. Functions such as data warehousing, data mining, enterprise resource planning, and e-mail are essential to conducting daily business. And globalization means there are no more periods of "acceptable" downtime. At any time of the day or night, somewhere in the world, customers and vendors need access to your corporate information. If they can't get it, they'll go elsewhere - creating an opportunity for your competition. For these reasons, Forrester believes availability is only part of the story and that the quest for customer satisfaction needs a new measurement: quality of experience (QoE).

But ironically, as the Internet demolishes many of the boundaries between IT and business, the measures that account for their operations are diverging. Traditionally, IT infrastructures, particularly platform resources, have accounted for how much work is getting done by systems using the same metrics as they use for control of resource management - CPU utilization, queries processed per hour, I/O operations, network packets, and the like. However, as systems become more distributed and networked, and as end users in 24 time zones access systems round the clock, they want to drive the measure of system availability since it affects their work immediately and directly. End users regard the contribution of IT infrastructure in terms of the value that it delivers, not operational metrics.

In fact, Forrester concluded that customer satisfaction will require more than just availability. Sophisticated online firms will cap infrastructure spending after achieving 99.9 percent availability, and use the money saved to enhance their QoE instead.
This is partly because, as Forrester predicts, a firm that spends more than $3.6 million to upgrade its site from 99.99 percent to 99.999 percent availability will only capture about $3,000 more revenue - and even those results are overstated because availability does not address repeat customer needs.

The renaissance of Service Level Agreements (SLAs), which were once - like many aspects of centralized Internet computing - the province of mainframe host-based environments, is driving IT management and end users alike to seek a common definition of their shared objectives, without creating undue operational dependence between their domains. That common definition is application availability measurement (AAMe). From the end-user perspective, it does not replace resource management, capacity planning, change management, performance analysis, or any of the other many practices that are the metier of the disciplined, mission-critical shop. However, in establishing and maintaining the value of an application to its users, none of these other disciplines can represent the entire system to the users as they see it. Hence the notion that QoE could replace quality of service (QoS) as a critical site measurement.

This paper will provide ways to identify meaningful indicators of application availability. It presents a generalized model for creating measures of application availability in user terms, and validating the application's value to the users. Specifically, it covers:
Some considerations in developing measures

The terms service, application, and system are used fairly interchangeably; while some might argue for more semantic precision, the three notions are so fluid that their meanings are almost the same. But to the extent precision can be applied, system represents an end-to-end application environment. Service is defined as an application delivered over a network; it is the substrate of measurement for availability. Service-level indicators, or service-level indicator metrics, are the results of tests that validate the service.

How available is available?

For an entire year of uptime - 365 days times 24 hours times 60 minutes equals 525,600 minutes - uptime can be represented as "nines," as in the chart below. One handy way to think of nines in a 365x24 year is in orders of magnitude: five nines represents about five minutes of downtime; four nines represents about 50 minutes; three nines, about 500 minutes, etc. Every tenth of a percentage point per year is roughly 500 minutes of downtime. Of course, for services that don't need to operate 24 hours a day, seven days a week, such as factory-floor applications in a single location, the outage minute numbers will vary based on the local operational window.

It should be readily apparent that getting past one minute of downtime per week can be quite an expensive proposition. Redundant systems that double the hardware required - in extreme cases, down to specialized fault-tolerant processes that compare instructions at every clock - and complex software that can handle the redundancy are just the beginning. The skills needed to deal with the complexity and the system's inability to handle change easily also drive up the cost. Moreover, experience shows that people and process issues in such environments cause far more downtime than the systems themselves can prevent. Some IT operations executives are fond of saying that the best way to improve availability is to lock the data center door.
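The arithmetic behind the nines can be checked with a short calculation. The sketch below is illustrative only: the function name and the factory-floor window (16 hours a day, five days a week, 52 weeks a year) are assumptions chosen for the example, not figures from any particular site.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365x24 year


def downtime_minutes(availability_percent, window_minutes=MINUTES_PER_YEAR):
    """Return the downtime allowed, in minutes, for a given availability
    percentage over an operational window."""
    return window_minutes * (100.0 - availability_percent) / 100.0


# Downtime budget per year for a service that runs around the clock.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} minutes per year")

# Hypothetical local operational window: 16 hours a day, 5 days a week.
FACTORY_WINDOW = 52 * 5 * 16 * 60
print(f"99.9% over the factory window -> "
      f"{downtime_minutes(99.9, FACTORY_WINDOW):.1f} minutes per year")
```

Passing a smaller window shows how the same percentage translates into a smaller absolute downtime budget when the service is not required around the clock.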
Figure 1. Table of fractional outages

Be that as it may, any foray into high-availability goal setting should begin with a careful analysis of how much downtime users can really tolerate, and what is the impact of any outage. The "nines" are a tempting target for setting goals; the most common impulse for the casual consumer of these "nines" is to go for a lot of them. Before you succumb to the temptation, bear in mind one thing: you can't decide how much availability you need without first asking "availability of what?" The concepts presented here should better prepare you to answer that question; once you've answered it, you can make more constructive use of your downtime target. As your availability goals mature, you'll find it more productive to choose user downtime targets rather than snappy formulations of uptime.

Availability defined: User relevance and measurement utility

Measurement and its discontents
Measurement itself consumes resources and can distort the measured event or element, making AAMe inherently an imperfect indicator. So at a minimum, to realize the value of AAMe it is necessary to make certain that the measured application has enough capacity - i.e., processing cycles - to sustain measurement. And measures must be selected in such a fashion that their impact on the system is tolerable. Be that as it may, for any dynamic system, no momentary snapshot can create a perfect measure of the application's availability. When the cost of measuring is easier to demonstrate than the benefit, it raises the bar for any benefits that might accrue. But introducing slightly imperfect measures into a highly imperfect system does not necessarily disqualify the act of measurement.

Where should an application be measured? Viewed as a service, an application delivered over a network can be understood in logistical terms, at its endpoints. To better understand the endpoints of a service, consider package delivery - the kind of package you can wrap in brown paper and hold in your hand. In the early days of transportation, the term FOB (free on board) was coined to describe the accountability for goods at any given point in the journey between seller and buyer. For manufactured goods, "FOB Factory" meant that a purchaser took ownership of the finished product from within the factory, through its transport, i.e., that transport was not the responsibility of the shipper, but of the receiver.

Most end-to-end services are composed of subsidiary services. Let's take another simplified example: delivery of fresh fish from ocean to dinner plate. The ultimate measure of fresh fish delivery is whether it tastes good when you eat it - another example of QoE.
But in real life, the logistical problems of fish delivery - specifically, measuring when the fish you'll eat for dinner got from point A to point B on its journey from ocean to plate - are the subject of contract relationships between independent service providers, stated in measurable, service-level terms. It is worthwhile to know when your fish left the ocean for the net and the net for the ship's hold, if it was frozen, who cooked it, and so on. Delivery from any one point in the chain to another may have multiple owners. Each point that controls change between service providers - fishermen, shippers, grocers, chefs, or waiters - is an implicit measurement point.

The analogy to service delivery, especially over the public Internet, lies in limitations on control and accountability for certain portions of the service-delivery chain. Can a service provider deliver relevant AAMe metrics over an uncontrolled transport such as a public network? The answer may be no. But in every case, there's a boundary, up to which the service provider can take ownership, be measured and held accountable. And any given end-to-end service can be decomposed into subsidiary services. This principle, of decomposition into component services, can be applied to most applications, regardless of whether they are Internet-enabled.

In the diagram below, a stylized, end-to-end architecture (or stack) for a Web-based application can be decomposed into a set of measurement points for service-level indicators. A user or client of the application performing a transaction depends on all the lower-level layers to complete a transaction. In this case, a user or client (i.e., a human or a browser) establishes a connection with a Web server over a network. The Web server connects with the application server, which processes business logic. The business logic in the application server connects to the DBMS for data retrieval as appropriate. And, of course, the DBMS runs on the operating system; it is only as available as the operating environment on which it executes.

Service availability can be measured or tracked only as a subset of the complete, end-to-end stack. With a design that allocates sufficient independence between layers, it's possible to speak of the availability of a series or set of services, each of which is a subset of the user's requirement to be up and running from end to end.
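The layered dependency described above can be sketched in a few lines. This is a simplified illustration rather than a model of any real product: the layer names, the strictly linear ordering, and the absence of redundancy are all assumptions made for the example.

```python
from typing import Mapping

# Stylized Web application stack, lowest layer first (names are illustrative).
STACK = [
    "hardware",
    "operating system",
    "DBMS",
    "application server",
    "web server",
    "network",
    "client",
]


def service_available(layer_up: Mapping[str, bool], top_layer: str) -> bool:
    """A service measured at top_layer is up only if that layer and every
    layer beneath it in the stack are up."""
    top_index = STACK.index(top_layer)
    return all(layer_up[name] for name in STACK[: top_index + 1])


# Example: everything is running except the DBMS.
status = {name: True for name in STACK}
status["DBMS"] = False

print(service_available(status, "operating system"))  # OS service, measured alone
print(service_available(status, "DBMS"))              # end-to-end database service
print(service_available(status, "client"))            # complete end-to-end measure
```

In this sketch the operating system service still counts as available while every service measured at or above the DBMS does not, which is why a platform-level uptime figure says little about what the user experiences.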
Figure 2. Service Decomposition

1. Operating system service on hardware, presuming hardware availability. Most platform vendors that claim 99.9 percent uptime are referring to this.
2. End-to-end database service, presuming operating system and hardware availability.
3. Application service availability, including DBMS, operating system, and hardware availability.
4. Session availability, including all lower-level layers.
5. Application server divorced from the database. In this scenario, the business logic and connectivity to a data store are measured (and managed) independently of the database component. Note that a combination of (2) and (5) is essentially the same as service (4) to the user/client.
6. A complete, end-to-end measure, including the client and the network.

While the notion of a service implies the network, it is included in this diagram to show that you can establish the measure of availability for the stack as a whole with or without the network. For Internet-based applications, separating the network is important, because service providers can rarely, if ever, definitively establish and sustain service levels across the public network. Moreover, when a user connects across the Internet, it's important to understand how much of the user experience is colored by the vagaries of the Internet, and how much is under the direct control of operational staff.

Decomposition into services is the first step toward defining what availability is measured, and why. As will be seen, indicating end-user availability over time does not require every service component to be measured and tracked separately.

Comparison of feedback techniques
Figure 3. Feedback Mechanisms

In simplest terms, a feedback loop is composed of reporting and intervention, undertaken at certain intervals. Reporting frequency characterizes event sampling - how often, and how immediately, are events known? Intervention frequency characterizes action based on event data - how often do you draw conclusions from the data, and how immediately can you intervene to make changes to the system, based on your conclusions?
One consideration is the value of information. The relationship between bookkeeping, accounting, and audit of an organization's financial operations is one example. Certainly, many of the same tools and techniques apply at all three levels. One key difference is in the audience: rarely do CEOs require detailed, day-to-day information; they seek a high-level indicator, such as profitability or cash flow, to provide audit information about the health of the business and draw broad strategic conclusions. At the same time, the operations staff needs an intimate understanding of how individual data impact profitability to make the higher-level measure meaningful. Now, let's see how this notion applies in the services context.

Classifying Service-Level Metrics

The best AAMe indicators track real work by real users as closely as possible. Most dot coms and data centers have service-level objectives of one sort or another that characterize system behavior; some formally quantify these objectives, either for internal management and alerting or as part of formal SLAs. Such objectives take the form of database uptime, correlated output of system management tools, delivered bandwidth and data streams, and a variety of levels in between. But which of these, if any, are useful in measuring application availability?

To this point, we've considered availability as an attribute of a service, as in, "Is it available?" In fact, availability is itself a service level, with quantifiable, measurable levels of attainment. A formal definition might be: a measure that checks the behavior of a system, using consistent tests repeated at set frequency over time, comparing accumulated test results with a goal. Such a measure would be expressed using the formulation, "test every sixty seconds, with a maximum of 50 test failures per week," to indicate 99.5 percent uptime.
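The formulation "test every sixty seconds, with a maximum of 50 test failures per week" can be written down as a small data structure. The sketch below is one possible encoding; the class and attribute names are hypothetical, and it assumes each failed test stands for one full test interval of downtime.

```python
from dataclasses import dataclass

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


@dataclass(frozen=True)
class ServiceLevelIndicator:
    """A consistent test repeated at a set frequency, compared with a goal."""
    test_interval_seconds: int
    max_failures_per_week: int

    @property
    def tests_per_week(self) -> int:
        return SECONDS_PER_WEEK // self.test_interval_seconds

    @property
    def implied_uptime_percent(self) -> float:
        # Assumes every failed test represents one interval of downtime.
        return 100.0 * (1 - self.max_failures_per_week / self.tests_per_week)

    def goal_met(self, observed_failures: int) -> bool:
        return observed_failures <= self.max_failures_per_week


sli = ServiceLevelIndicator(test_interval_seconds=60, max_failures_per_week=50)
print(sli.tests_per_week)                    # 10,080 tests in a week
print(round(sli.implied_uptime_percent, 1))  # about 99.5 percent
print(sli.goal_met(observed_failures=37))
```

Sixty-second tests give 10,080 samples per week, so 50 permitted failures correspond to roughly half a percent of the samples, which is where the 99.5 percent figure comes from.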
But taken individually, it may be difficult to translate these result measures into positive indicators of service availability.

Goal of service-level metric characterization
To better characterize what makes a useful service-level metric, this paper formulates a simplifying hierarchy for ranking levels of application availability and service-level metrics. The model, the service-level indicator metric taxonomy (SLIMTAX), classifies metrics by how and what they test. An important benefit is that as an organization's service-level tracking capability matures, SLIMTAX provides a roadmap for shifting application availability measurement up the hierarchy to levels that represent throughput and user work. The SLIMTAX hierarchy incrementally adds features of an application service from the bare minimum of application existence, up through network delivery, service-level thresholds, and complete user-centric status measurement.

A0: Key process exists locally. For example, a list of processes shows HTTP is running. Some applications have multiple processes, so just looking for those that show an application is up may not be enough.

A1: Local state. Key process can work locally with inputs and produce correct output. For example, this is how a server cluster tests whether the applications it hosts are running. At this level, the test is local; in some instances, it is possible to derive availability from a passive measure, such as scanning log files.

A2: Remote session. User can establish access (log in) over a network. Here, the notion of a synthetic transaction is introduced, though it need only run at intervals shorter than target failover times to expose a failure. For example, an application that fails over in 15 minutes can show its availability in an A2 test that runs every 10 minutes, as the 15-minute failover will register as an outage.

A3: Transaction response time. Key business operations are performing at a given rate. Here, a service-level threshold or objective is used to measure whether a sufficient fraction of key transactions completes quickly enough. For example, 99.9 percent of the monitored transactions complete in eight seconds or less.

A4: User work. Key population of users or clients is performing given units of user work over time. Such an indicator would account for 200 active users sending and receiving an average of 30 e-mails each per hour. This shows that an e-mail system delivers 12,000 messages per hour with a given population. This measure can be based on session logs or instrumented clients, with a closed loop that captures a user-centric picture of the end-to-end application.

It's important to understand that the SLIMTAX hierarchy doesn't measure availability; it provides a metametric, i.e., a framework for comparing metrics. In service-level contracting, it can establish differences in requirements between end users and service providers, internal or external. Similarly, in architecting a systems and tools environment, SLIMTAX makes it possible to identify where feedback techniques for availability are applied, and to distinguish one feedback technique from another.

An availability metric that shows operating system availability is a level A0 metric (at best) because it doesn't test if the operating system is doing something; it just checks if the operating system is there. Similarly, many systems management tools check application availability just by looking to see if a certain process exists. Another example: to determine the failure of a subsystem, such as a node within a cluster, and take appropriate remedial action, an A0 or A1 measure may be adequate; test failures may be sufficiently well-defined to trigger automated recovery. But for other service-level consumers, the fact that a cluster is available at the A0 or A1 level will be understood not to encompass user-level metrics.

The SLIMTAX hierarchy also helps scope demarcation for bottom-up, top-down availability feedback techniques. In other words, measurement at the A0 level maps to interventions that are feasible at the local host level. Moreover, to the extent that the application stack includes built-in redundancy to mask system failures from the users - as in a redundant server node running an application in standby mode - the metric can likewise mask underlying system failures that are not immediately visible to the user. With respect to user impact and user-experienced availability, the lower ranks are less informative, and higher ranks subsume lower ranks.

A corollary of the subsystem/supersystem demarcation is that availability metrics are inherently not comparable across levels. In other words, 98.2 percent availability at A1 cannot be compared with 99.2 percent availability at A2; the A1 level may not cover redundancies that mask failures at the next level up.

Moving up and down the hierarchy
From A1 to A2, a metric adds information demonstrating the existence of a service, rising to meet the definition of a service as an application delivered over a network. Note that because the network is inherent from this level and above, there's no need to measure network availability alone, independent of application traffic making its way back and forth from the application. Performance is measured only with respect to failover time.
From A3 to A4, a metric adds user populations as the necessary complement to its response time requirements. For most systems, there's a material difference between one user driving 1000 transactions each minute, and 100 users driving 10 transactions each minute. By translating system work directly into user impact, the A4 metric provides the most complete indication of the impact of availability on the consumers of a service. Conversely, when a system's utilization drops within a user population, the impact of downtime is adjusted appropriately.

In a perfectly instrumented system, the A4 metric would be enabled in closed-loop fashion, so administrators and end users would know exactly what throughput the system was achieving at any given time, in terms of user work. For example, administrators could log browser error messages on user workstations or PCs. For systems that are not perfectly instrumented, A2 and A3 synthetic transactions provide the most representative picture of how much work the system is doing. However, for many applications, particularly those with named users, it's possible to establish exactly how many users are on the system at any given moment and derive an A4 metric by observing key transactions and their response time.

At the A0-A2 level, most indicators do not provide positive information about system availability, though they can show when work is being done. The lack of transactions does not mean that the system is down; it may mean that no users are performing transactions, or that the entire population of users is on a lunch break. But it may be possible to correlate a set of passive data measures into a record of how much work is being done at any one time. In this respect, log data can show how many transactions were completed over a particular period, albeit an incomplete positive indicator of application availability (the lack of logged operations does not mean the system was down).

Applying SLIMTAX to an application stack
SLIMTAX can be represented as a simple matrix that represents the service components of the stack along the vertical axis, and the hierarchy of result measures (A0-A4) along the horizontal axis, as illustrated in Figure 4.
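The matrix can also be sketched as a simple data structure. The example below is illustrative: the enumeration and function names are hypothetical, the component list is a stylized stack, and the three sample metrics are drawn from examples given in this paper rather than from the figure itself.

```python
from enum import IntEnum


class Slimtax(IntEnum):
    A0 = 0  # key process exists locally
    A1 = 1  # local state: process produces correct output locally
    A2 = 2  # remote session can be established over a network
    A3 = 3  # transaction response time meets a threshold
    A4 = 4  # user work: a user population performs units of work over time


# Service components of the stack, lowest first (illustrative names).
COMPONENTS = ["operating system", "DBMS", "application server",
              "web server", "end to end"]

# Each metric is placed by where in the stack it tests and how it tests.
metrics = {
    ("web server", Slimtax.A0): "HTTP process appears in the process list",
    ("DBMS", Slimtax.A2): "database session established over the network",
    ("end to end", Slimtax.A3): "99.9 percent of monitored transactions "
                                "complete in eight seconds or less",
}


def render_matrix(placed):
    """Render components on the vertical axis (top of stack first) and
    result-measure levels A0-A4 on the horizontal axis."""
    rows = []
    for component in reversed(COMPONENTS):
        cells = ["x" if (component, level) in placed else "." for level in Slimtax]
        rows.append(f"{component:<20}" + " ".join(cells))
    return "\n".join(rows)


print(" " * 20 + " ".join(level.name for level in Slimtax))
print(render_matrix(metrics))
```

Marks toward the lower left correspond to server-centric measures; marks toward the upper right correspond to user-oriented ones.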
Figure 4. Hierarchy of Result Measures

What the chart represents is a roadmap from server-centric, infrastructure-focused measures of uptime, on the lower left-hand corner of the matrix, toward user-oriented measures, represented on the upper right. Most measures of application availability can be mapped into this hierarchy based on how they test for service availability (A0 through A4) and where in the stack they test for it.

Operational groups just beginning to take on the challenge of user-oriented availability measures are often tempted to focus only on measuring end-to-end, service-level availability because they believe, "That's what users care about." It's important to recognize that it's often difficult to establish these kinds of measures at one stroke. Targeting a metric that represents only part of the stack or measures a certain degree of user activity is a better way to achieve solid incremental progress. It's also useful to move up the stack within a single result-measure class; for example, if you have a test and a metric that show A2-level database session availability over the network, a logical next step would add A2 application server availability measurement over the network. It's not necessary to move directly from A2 to A3 in order to produce a useful, end-to-end measure.

Moving a test further up the application stack provides more information to the service consumer. Ironically, it provides less actionable information to the service provider. For example, an end-to-end test might:
Depending on the response time criterion for the test, it measures the service at either the A2 or A3 level. If the test passes, the service is available; if it fails, the service is not available. Designing this test correctly requires some analysis of the application architecture to ensure that data retrieved from the database is not cached on the application server, to avoid masking a database failure.

SLIMTAX: Measurement versus management
Contrast this with the management perspective. A diagnostic view of this test would take into consideration results at each step of the way. For an administrator, having such data available can help deal with an outage, either with some policy-based automated recovery mechanism or as an after-the-fact diagnostic. Or, to make certain that the test result can be delivered, the administrator could complement the end-to-end test with a set of A0-A2 tests, either in real time as an automated recovery mechanism, or as diagnostics for dealing with an outage. It's up to the administrator and the application architect to make a number of choices:
This service-level test and its complementary diagnostic or management indicators are mapped onto the SLIMTAX matrix shown in Figure 5. Note that the set of diagnostic examples also shows the possibilities for applying the principle of service decomposition to create service-level targets within the application stack that capture its constituent parts. The database cluster heartbeat is a good example of management information that can be inferred from the system architecture, as a cluster is a local service level unto itself. While it may be part of a management mechanism with automated policies, it isn't difficult to extract information about whether those policies are working and report this as part of the availability measurement effort.

Test Frequency, Timing and Performance in SLIMTAX

Sampling frequency
Failure, interruption, recovery time, outage, timeout - all these terms represent the interval during which the application is not available. Since one goal of AAMe is to represent how long an application is up, a good test identifies when the application stack is not available. For instance, given an architecture designed to recover from an outage in twenty minutes, the implied uptime goal is to keep interruptions under twenty minutes. Consequently, the service level measure should test more than once every ten minutes, to make certain it never misses an outage. This is a good place to apply an A2 metric; so long as the test can establish a remote session into the monitored system every two minutes, the application is up. More than 10 successive failures of the test indicates that the application is missing its recovery-time target.
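The recovery-time check described above amounts to counting successive failed tests. The sketch below uses the figures from this example (an A2 test every two minutes against a twenty-minute recovery target); the function names and the list-of-booleans representation of test results are assumptions made for illustration.

```python
def longest_failure_run(results):
    """Length of the longest run of consecutive failed tests.
    results is a sequence of booleans, True meaning the test passed."""
    longest = current = 0
    for passed in results:
        current = 0 if passed else current + 1
        longest = max(longest, current)
    return longest


def missed_recovery_target(results, test_interval_minutes=2,
                           recovery_target_minutes=20):
    """True if successive failures span more than the recovery-time target."""
    allowed_failures = recovery_target_minutes // test_interval_minutes
    return longest_failure_run(results) > allowed_failures


# Ten successive failures at two-minute intervals: still within the target.
within_target = [True] * 5 + [False] * 10 + [True] * 5
# Eleven successive failures: the application has missed its target.
over_target = [True] * 5 + [False] * 11 + [True] * 5

print(missed_recovery_target(within_target))
print(missed_recovery_target(over_target))
```

Shortening the test interval narrows the uncertainty about when an outage began and ended, at the cost of more test traffic against the monitored system.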
Figure 5. Applying the SLIMTAX hierarchy to a service-level objective

Outage duration is not the only consideration in deciding how often to test if the system is up. Take a service that is down four times in one hour - once every 15 minutes - three minutes at a time. If a user tries to get onto the system at exactly the same 15-minute intervals as the outage, it may appear that the system has been down for an hour. For users, frequency of outages matters as much as (or more than) duration.

Note, again, the difference between management and measurement. From the system management perspective, even one failed A2 test should trigger an intervention, manual or automated. Measurement assumes that those responsible for management will try to do the right thing, and checks whether they have succeeded or not.

When performance is an availability issue
Viewed this way, distinctions between measurement of performance and availability blur somewhat. However, erasing this distinction clashes with some fairly well-established measures of system behavior. The challenge is in identifying given measures of availability and performance that can provide an indicator of overall system work. SLIMTAX accounts for this in the distinction between A2 and A3 metrics. Implicit in this distinction is a timeout; if an A2 session test cannot be established within the time that the application needs to recover from a failure, it's a safe bet that there is a failure.

For transactional environments, performance targets for transaction completion time are denominated in seconds; recovery from failure is slower, denominated in minutes. For batch jobs and other long-running transactions, completion time may be significantly longer. In these cases, other indicators such as work rates, rows processed, and tables backed up may be more appropriate.

How does one account for slow performance as an availability problem? Since it's difficult to tell whether a single transaction that misses its target speed indicates a general performance problem, the system needs a performance profile that targets a performance threshold. At the A3 level, such a threshold could be characterized as a minimum of 9000 transactions per hour, with an average rate of 15 seconds per transaction. One way to determine whether the system hits its performance target is to sample completed transaction logs retroactively, and inspect to see whether they were completed on time. Given the sensitivity of most administrators to performance issues, such metrics are monitored much more closely. Moreover, they provide excellent evidence of success or failure.

An important aspect of availability performance is measurement of degraded operations. This can be applied in several ways.
Administrators report that users know when the system is experiencing a traffic jam, even if they have yet to determine the cause. It may be possible to specify a target limitation for degraded operations, as in "98 percent of all synthetic transactions must complete within 15 seconds; no more than five percent may complete in more than 15 seconds, but less than 20 seconds."

However, inspecting transactions locally doesn't show whether end users were able to complete transactions successfully. A more useful measure of application availability would drive a synthetic transaction - a set of user-level operations, scripted and driven by an automated tool that captures the result of the transaction, including elapsed time. Designed correctly, such a synthetic transaction shows when a system is performing at the required level without actually creating a great deal of overhead on the system. A few synthetic transaction users placed around the network will reveal a great deal about the end-to-end performance of the system.

Using synthetic transactions, outages can be declared when the system is performing at less than its target performance rate for a specified period of time. This service-level objective would state that 98 percent of all synthetic transactions must complete within 15 seconds. Most Web site and Internet monitoring takes this form, using synthetic transactions to drive traffic at a Web site to emulate the end-user experience.

The most complete measure of performance as availability accounts for users on the system as well as the rate of their work proceeding through the system, designated by SLIMTAX as an A4 metric. Some infrastructures lend themselves to complete instrumentation, so administrators know in real time exactly how many concurrent users are doing how much work on the system.
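An objective such as "98 percent of all synthetic transactions must complete within 15 seconds" can be evaluated over a window of captured results. The sketch below is one way to do so; the function name is hypothetical, and it assumes a transaction that failed outright is recorded as None and counted as a miss.

```python
def meets_response_objective(elapsed_seconds, threshold_seconds=15.0,
                             required_fraction=0.98):
    """Check an A3-style objective over a window of synthetic transactions.

    elapsed_seconds holds one entry per synthetic transaction: the elapsed
    time in seconds, or None if the transaction did not complete.
    """
    if not elapsed_seconds:
        return False  # no samples, so no positive evidence of availability
    on_time = sum(1 for t in elapsed_seconds
                  if t is not None and t <= threshold_seconds)
    return on_time / len(elapsed_seconds) >= required_fraction


# 99 of 100 transactions on time: the objective is met for this window.
good_window = [4.2] * 99 + [18.0]
# 97 on time, two slow, one failed outright: the objective is missed.
bad_window = [4.2] * 97 + [18.0, 19.5, None]

print(meets_response_objective(good_window))
print(meets_response_objective(bad_window))
```

Declaring an outage would then be a matter of how many consecutive windows miss the objective, a period that the service-level agreement would need to specify.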
Where complete instrumentation is not available, the number of concurrent users and their work rate can be inferred, either live or retroactively, by looking at active, concurrent users (those who submit input at least once every 15 minutes, for example) and comparing them to the total number of transactions, or by correlating that number with the rate of synthetic transactions.
Distinguishing architecture and operations from AAMe and service levels
Because service providers and IT operational staff are accustomed to viewing their infrastructure through the lens of management, management tools have emerged as the preferred technique for measuring system and application availability. This tendency is more the product of tactical operational considerations - i.e., given a problem, find an action to be taken - than of tracking whether a system meets its desired availability goals. Knowing the behavior of one element of a system - hardware MTBF, disk utilization, network traffic - won't represent the behavior of application and operating system software or expose dependencies between those layers. The flaw at the heart of this idea is not that management tools lack value. Rather, it is that they do not adequately represent whether user work is being undertaken and completed. In other words, does the end-to-end system provide continuous application access with predictable performance? For most networked application environments, the most useful technique to apply is measurement. While it may seem obvious, service providers must realize that exposure of management and monitoring information internal to a service is unnecessary, so long as service consumers are not directly involved at the operational level. Any consumer of a service is more concerned with the attainment of service-level targets than with the underlying implementation. The challenge is to understand which metrics provide the necessary information to the service consumer, and to select measures that keep both sides focused on service-level attainment. Of course, the administrator is not alone in deciding how to manage the components of the service; it's up to the systems architect to account for the service level and manageability architecture.
In an ideal world, the two communicate; in reality, it's often up to the administrator to make inferences about the architecture in choosing the most useful management information. An implicit theme in the discussion of measurement is the virtue of simplicity. The primary question is not how but what to measure. Let's explore different approaches one might take to specific measurements of a service.
Defining a service-level indicator
The practice of testing for service levels has many elements in common with the broader discipline of software and system testing. In fact, several test tool vendors are entering the market for service-level monitoring, since the technology for service-level measurement and monitoring is very similar to that used for test automation. This is particularly true of test design, since it is important to apply many of the same architectural analysis skills. For example, when a test transaction selects a record from a database, which tables does it select from? Are these tables the ones most likely to show that the database is having a problem? However, testing for service levels need not be as complicated as regression, stress, or system envelope testing. For simple applications with Web interfaces, an automated test can use either a Perl script or a servlet that logs its results to a flat file, spreadsheet, or database. Creating an A2 metric for database availability, for example, could be implemented by embedding database calls in the test (JDBC in a servlet, or DBI in a Perl script) and setting a flag based on the correctness of the returned output. Of course, programming, maintenance, and attention to results are still required to ensure that the test itself isn't broken. It's also possible to create synthetic transactions with formal automated testing tools. Such tools typically have more than capture/replay capabilities, adding language facilities to create logic in automated test scripts that can handle exception conditions, deal with conditional inputs and outputs, and log data in a repository. To facilitate data collection and analysis, it's also useful to store outputs in a data store (or spreadsheet, at a minimum). This data store should enable you to easily retrieve and present trend information. A practice that many operations personnel find useful is to post a dashboard that shows the state of key services on the organization's intranet.
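An A2 database check of the kind described above can be very small. The sketch below is written in Python rather than Perl or a servlet, and uses the standard-library `sqlite3` module as a stand-in for the production database and its driver; the function names, the `probe_log` table, and the `orders-db` service name are all hypothetical. It runs one query, sets a flag from the correctness of the returned row, records the flag in a data store, and reports the percentage of successful probes as simple trend information.

```python
import sqlite3
import time


def probe(connect, query, expected_row):
    """A2 check: True only if the query runs and returns the expected row."""
    try:
        conn = connect()
        try:
            row = conn.execute(query).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        return False  # cannot connect or query: the service is not there
    return row == expected_row


def record(store, service, ok, when=None):
    """Append one probe result to the data store."""
    store.execute(
        "CREATE TABLE IF NOT EXISTS probe_log "
        "(service TEXT, ts REAL, ok INTEGER)"
    )
    ts = time.time() if when is None else when
    store.execute("INSERT INTO probe_log VALUES (?, ?, ?)",
                  (service, ts, int(ok)))
    store.commit()


def percent_available(store, service):
    """Percentage of recorded probes that succeeded, or None if no data."""
    row = store.execute(
        "SELECT AVG(ok) * 100.0 FROM probe_log WHERE service = ?",
        (service,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    store = sqlite3.connect(":memory:")           # stand-in results store
    target = lambda: sqlite3.connect(":memory:")  # stand-in monitored database
    record(store, "orders-db", probe(target, "SELECT 1", (1,)))
    print(percent_available(store, "orders-db"))
```

In practice the query would select from the tables most likely to reveal a database problem, and the stored results would feed whatever trend report or status page the organization maintains.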
Such a dashboard allows any user who needs a service to check the traffic when using an application. Several of the automated service-level monitoring tools in the market have this facility built in. A good complementary practice for a Web-based dashboard is to provide links to service-level definitions, as well as prose descriptions of service-level targets, so a user who sees a service-level trouble indicator can also check what the service level covers.
Defining a synthetic transaction
Requirements for a synthetic transaction should address:
1. Service scope. What are the boundaries of the service level tested by the synthetic transaction? For example, does it include or exclude the local area network? Does it mask redundancies?
2. Geographic scope. On the network, where should the synthetic transaction execute? Executing a single synthetic transaction within the four walls of a data center at the same time the transaction runs across the corporate network (or the Internet, for that matter) can help establish the relative availability of the network compared to the service. Running the same transaction independently from multiple locations in the corporate network provides significant diagnostic information through triangulation and comparison of results. This also represents a significant opportunity to add operational diagnostic capability or, alternatively, to define a service-level objective that excludes the outer levels of the network.
3. Functional coverage. This is the core of any test: what subsystems and functionality are exercised by the operations performed during the test? What subsidiary components of the service stack are exercised by the synthetic transaction? For example, if a sample transaction includes data input and retrieval, does that data input cause the application server to do a "select" from the database to demonstrate its availability, or does it just retrieve cached information from a file server? Functional coverage analysis requires close collaboration between administrators and system architects, but with a strong bias toward end-user orientation. One good test of the functional coverage is to ask average users how they know the system is down, and see if they can readily perform the same transaction manually. It's also important to resist the temptation to create synthetic transactions that provide a wealth of diagnostic information; complexities introduced in the pursuit of optimization and troubleshooting data can make tests less robust and generate false negative results. The focus should be on a test that closely reflects what users do, in spite of the possibilities for optimization.
4. Randomized think time. Users don't fire off inputs to an application as fast as outputs emerge; they pause between entries, to think or for a cup of coffee. Within reason, the synthetic transaction should be able to vary the time that passes between key inputs to better simulate how users interact with the system.
5. User operation demarcation (start/end). Like any good test, a synthetic transaction must begin and end at a known system state. Again, the focus of the test is on completion, not diagnosis.
6. Response timeout. When there is an outage, what is the service's recovery target? This applies at two levels. First, for each step of the test, how long should it wait for the system to respond to a single input? Second, for the entire synthetic transaction, allotted completion time should also be a function of its SLIMTAX classification. A2 metrics can sustain longer response times, since it is necessary to establish only whether a service is there, not how quickly it responds. By contrast, A3 and A4 metrics should time out quickly, subject to the performance targets of the system. Again, brevity in a synthetic transaction is a virtue; many short transactions serve the purpose of measurement better than a few long-running ones. The higher the metric rises in the SLIMTAX hierarchy, the more frequently tests should execute.
7. Operational windows. When does the service being tested need to be available? A simplified way to account for operational windows is to manually review availability data accumulated by the synthetic transaction, and consider only those outages that took place within operational constraints. Alternatively, the synthetic transaction can be programmed to consult a table of service parameters and ignore outages at certain times.
8. Service-level goal: percent success. Appealing though it may be to set goals in terms of 100 percent, 24x7 uptime, both users and administrators will find it more constructive to work against outage budgets, denominated in minutes. Outage budgets can also be allocated to root causes following analysis of service-level attainment; network, application, and operating platforms can be allocated a certain fraction of the outage budget and independently managed to meet those targets.
As Forrester noted, additional "nines" will cost millions of dollars: a 99.999 percent available site can easily run up more than $6.9 million in capital costs and $2.9 million in operations (January 1999, Forrester Report "Nonstop eCommerce"). Many companies are playing catch-up just to reach an acceptable level of availability. Others are adding improvements to enhance site performance and increase functionality. Yet infrastructure costs just scratch the surface. Highly structured production environments with multiple tiers, several geographic locations, and dozens of servers can stifle innovation. This type of environment will also uncover staffing issues as firms struggle to find - and pay for - the highly skilled and expensive people who can build and maintain those sites. The search for ever-increasing availability can easily reach a point of diminishing returns. Instead of spending $3.6 million to achieve 45 minutes less downtime in a year, those funds might do a better job of attracting and retaining new customers if they were invested in a targeted advertising campaign.
Acknowledgements:
About the author:
Copyright (c) 2000-2003, nextslm.org. All Rights Reserved.