3 Results Information

Marc Robinson
Published Date:
October 2007

This chapter focuses on performance information on the results delivered by public expenditure. It commences with a primer on performance concepts. The chapter then examines certain aspects of the public sector production chain which can create uncertainty in the results which public expenditures might be expected to deliver. Following this, the limitations of performance measurement are discussed, and this leads into the role of program evaluation in adding important performance information which cannot be provided by performance measures alone. Finally, a number of key principles of implementation strategy are identified.

This chapter does not aim to provide detailed hands-on guidance on the design and implementation of performance information systems. This is because, as mentioned in Chapter 2, there is an abundance of literature on this topic, and also because performance information systems should be designed with multiple uses, and not only budgeting applications, in mind.

Core performance concepts

A review of core performance concepts is important because an analytic discussion of performance budgeting requires a clear conceptual framework. All too often, discussions of performance budgeting and management have been conducted at cross-purposes because of terminological misunderstandings and confusion.

The most fundamental concepts in the conceptual framework of performance management and budgeting are outcomes, outputs, activities, and inputs. An output is a good or service provided by an agency to or for an external party. Outcomes are the intended impacts of those outputs. Thus, for example, the medical treatment received by a road accident victim is an output, the intended outcomes of which are the preservation of the patient’s life and the minimization of any disability resulting from the accident. Similarly, investigations of crimes are a police output, and reduced crime the intended outcome of that output. Activities are types of work tasks undertaken in the production of outputs. The delivery of an output generally requires a set of coordinated activities of different types and in different quantities. (The term “process” is often used more or less synonymously with “activity,” and differences between the two terms are too subtle to be worth dissecting here.) Thus the treatment of the road accident victim may involve a combination of direct treatment activities such as surgery, nursing, and anesthesia, as well as supporting activities such as supplies and facility management.2 Inputs are resources used in carrying out activities to produce outputs (for example, labor, equipment, buildings).

Outcomes—to define the concept more precisely—are changes brought about by public interventions upon individuals, social structures, or the physical environment. It is common for an output to have more than one intended outcome. For example, the intended outcomes of school education include increased knowledge (the “student learning outcome”), a stronger economy, and a more harmonious society (through, for example, greater respect for the rights of others and for the law). This example also highlights the important distinction between high-level and proximate outcomes. Proximate outcomes are the more direct or immediate impacts of the output, whereas high-level outcomes refer to the ultimate objective or purpose of providing the output. The former are means of achieving the latter. In the case of school education, knowledge is a proximate outcome, whereas a stronger economy is a high-level outcome.3 Utility—the concept used in microeconomics to refer to the satisfaction or well-being derived from outputs—can be regarded as the highest-level outcome of all.

Although the use of the term “outcomes” to refer to all impacts of public interventions is predominant practice, there are some who seek to distinguish between “outcomes” and “impacts”—usually with something like the distinction between proximate and high-level outcomes in mind.4 This usage is not employed here.

This set of widely-understood and largely standardized concepts is referred to as the “logical (results) framework,” the interrelationships of which may be thought of as representing a public sector production chain (Figure 3.1).

Figure 3.1.Public sector production chain

Having defined the key concepts in broad terms, it is useful to pin them down more precisely. Outputs are the “products” supplied by government, although in using this term it must always be borne in mind that most government outputs are services rather than physical goods. Products—outputs—can only be said to have been produced when the production process is complete. One would not consider a car which was pulled off the production line before the engine was installed and wheels attached to be a “product.” The same applies to service outputs—if the medical team walks out in the middle of an emergency operation on the road accident victim, making it impossible to achieve the intended outcome of saving the patient’s life, no output has been delivered. Expressed differently, if it is to be considered an output, a service (or good) must be potentially capable of producing its intended outcome.

This sheds important light on the relationship between outputs and outcomes. In stating that a completed unit of output must be potentially capable of producing its intended outcome, the word “potentially” is important because there are many services where there is uncertainty about whether the intended outcomes will be achieved in each individual case. There may, for example, be considerable uncertainty about whether even the best emergency treatment of a severely injured road accident victim will succeed in saving the patient’s life. For such outputs, all that can be said is that the expected outcome is positive. It follows from this that it is inappropriate, for some types of services, to define the completion of the production of the output in terms of the achievement of the desired proximate outcome. In other words, outputs do not necessarily produce their desired outcomes.

This also helps clarify the distinction between activities and outputs, which is a matter in respect to which confusion is common. As noted above, service outputs generally involve bundles of activities. These constituent activities cannot normally be considered to be outputs in their own right because it is only when they are combined with the other activities that an intended outcome can potentially be achieved. Anesthesia, for example, may be of no potential benefit to a patient unless combined with other treatment activities such as surgery (likewise surgery cannot be delivered without appropriate anesthesia). Another reason why the distinction between activities and outputs is important is that in the case of collective services,5 a single activity or set of activities may deliver multiple outputs. For example, teaching is an activity rather than an output because the teaching activity of one teacher delivers a service to multiple students in that teacher’s class.6 The educational output is therefore students taught, rather than the teaching itself.

As noted above, the usual definition of an output refers to products delivered by agencies to or for external parties. The most important implication of this is that services provided within an agency to internal clients—such as the services provided by an agency’s human resources unit to its line divisions—are not in the performance budgeting literature generally considered to be outputs. The logic of this is that, to a car manufacturer, the services provided by its internal marketing or finance departments are not products—cars are the product.7 Confusion may best be avoided by referring to the services (or goods) produced for internal clients as “intermediate” outputs—with the understanding that when the term “outputs” is used alone, it refers to services provided to external, and not internal, parties.

The concept of output quality is a particularly slippery one, about which there is no general consensus. It is, unfortunately, not possible to review the issues involved here. Output quality may, however, be defined as the extent to which the characteristics of the product are such as to increase its potential capacity to generate its intended outcomes. For goods, the “characteristics” of the product refer principally to physical characteristics. For services, product characteristics refer to the nature, quantity, and mix of the activities which comprise the service output, as well as to the timeliness of the delivery of the overall service output. For example, high-quality treatment of a road accident victim means that the victim expeditiously receives the appropriate medical interventions. A crucial point about this definition is that it distinguishes between the quality of an output and the actual outcome which it achieves: in this usage, output quality increases the probability that the outcome will be achieved, but does not guarantee it.8 It is regrettable that the term “quality” is all too often used in a manner which equates it with outcomes.

Effectiveness, efficiency, economy, and equity

Performance management and budgeting distinguish between efficiency and effectiveness—concepts which are drawn from the public administration literature, and which are different in meaning from the terminology used by economists. Effectiveness pertains to outcomes, whereas efficiency relates to outputs. More specifically, efficiency refers to the degree of success in delivering a defined output at minimum cost while holding quality constant, given prevailing input prices.9 To cut the costs of issuing a certain regulatory approval by redesigning the relevant work process would, for example, improve the efficiency of that service (so long as quality is maintained).

Effectiveness, on the other hand, refers to the degree of success of a particular output or program in achieving its intended outcomes (thus a road safety awareness program is effective if it reduces road deaths). Cost-effectiveness increases if the same outcomes are produced at lower cost, given prevailing input prices. Cost-effectiveness may be improved either through more efficient delivery of existing services, or by changing the nature of the service. It is possible for an output to be efficiently produced but ineffective.
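The arithmetic behind this distinction can be made concrete with a small illustrative calculation. All figures below are hypothetical: efficiency is taken as cost per unit of output (campaigns delivered), cost-effectiveness as cost per unit of outcome (road deaths averted).

```python
# Hypothetical figures only: two road safety awareness programs.
# Efficiency concerns cost per unit of output; cost-effectiveness
# concerns cost per unit of outcome.

def cost_per_unit(total_cost, units):
    """Cost per unit of output or of outcome."""
    return total_cost / units

# Program A: cheaper per campaign, but weaker outcomes.
a_cost, a_outputs, a_deaths_averted = 1_000_000, 200, 10
# Program B: dearer per campaign, but better targeted.
b_cost, b_outputs, b_deaths_averted = 1_200_000, 150, 20

a_efficiency = cost_per_unit(a_cost, a_outputs)       # 5000.0 per campaign
b_efficiency = cost_per_unit(b_cost, b_outputs)       # 8000.0 per campaign
a_cost_eff = cost_per_unit(a_cost, a_deaths_averted)  # 100000.0 per death averted
b_cost_eff = cost_per_unit(b_cost, b_deaths_averted)  # 60000.0 per death averted

# A is the more efficient producer of outputs, yet B is the more
# cost-effective program: efficient production does not imply effectiveness.
print(a_efficiency < b_efficiency)  # True
print(b_cost_eff < a_cost_eff)      # True
```

On these invented numbers, program A delivers each campaign more cheaply, yet program B averts deaths at lower cost—the sense in which an output can be efficiently produced but comparatively ineffective.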

How do these terms relate to concepts used by economists? The term “effectiveness” is not used in the professional terminology of economics. Economists instead use the term “efficiency,” although this is a term which they use in two distinct senses.

The first of these is technical efficiency, which corresponds to the public administration concept of efficiency. Technical efficiency refers to the choice of the least-cost combination of inputs for the production of the good or service concerned, for given input prices. There are two prerequisites for technical efficiency. One is the absence of waste (which economists sometimes refer to, following Leibenstein (1966), as X-inefficiency10). The other is that the right mix of inputs is used.11 Technical efficiency is sometimes also referred to as productive efficiency or operational efficiency, and is also what the less formal term productivity usually connotes.

The other relevant economic concept is allocative efficiency. This subsumes, but is also broader than, the public administration concept of effectiveness. A standard definition of allocative efficiency is the production of the best or optimal combination of outputs given society’s valuation of those outputs. Allocative efficiency thus defined requires and subsumes technical efficiency. However, at the heart of the concept of allocative efficiency is the requirement that the overall mix of goods and services is such as to maximize social well-being (utility). The degree of allocative efficiency of government expenditure therefore depends not only on the effectiveness of government expenditures in achieving their intended outcomes, but also upon the extent to which (a) those outcomes are beneficial to society and (b) it might be possible to deliver even greater benefit to society by changing the output mix.12

Equity is not an outcome, at least in the sense defined in this chapter. It is, rather, a property of the distribution of outcomes, outputs, or expenditures. Society may have different notions of what constitutes equity in relation to different government services, depending on the relevant concept of what Sen (1992) calls “basal equality.” For some services, it might be thought that equity requires that each client receives the same level of service—that is, that there be an equal distribution of outputs. For other services, however, it may be considered that equity demands either equal outcomes, or at least that the dispersion of outcomes be contained within some acceptable range. For yet other services, an equal distribution of expenditure may be desired. These concepts of equity are often mutually exclusive. In school education, for example, the delivery of the same education to all students of a particular cohort (equality of outputs) will yield widely different student learning outcomes. If one wishes to reduce the degree of divergence of these outcomes, it is necessary to accept greater inequality of outputs (through, for example, additional services for disadvantaged students). Another example of the conflict between different concepts of equity is that equality of expenditure per student will not deliver equality of outputs (let alone of outcomes) for a range of reasons, including the fact that some students live in more remote areas where it costs more to deliver education services. A choice therefore needs to be made as to which concept of equity—or rather, what compromise between these competing concepts of equity—will guide the provision of education.
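The tension between equality of outputs and equality of outcomes can be sketched with a stylized additive model. All numbers and the linear outcome model below are invented for illustration: each student’s outcome is taken as a baseline (reflecting external factors such as home support) plus the service received.

```python
import statistics

# Stylized cohort: outcome = baseline (external factors) + service received.
baseline = [40, 50, 60, 70]          # e.g. differing home support
equal_outputs = [30, 30, 30, 30]     # the same service for every student
targeted_outputs = [45, 35, 25, 15]  # more service for disadvantaged students

def outcomes(base, outputs):
    """Hypothetical additive outcome model."""
    return [b + o for b, o in zip(base, outputs)]

eq_out = outcomes(baseline, equal_outputs)     # equal outputs
tg_out = outcomes(baseline, targeted_outputs)  # unequal (targeted) outputs

# Equal outputs leave the dispersion of outcomes untouched;
# unequal, targeted outputs compress it.
print(statistics.pstdev(eq_out))  # dispersion remains
print(statistics.pstdev(tg_out))  # dispersion eliminated in this toy case
```

In this toy case, delivering identical services preserves the initial dispersion of outcomes, while accepting unequal outputs (targeting) equalizes outcomes—precisely the trade-off between the competing concepts of equity described above.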

Controllability and predictability in producing results

In the public sector production chain, money is spent on acquiring inputs which are used to carry out activities, in order to produce outputs, and these outputs are in turn intended to generate outcomes. Performance budgeting seeks to strengthen the linkage between the starting point of this production chain (the money) and the end points (outputs and/or expected outcomes). In considering how far, and in what way, it is possible to link funding and results, an important consideration is the extent to which the results delivered via the production chain are controllable and predictable. There are, in this context, two important sources of uncertainty which need to be taken into account. One concerns the link between outputs and outcomes and the other the link between activities and outputs.

The link between outputs and outcomes

The characteristics of individuals, social structures, or the physical environment which public interventions seek to change are almost always subject to multiple influences outside the control of the unit of public administration, the performance of which is being examined. These influences are often referred to as external factors.13 External factors may be characteristics of the client—a familiar example of which is the impact of student characteristics (intelligence, motivation, parental and social support, and encouragement) on knowledge outcomes in education. Or they can be aspects of the context in which the public program is being delivered. Thus, for example, a labor market program, the intended outcome of which is to increase the rate of reintegration into the workforce of workers who have suffered an industrial injury, will need to recognize that hiring rates for the program’s target group are subject to the overarching influence of the state of the macroeconomy.

What constitutes an external factor varies depending upon the institutional focus. From the perspective of individual government agencies, external factors commonly include the actions and performance of other government agencies. For example, the educational effectiveness of individual schools will be influenced by the regulatory regime imposed upon them by the national or regional education ministry. Similarly, the effectiveness of the educational system as a whole will tend to be affected by the quality of anti-poverty policies (for example, are steps being taken to ensure that all children arrive at school with a full belly?). But from the perspective of the government as a whole, such influences are controllable and are therefore not external factors at all.

The consequence of external factors is that the “same” output may deliver different outcomes at different times, in different contexts, or when delivered to different clients. Moreover, to the extent that the impact of external factors is unpredictable, it may be difficult to specify ex ante what outcomes a service should be capable of delivering. External factors may also make the timeframe in which outcomes may be expected to be achieved uncertain. In some cases, the degree of such unpredictability in the relationship between outputs and outcomes may be high. To take an extreme example, there tends to be enormous ex-ante uncertainty about the outcomes (whether the perpetrator will be caught and punished) which can be expected from murder investigations. Similarly, it is very hard to predict in advance how much impact, and over what time horizons, to expect of public health programs which aim to reduce rates of cardiac and other disease arising from poor lifestyle habits. In general, high-level outcomes are more subject to the impact of external factors than proximate outcomes.

The link between activities and outputs

Most textbook microeconomic analysis focuses on homogeneous (standardized) products. For physical goods, homogeneity means that the physical characteristics of each unit of the product are essentially identical, as is the case for mass-manufactured commodities. A homogeneous (standardized) service may, equivalently, be defined as one where the set of activities delivered to each client are the same.14 In the public sector, motor license testing is an example of a highly standardized service—the activities required to administer a motor license test (administration of a written test, followed by observation of the candidate carrying out a pre-specified set of driving activities) are usually both precisely prescribed in regulation and essentially the same for all would-be drivers.

Very few public sector services are as standardized as mass-produced goods, but a significant number are (or are intended to be) quite standardized. Thus in many countries the aim is that government-supplied school education should be essentially the same for the bulk of students of each cohort—with, say, nearly all third-graders in public schools receiving pretty much the same education. By contrast, however, many public services are intended to be heterogeneous. Outputs are heterogeneous when there is a deliberate variation in the amount and/or types of activities received by each client/case of the “same” service, particularly in response to differences in client or case characteristics. An extreme example is police murder investigations. Differing circumstances make some murders easy to solve, while others are difficult or even impossible, and this leads to great variations in the activity devoted to investigating them. In acute medical care the treatment accorded to two different patients receiving treatment for the same condition—whether it be a heart attack or a broken hip—will often vary because of other factors such as the age and general condition of health of the patient. Some degree of heterogeneity often affects even relatively standardized services. Thus even if a national education system aims to deliver essentially the same education to the bulk of students, it is commonly also the intention that certain categories of the student population—such as children with disabilities—should be the beneficiaries of more intensive teaching and other services which are tailored to their specific needs. As this example suggests, the degree of heterogeneity in public service delivery depends not only on the degree to which client or case characteristics vary, but also on the extent to which policymakers and the community decide that they wish to tailor services according to those characteristics.

Heterogeneity is the deliberate variation in the level of service activity delivered to different clients/cases because there is a conscious objective behind the targeting of some clients/cases for greater activity than others. Very often—as in the provision of more intensive education for children with disabilities—the objective is to reduce the disparity of outcomes relative to that which would result if all clients received the same service. Reduced inequality of outcomes is, however, not the only motive for heterogeneity.15

Heterogeneity is relevant to performance budgeting because the greater the degree of heterogeneity affecting a particular service, the more uncertainty may arise in the relationship between activities and outputs (as well as between inputs and outputs). Consider, for example, a program aimed at providing emergency accommodation (for example, in foster care) for children who cannot, for some reason, be cared for by their own family. The duration (and type) of placements for such children can be expected to vary enormously—and in a way which is not usually predictable in advance—depending on the reason why their parents are unable to take care of them (for example, temporary illness, death, domestic abuse), and upon a range of other factors such as whether there are other family members who can be located and who can care for the child. (Moreover, an integrated approach to service delivery would treat as part of the same service the delivery of other support activities such as psychological support services, the need for which would also vary greatly between clients.)

Heterogeneity has another important implication—that for heterogeneous services, the concept of an output “type” (that is, the type of product) may be inherently ambiguous. For mass-manufactured products, there is no ambiguity, because a product type is defined as a class of physically identical products. The same is true for a standardized service such as motor license testing, because the prescribed activities are the same for each client. In the case of a highly heterogeneous service, however, the activity content of the outputs delivered may vary so much between clients/cases that it may be impossible or impracticable to define output groups for which the activity content is identical. In the DRG system discussed in Box 3.1, for example, even when output types (DRGs) are defined more narrowly—as in the hip fracture example described there—a degree of heterogeneity remains. For such types of services, whatever definition of output types is adopted, output quantity measures will inevitably be affected to some extent by an additivity problem—a problem, in other words, of adding apples and pears.

Box 3.1.Heterogeneity and the definition of outputs

The heterogeneity problem may be reduced if, and to the extent that, it is possible to pre-classify clients or cases with greater precision in such a manner as to provide greater predictability about the type and extent of activity which each will require. What this involves, in essence, is the development of a more “fine-grained” classification of product types (that is, of outputs). A good example of this can be found in the so-called “diagnosis related group” (DRG) health output classification system. As outlined in Chapter 1, DRG output categories—originally designed as a performance measurement tool—are now used in many countries as the basis for a “purchaser-provider” form of performance budgeting. Patients are classified into groups (DRGs) depending primarily upon the type of medical problem for which they are being treated, and a price is determined for each DRG category. Thus, for example, hospitals might receive $x for every hip fracture patient they treat, and $y for every hernia patient. The DRG classification system became progressively more detailed over the years following its inception precisely in order to reduce the degree of heterogeneity within each DRG (output) category. Thus, for example, in the earliest versions of DRG classifications, all hip fracture patients were categorized under the same DRG. However, because there was great variability in the extent and nature of the treatment required by different hip fracture patients, research was undertaken to identify patient characteristics which were good predictors of a more intensive treatment requirement. It was established that the two most important predictors were (a) whether the patient was an aged person or not and (b) whether the patient had other significant acute health problems. 
As a result of this, the single hip fracture DRG was, in later versions of the DRG output classification systems, broken into three different DRGs—one for aged hip fracture patients, one for hip fracture patients with secondary complications, and a third group for all remaining hip fracture patients. This reduced the heterogeneity problem greatly, although it did not eliminate it entirely. Unfortunately, however, there are limits to how far this approach can be applied across the public sector. One limitation is information costs—the more fine-grained the classification of outputs, the more costly it is to obtain reasonably accurate output costing (see Chapter 4). The other limitation arises from the fact that it is sometimes very difficult to distinguish in advance case/client characteristics which will lead to differing required levels of activity.
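The pre-classification logic described in this box can be sketched as a simple grouping function. The thresholds, group labels, and prices below are invented for illustration; real DRG groupers use far more elaborate clinical criteria.

```python
# A toy pre-classification in the spirit of the refined hip fracture DRGs
# described above. All labels, thresholds, and prices are hypothetical.

def hip_fracture_drg(age, has_acute_comorbidity):
    """Assign a hip fracture patient to one of three illustrative DRGs."""
    if has_acute_comorbidity:
        return "hip_fracture_with_complications"
    if age >= 70:  # hypothetical "aged" threshold
        return "hip_fracture_aged"
    return "hip_fracture_other"

# Finer-grained groups make expected treatment intensity, and hence the
# funding price attached to each case, more predictable within each group.
prices = {
    "hip_fracture_with_complications": 14_000,
    "hip_fracture_aged": 11_000,
    "hip_fracture_other": 7_000,
}

patient_payment = prices[hip_fracture_drg(age=82, has_acute_comorbidity=False)]
print(patient_payment)  # 11000
```

The design choice at work is the one the box describes: splitting one heterogeneous category into several groups, each internally more homogeneous, so that a single price per group becomes a defensible funding basis.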

There is, in addition to heterogeneity, another source of uncertainty in the relationship between activities and the outputs produced by that activity which is worth specific mention in a government context. This concerns collective services where, as noted above, a single set of activities delivers outputs to multiple recipients (recall that, as noted above, it is the receipt of the service by the client, and not the delivery of the activities by the producers, which constitutes an output). The extreme case of this is services which possess the pure “public good” characteristic that the marginal cost of supplying the service to an additional client is zero. A classic example of this is the service provided by a free-to-air public (or private, for that matter) radio or television channel. Clearly if the marginal cost of additional output is zero, the relationship between the level of funding provided and the quantity of output delivered is completely indeterminate.

Measuring results

Performance measures may be defined as ratings or quantitative measures which provide information on the effectiveness and efficiency of public programs. It should be emphasized that, for reasons explained below, the terms “performance measures” and “performance indicators” are used interchangeably in this volume. Like performance information in general, performance measures fall into two categories—those which provide information about the results achieved, and those which provide information on the costs of achieving those results. Here the focus is principally upon the former (costing issues are considered in Chapter 4).

Great progress has been made in output and outcome measurement in many countries over recent decades. A recent survey of OECD countries16 (OECD, 2005) reported that 77 percent of the 27 respondent nations had taken their first government-wide output measurement initiative within the previous five years, and that half of respondents claimed to have developed a combination of output and outcome measures. There exist a number of impressive examples of best practice in performance measurement. One is the system of local government performance measurement and reporting which has been developed in Britain since the early 1980s. Under this system—managed primarily by the Audit Commission—a standardized set of performance measures for the full range of services delivered by local government (education, police, housing, waste management, social services, and so on) has been developed and is publicly reported.17 More recently, under the “Comprehensive Performance Assessment” initiative, these performance measures and other information have been used to develop a composite performance ranking system under which each local government is rated on a five-point scale from “poor” through to “excellent,”18 with this rating being made public. (This illustrates, incidentally, what is meant by “ratings”—as opposed to quantitative measures—in the definition of performance measures above.) Another outstanding example of how far performance measurement can be taken can be found in the annual New York City Mayor’s Management Report.19 In light of this type of success in developing quality performance measures, there is today little scope for arguing—as some have done in the past—that the measurement of results is too difficult and imperfect to be of significant managerial value.

Performance measurement is, nevertheless, a very imperfect science, and it is important to any systematic study of performance budgeting to appreciate its limits as well as its possibilities. The following paragraphs therefore briefly survey some of the most important constraints which limit our capacity to measure outcomes and outputs.

Outcome measurement

Three key difficulties arise in the measurement of outcomes. The first is that it is often difficult to distinguish the impact of a public program from that of external factors. Thus, for example, if the crime rate falls from one year to the next, it may be very hard to tell how far (if at all) this is due to better policing and how far to, say, broader social or economic factors outside the control of the police. For this reason, many outcome measures—such as crime rate statistics—do not even attempt to distinguish program outcomes from the impact of external factors. Another way of viewing this problem is that measuring outcomes requires knowledge of a “counterfactual”—namely, of what would have happened had the government not intervened. In some cases, one can only speculate about such counterfactuals. For example, who could say whether, in the absence of a strong army, a nation might have been subject to military aggression?

There are, however, measurement techniques which enable us in some cases to adjust to at least some degree for the impact of external factors. In the educational sphere, for example, the importance of student characteristics in determining knowledge outcomes has led to the development of so-called “value-added” measures which disaggregate student knowledge outcomes by relevant student type, on the basis of research which identifies key socioeconomic and other determinants of educational success. The idea is that by permitting, for example, comparisons of the performance of similar students in different schools, one will obtain a much better idea of the relative performance of those schools than if one compared unadjusted results.20
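The logic of a value-added adjustment can be sketched in a few lines. The linear prediction model, its coefficients, and the school figures below are all invented for illustration; real value-added systems estimate such coefficients from large student-level datasets.

```python
# Stylized value-added comparison: adjust each school's raw mean score by
# the score predicted from a student-intake index alone. All figures and
# the prediction model are hypothetical.
schools = {
    # school: (mean intake index, mean raw test score)
    "A": (0.8, 78),  # advantaged intake, high raw score
    "B": (0.2, 65),  # disadvantaged intake, lower raw score
}

def predicted_score(intake_index, base=50.0, slope=30.0):
    """Expected score given intake characteristics alone."""
    return base + slope * intake_index

value_added = {
    name: raw - predicted_score(intake)
    for name, (intake, raw) in schools.items()
}

# School A has the higher raw score, but school B the higher value-added:
print(value_added["A"])  # 4.0
print(value_added["B"])  # 9.0
```

On these invented numbers, the unadjusted comparison favors school A, while the value-added comparison favors school B—the sense in which comparing similar students across schools gives a better picture of relative school performance than comparing unadjusted results.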

The second, closely-related, difficulty is that it can be difficult to distinguish the impacts of multiple programs which have common intended outcomes. Consider, as an extreme example, the difficulties which would arise in seeking to determine the respective impact on the promotion of economic growth of the multiplicity of government services with that objective (for example, education and training, scientific research, macroeconomic policy). As this example illustrates, this problem tends to be particularly severe for high-level outcomes.

Sometimes, it is possible to use regression analysis to get a better sense of the outcomes achieved by public expenditures. The minimum prerequisite for this is a database of considerable size and (for time-series analysis) duration. However, as Pradhan (1996, p. 12) notes, there is often “a lack of consensus among these studies even about the direction of impact of key expenditure categories (for example, positive or negative impact of the share of health, education and transport spending)…[as well as] uncertainty about its magnitude.” Rarely, moreover, is it possible to use such statistical analysis to disentangle the separate contributions of distinct public programs within a broad category of expenditure.
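
The kind of regression involved can be sketched in a few lines of pure Python. The figures below are invented for illustration, not real expenditure data; the fitted slope is exactly the quantity whose sign and magnitude Pradhan notes are often unstable across studies and samples:

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (single regressor, closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# Hypothetical panel: health spending as a share of GDP (%) against
# the change in an outcome indicator. With samples this small, minor
# changes in the data can move, or even flip the sign of, the slope.
spend = [4.1, 5.0, 5.8, 6.3, 7.2]
outcome_change = [0.3, 0.9, 0.8, 1.4, 1.6]
slope = ols_slope(spend, outcome_change)
```

In practice such work uses multivariate regression over substantial cross-country or time-series datasets, with controls for external factors, which is precisely why a large database is a prerequisite.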

The third key outcome measurement challenge is that some outcomes—particularly high-level outcomes—relate to social variables for which there does not exist, even in principle, a good measure. Consider, for example, the contrast between the outcomes sought by anti-pollution programs—where pollution levels are in principle perfectly measurable—with those of public programs designed to build greater interracial tolerance. There is no good metric for the degree of interracial tolerance in a society. For variables of this type, one is required to use inherently imperfect (and sometimes highly imperfect) proxy measures—such as, in this case, measures of the incidence of race-related crime.

As noted above, utility—well-being or satisfaction—is the highest-level outcome. If we were able to measure the utility which is, or would be, generated by alternative mixes of public outputs, the problem of allocative optimization would be solved. The aim of cost-benefit analysis is, in principle, to develop benefit measures which are proxies for utility, with benefit being measured in terms of willingness to pay. Cost-benefit analysis can, at its best, yield valuable performance measures which can be a useful part of the overall performance information base of a good performance budgeting system. Unfortunately, however, benefit estimates often vary considerably and are for this and other reasons of questionable reliability. Moreover, cost-benefit analysis is an expensive process which cannot be cost-effectively applied on other than a highly selective basis.
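
The arithmetic core of cost-benefit analysis, and its sensitivity to the benefit estimates, can be shown with a minimal sketch. The program, cash flows, and discount rate below are all hypothetical:

```python
def npv(cash_flows, rate):
    """Net present value of a stream of yearly net benefits
    (benefits minus costs), with year 0 first. Discounting makes
    streams at different dates comparable, but the result is only
    as reliable as the willingness-to-pay estimates behind it."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical program: 100 of up-front cost, then five years of
# benefits valued by willingness to pay, discounted at 5 percent.
central = npv([-100] + [30] * 5, 0.05)      # positive: program passes
pessimistic = npv([-100] + [18] * 5, 0.05)  # negative: program fails
```

That the verdict flips between the central and pessimistic willingness-to-pay estimates illustrates why widely ranging benefit estimates undermine the reliability of cost-benefit results.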

The inability to measure, or obtain good proxies for, the utility of public expenditure in a single metric has far-reaching consequences. It means that there is in general no common yardstick for comparing the allocative merits of alternative expenditure options with significantly different intended outcomes. Thus the outcome measures one might be able to derive for, say, police services (based on crime statistics) and education services (student knowledge outcomes) may enable us to assess the effectiveness of those services, but they do not provide us with a means of determining their relative value to society. In general, in making allocative decisions, it is necessary to make highly subjective and informal judgments. This remains as true today as it was 40 years ago when one of the originators of program budgeting observed that “at the current state of analytic art, no one really knows with any precision how the ‘grand optimum’ [of public expenditure] might be attained…Intuition and judgment are paramount” (Fisher, 1967, p. 63). This informational problem is in fact one of the key reasons why market economies are superior to centrally planned economies21 (which is, of course, not to say that there are not, within the framework of a market economy, major areas where the market fails to various degrees, requiring government intervention in various forms).

In addition to the inherent problems of outcome measurement, measuring outcomes is often made more difficult by the ambiguous manner in which some public sector objectives are specified. It is, as noted previously, a key managing-for-results theme that intended outcomes should be made explicit. Sometimes, however, objectives are kept deliberately fuzzy because there is political and/or ideological conflict about what the objectives of the organization should be (Tankersley and Grizzle, 1994), and it is not desired to aggravate the conflict by going through a process of goal clarification.22

Output measurement

Outputs are in general less difficult to measure than outcomes. It is, in particular, usually possible to measure output quantity (although the greater the degree of heterogeneity, the less meaningful quantity measures will be). The greatest difficulties in output measurement tend to arise in respect to output quality. One reason for this is that so many public sector outputs are services rather than goods. The quality of services is a property of their constituent activities, and activities can only be observed at the time they are carried out—not, like the physical properties of goods, after the event. Heterogeneity also creates further measurement difficulties. For standardized outputs, it is possible to apply what the quality literature refers to as the “conformance to specifications” concept of quality. For a mass-manufactured homogeneous physical good, the application of this concept of quality means testing whether the physical characteristics of (a sample of) the product match the product’s intended specifications. For a standardized service, it means checking whether the activities carried out in dealing with each case/client were the prescribed ones. (Thus, if there is a “treatment protocol” which defines the best form of treatment for patients with a certain medical condition, the quality of treatment can be—at least in part—assessed by determining how far the treatment provided in any particular case accorded with what is laid down in the protocol.) However, the more heterogeneous a service—that is, the more the activities delivered are tailored to the specific needs of the client/case—the less relevant such an approach becomes, and the more difficult it is to measure quality. For many heterogeneous public services it is, indeed, less difficult to develop outcome measures than output quality measures.
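
For a standardized service, the “conformance to specifications” concept described above reduces to checking cases against the prescribed activity list. The protocol steps and cases below are invented for illustration:

```python
# Hypothetical treatment protocol: the prescribed activities for one
# standardized case type (e.g., a simple fracture).
PROTOCOL = {"triage", "x_ray", "immobilize", "follow_up"}

def conformance(case_activities):
    """Share of the prescribed protocol steps actually performed for
    one case - a conformance-to-specifications quality measure."""
    return len(PROTOCOL & set(case_activities)) / len(PROTOCOL)

full_case = ["triage", "x_ray", "immobilize", "follow_up"]
partial_case = ["triage", "x_ray"]  # later steps skipped
```

As the text notes, such a check is only meaningful while the service is standardized: once activities are tailored case by case, there is no fixed specification to conform to.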

The difficulty of measuring quality presents a particular problem for output quantity measurement in respect to the sub-set of public services for which (a) demand is unpredictably variable and (b) the response to demand fluctuations is, at least in part, to vary the amount of activity devoted to each client or case. Consider the example of police investigations of burglaries. The incidence of burglaries in a big city may be expected to vary unpredictably from month to month, and in a month with a particularly high burglary rate, the police might find their investigative resources very stretched. They may respond by working more overtime, but part of the response would almost inevitably be to devote less investigative activity to each burglary than they would in quieter times. What this means is that the quality of each investigation is being varied as a rationing technique. This type of variation of the level of service as a rationing technique—what might be called quality rationing—makes output quantity measures a less meaningful indicator of performance than they might otherwise be.

Quality measurement problems are one reason why activity measures can play an important role as proxies for, or supplements to, output measures for many public services. In the context of police criminal investigations work, for example, this would mean accompanying measures of the numbers of each type of crime which were investigated (an output measure) with measures of the amount of investigative activity per investigation (an activity measure).
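
The pairing of an output measure with an activity measure can be sketched as follows. The monthly figures are hypothetical, chosen to show the quality-rationing pattern described above:

```python
def investigation_stats(cases_investigated, total_hours):
    """Pair an output measure (burglaries investigated) with an
    activity measure (hours of investigative work per case)."""
    return cases_investigated, round(total_hours / cases_investigated, 1)

# In the busy month, the output count rises, but hours devoted to each
# investigation fall - quality rationing, which the output count alone
# would entirely miss.
quiet_month = investigation_stats(cases_investigated=200, total_hours=1600)
busy_month = investigation_stats(cases_investigated=320, total_hours=1760)
```

Read together, the two measures reveal what neither shows on its own: the apparent rise in output was achieved partly by thinning the service delivered per case.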

The above summarizes only the most important overarching limits on performance measurement. The key point is that performance measures are by their nature imperfect. It is precisely for this reason that it was suggested earlier that the distinction which some draw between performance measures and performance indicators—a distinction in which the latter term refers to imperfect proxies for the former—is not very useful (HM Treasury, 2001, pp. 29-30). For this reason, the terms are used synonymously in this volume. The imperfection of measurement means that performance measures often do not provide conclusive evidence of whether an organization is performing well or badly. Suppose, for example, that an agency’s outputs increase this year even though its budget has not increased. This might be because it has become more efficient. But it might be because it has this year had cases/clients which are on average less complex than last year, or because it has reduced the level of service being provided to clients. Further investigation—desirably involving supplementary performance measures—will be necessary to determine the explanation.

It is often said, correctly, that performance measurement tends to be more difficult in the public sector than in the private sector. The above analysis indicates many of the reasons why this is the case: the general government sector produces mainly services rather than physical goods. Many of the services it provides are heterogeneous, or are contingent capacity in nature. Indeed, it can be argued that these characteristics are fundamental reasons—not fully captured in the traditional textbook taxonomy of market failures—why many public services are provided by the government and not by the market. And of course non-market provision of services means that there are no financial performance measures such as profit/loss and share price.

The degree of performance measurement difficulty does, however, vary enormously between different types of government agency depending on the types of outputs being produced and the types of outcomes being pursued. Similar variations are also to be found in the private sector, from producers of standardized mass-manufactured products at one end of the spectrum to producers of complex and heterogeneous services (for example, consulting, legal services) at the other end of the spectrum.

Program evaluation
The limits of performance measurement, and of what performance measures can tell us in isolation, mean that assessing the effectiveness and efficiency of programs requires that performance measures be combined with analysis. Analysis is required both in the interpretation of performance measures, and in extending the assessment of program performance beyond the limits of the information which measures alone can provide. Systematic, formal analysis of this type is commonly referred to as evaluation. A broad definition of evaluation is that offered by the OECD: “evaluations are analytical assessments addressing results of public policies, organizations or programmes, that emphasize reliability and usefulness of findings.” This definition covers both ex-post evaluations of the performance of actual programs, and analysis to develop new programs (“ex-ante evaluations”) (OECD, 1998, p. 5). Others, however, reserve the term “evaluation” exclusively for ex-post assessments.23

Evaluation necessarily makes use of performance measures, and good evaluation requires good performance measures. The role of analysis is, however, fundamental. Thus a key part of evaluation is consideration of the policy logic of the program in the light of relevant theory—in other words, an assessment of whether theory suggests that one should expect the program to achieve its intended outcomes. For example, the evaluation of an anti-smoking health promotion program should make use of the insights of psychology and marketing theory about what works (and what does not) in changing human behavior. The greater the degree to which outcome measures are ambiguous or lacking, the greater the importance of the analysis of policy logic. Part of such analysis should be consideration of whether there is a compelling theoretical rationale for government to provide a particular service, rather than leaving it to the market (in particular, by reference to the well-developed market failure criteria of economic theory).

Analysis is also used in evaluation to interpret performance measures. This is true in dealing with the common outcome measurement problem mentioned above: that available outcome measures do a poor job of distinguishing the actual outcome of the program from the impact of “external” factors. When this is the case, analysis, using data on what is happening to the relevant external factors, is likely to be an essential tool for forming a judgment about the effectiveness of the program concerned.

Over the post-war period, enthusiasm for evaluation has waxed and waned, and the approach to evaluation has changed significantly. From the 1960s into the 1970s, the worldwide influence of the original US program budgeting system (PPBS) led to enthusiasm for systematic and comprehensive “scientific” evaluation. The idea, as mentioned in Chapter 1, was to make full use of all the tools of economic and social research—cost-benefit analysis, cost-effectiveness analysis, and information science generally. In 1978 one US observer noted that “over the past 15 years, a new industry has been created to evaluate the performance of government programs,” with the federal government alone spending $200 million a year on evaluations (Wholey, 1978, p. 52). By that time, however, there was already a “growing disillusionment with social science evaluation” in government (Floden and Weiner, 1978, p. 367). In part, this was because it became clear that the claims made for scientific evaluation were exaggerated because, for example, it was difficult to handle uncertainty and intangible costs and benefits (Toulemonde and Rochaix, 1994, p. 49). In addition, precisely because of their thorough, “scientific” quality, evaluation reports often took so long to produce that they lost their timeliness and were “too costly and time-consuming compared to their real use and effect” (OECD, 1998, p. 3). Bounded rationality was the other key problem—the mass of analytic paperwork produced under PPBS was far too great for decision-makers to use. As a result, support for program evaluation fell, and by the 1980s in the US the large “program evaluations divisions” which had existed in every government department had “all but vanished” (Weinstock, 2003).

Evaluation has enjoyed subsequent waves of popularity in certain countries. In the US, evaluation requirements were part of the Government Performance and Results Act of 1993, leading to some renewal of evaluation activity, although not on the scale of the PPBS era. In Australia, a system of mandatory evaluation, in which departments were required to evaluate all their programs over a five-year cycle, was introduced in the mid-1980s, as part of a program budgeting system, and lasted till the mid-1990s (Mackay, 2004; Johnston, 1998; Department of Finance and Australian Public Service Board, 1987; Department of Finance, 1994). In Chile, program evaluation has been a core part of the performance budgeting systems in recent years (see Chapter 13). More recently, Canada embarked in 2006 on a project to re-engineer its budgetary expenditure priority-setting processes, and to make a revitalized system of program evaluation a central part of the new system. In these and other countries, however, the type of evaluation being employed tends to differ greatly from the practices of the earlier PPBS period. The contemporary emphasis is more on timeliness, relevance and cost-effectiveness in evaluation methodology, and less on comprehensive “scientific” methodology.

A key informational issue for performance budgeting remains, therefore, that of the appropriate role of program evaluation in budget decision-making, and of the nature of that evaluation. There are those today who think of performance budgeting as being exclusively concerned with the use of performance measures in budgeting. However, given the limits on performance measures in isolation, there is a strong argument for viewing measures as one element of a broader set of performance information which also includes program evaluation. A key challenge is then to ensure that evaluation is practical, useful and cost-effective.

Implementing performance information systems

As indicated at the outset, this chapter is not intended to serve as a manual for building a performance information system to support the introduction of performance budgeting. Nevertheless, some key points about implementation strategy may usefully be made.

The essential first step, before any attempt is made to identify indicators to be measured or to establish evaluation processes, must always be the explicit specification of the key objectives which the government is seeking to achieve. Expressed differently, it is necessary to clearly state the desired outcomes for each of the major categories of government outputs. The articulation of intended outcomes is an exercise which can usefully be combined with the development of a draft program structure for each ministry/agency.

It is necessary to be realistic both about how much information is to be provided, and about the timeframe for the development of the performance information system. The costs of performance information make it essential that, even in the most fully developed performance information system, there be selectivity in the choice of performance data which are to be gathered.24 When embarking for the first time upon the development of performance indicators, even greater selectivity is necessary. It is better initially to identify a small number of indicators which each agency will measure than to develop long wish-lists of potentially useful indicators. Once experience is gained with a small number of key indicators, the number of measures can be expanded. Selectivity is even more important in countries with greater-than-normal capacity and resource constraints. As for timing, it is essential to recognize that building even a simple performance information system is a project which will take some years.

Performance data can be drawn from a range of sources. The most readily accessible tend to be agency records. Such records will, as a rule, predominantly relate to the volumes of activities carried out and outputs delivered. National statistics, depending upon their scope, can be of considerable value. However, measuring outcomes and output quality is, in general, a more demanding task which cannot be based upon agency records and will require the use of more sophisticated and costly techniques (such as consumer satisfaction surveys). It is therefore entirely appropriate to focus, in the first stage of the development of performance indicators, mainly upon the development of good measures of output quantity and activity. The second stage will generally focus upon the bringing together of results information and accounting systems in order to develop measures of the costs of programs and key outputs (see the next chapter), while recognizing that cost measures will inevitably be very approximate to start with. Only after these stages is it appropriate to make a systematic attack on the more challenging areas of outcomes and output quality.
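
The second-stage measure described above is, at its simplest, a unit-cost calculation joining accounting data to output counts from agency records. The program and figures below are hypothetical, and the measure is deliberately crude, as the text warns early cost measures will be:

```python
def unit_cost(program_cost, output_quantity):
    """Approximate unit cost of an output: total program cost from the
    accounting system divided by the output quantity recorded in agency
    records. Crude by construction: shared overheads, case complexity,
    and quality differences between units of output are all ignored."""
    return round(program_cost / output_quantity, 2)

# Hypothetical inspection program: 450,000 of recorded program cost
# against 3,000 inspections completed.
cost_per_inspection = unit_cost(program_cost=450_000, output_quantity=3_000)
```

Even this rough figure, tracked over time, can reveal efficiency trends long before the more demanding outcome and quality measures of the later stages become available.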

A key part of any performance information system—no matter how limited or ambitious its scope—is the existence of appropriate mechanisms for the verification and attestation of that information. If performance indicators are not subject to external checks, data manipulation will be the inevitable result. Verification and attestation require appropriate internal control procedures, and also good external audit. The role of external audit in relation to performance budgeting is discussed in Chapter 6.

The development of performance information systems requires significant resourcing and technical support. Collection and processing of data must be computer-based. Appropriately trained professional staff are also essential. Over time, as the scope of the performance information system expands, so will the capacity required to underpin it. The caliber of professional staff needed to develop output costing, or outcome and output quality measures, is, for example, considerably greater than that required for the measurement of output and activity quantities. The conduct of program evaluations also requires adequately trained staff.

Performance budgeting in the government-wide budget requires that the Ministry of Finance and key political leaders receive performance information which is relevant and useful to them in making budgetary decisions. Given this, the selection of performance measures should not be left to the line agencies alone—the result is likely to be the development of a set of uninformative measures which protect the agency from scrutiny rather than facilitating it. So the center must involve itself intimately in the process of defining what type of performance information it is to be provided with. Indeed, close central involvement must precede this, and start with the process of defining the intended outcomes of key government outputs. At the same time, it is quite inappropriate for the Ministry of Finance to simply dictate to agencies what indicators will be developed. It may sound like a platitude, but it is essential that agencies and the individuals who work within them “own” the performance indicators which are used—that is, that they accept their relevance as measures of the results which they achieve. As discussed in Chapter 18, the motivational power of performance information depends in part on such an acceptance and recognition of its validity.

Conclusion
Performance information on the results achieved by public programs is a fundamental prerequisite of performance budgeting, and also of performance management more generally. Clarity about the basic performance concepts is important to avoid unnecessary analytic confusion. Because performance budgeting aims to link funding and results, an understanding of key features of the production chain is important. Three particularly relevant aspects of the production chain for performance budgeting are the impact of external factors for outcomes, the unpredictable time lags which characterize many outcomes, and the heterogeneous nature of many public sector outputs. Performance measurement has advanced far over recent decades in many countries, as has the use of performance measures for a range of purposes. There are, however, major limits to what may be measured. One of the functions of evaluation is to apply a range of analytic techniques to improve on the performance information offered by measures alone.

In the development of a performance information system, it is essential to take a staged approach, starting with the explicit delineation of the intended outcomes of key government services. Initially, performance measurement should focus on the development of a small set of activity and output quantity measures, and only over time should governments go beyond that to output costing measures, program evaluation, and outcome and quality measures. Above all, it is essential to recognize that it is impossible to build performance information systems in a year or two. By its nature, the development of performance information for performance budgeting—and for other “managing-for-results” uses—is a medium- to long-term project. It is, however, a project which can yield some benefits relatively quickly, and others progressively over time.


The author would like to thank Peter C. Smith for his valuable comments on an earlier draft of this chapter.

The activity concept is the basis for “activity-based costing” (ABC), which is briefly discussed in Chapter 4.

Other terminology is often used to make the same distinction—proximate outcomes have sometimes been referred to as “intermediate” outcomes, and high-level outcomes as “final,” “ultimate,” or “end” outcomes. The distinction between high-level and proximate outcomes is, however, often a matter of degree, and it is better to think not of a clear-cut categorization of outcomes but of a continuum of outcomes ranging from the most proximate to the most high-level.

See, for example, UNDP Evaluation Office (2002), USAID (1983). The problem with this is that it can be difficult, at the margins, to distinguish between impacts and outcomes.

The extreme case of which is pure public goods, with the property of non-rivalness (zero marginal cost of serving additional users).

And also because the teaching activity requires other support activities to be capable of achieving the intended knowledge outcome.

It should be noted that in this definition of outputs, “external” parties generally means external to the agency rather than external to the government as a whole. The implication of this is to classify as outputs not only services provided to the public, but also services provided by one government agency to another government agency. The services which a finance ministry provides when regulating and overseeing the fiscal management of other ministries are, for example, considered to be an output, even though no direct service is being provided to the public.

If the distinction between outcomes and output quality is not made, one is forced, for example, to brand as “low quality” the treatment received by any road accident patient who dies, even if everything possible was done promptly to save the patient.

For completeness, some add a third concept of “economy,” which reflects “whether the right price was paid to acquire the necessary inputs” (HM Treasury, 2001).

If it is possible to produce the same output with less of one input, while holding constant the amount of other inputs, there is X-inefficiency.

Formally, that the marginal rate of substitution between inputs equals the ratio of input prices.

Allocative efficiency in public expenditure can also be construed more broadly to embrace in addition establishing the balance between publicly and privately-provided outputs most likely to maximize well-being.

Sometimes also known as “contextual” or “confounding” factors.

Holding the technology and degree of efficiency constant.

In fact, in some cases heterogeneity may reflect objectives which imply increased outcome disparity. This is the case, for example, in respect of educational programs which give additional attention to gifted students.

Plus two observer countries, Israel and Chile.

More generally, Jacobs et al. (2006) distinguish three principal methodologies for taking into account the impact of external factors. The first is to compare outcomes of organizations which are confronted with the same, or a very similar, external context. The second is to model the external factors explicitly, analogously to factors in the production process (for example, through regression analysis applied to substantial datasets). The third is “risk adjustment,” a term which they use to refer to the process of defining output categories in a more detailed way to reduce heterogeneity.

There is no need to conduct cost-benefit analysis of products delivered by the market (serious market imperfections aside), because these judgments are in effect made by consumers and other decentralized decision-makers. During the first half of the twentieth century, a number of economists of the “socialist calculation” school (including, most notably, Oskar Lange) argued that a centrally planned economy could be at least as allocatively efficient as a market economy—as well as having other advantages (particularly in macroeconomic stability)—if government planners solved the allocative optimization problem mathematically. This school was so influential that these techniques (especially linear programming) eventually became the centerpiece of Soviet economic planning. As Friedrich von Hayek in particular pointed out, the most fundamental defect with this line of argument was that it is in practice impossible for central planners to obtain (or, even if they could obtain it, to use) the vast amount of information which such planning would require.

This is part of a broader issue of “goal ambiguity” which has received some attention among public administration scholars (see, for example, Rainey, 1993).

For example, the British Treasury defines evaluation as “retrospective analysis of a project, program, or policy to assess how successful or otherwise it has been, and what lessons can be learnt for the future” (HM Treasury, 1997, p. 97).

As Mackay (2004, p. 17) notes, “Another common mistake when designing a performance system is to over-engineer the information—in terms of both quantity and quality—which it will provide. This mistake can be made by having an overly-comprehensive system of performance indicators, or by commissioning a large and costly number of sophisticated evaluations. In many countries this has proved not only wasteful but also counterproductive. Those responsible for providing the M&E information will have no incentive to produce quality information of a timely nature if they perceive the information is not being fully used. And if the quality of information starts to decline, this can further undermine the demand for it.”

References
    Department of Finance, 1994, The Use of Evaluation in the 1994-95 Budget—Issues for Discussion (Canberra: Australian Government Publishing Service).

    Department of Finance and Australian Public Service Board, 1987, Evaluating Government Programs: Financial Management Improvement Program (Canberra: Australian Government Publishing Service).

    Fisher, G.H., 1967, “The Role of Cost-Utility Analysis in Program Budgeting,” in Program Budgeting: Program Analysis and the Federal Budget, ed. D. Novick (Cambridge, Mass.: Harvard University Press).

    Floden, R.E., and S.S. Weiner, 1978, “Rationality to Ritual: The Multiple Roles of Evaluation in Government Processes,” reprinted in Public Budgeting, 1982, ed. F.J. Lynden and E.G. Miller, 4th edition (Englewood Cliffs, NJ: Prentice-Hall), pp. 366-80.

    HM Treasury, 1997, Appraisal and Evaluation in Central Government: Treasury Guidance (London: Stationery Office).

    HM Treasury, 2001, Choosing the Right Fabric: A Framework for Performance Information (London: HM Treasury).

    Jacobs, R., P. Smith, and A. Street, 2006, Measuring Efficiency in Health Care: Analytic Techniques and Health Policy (Cambridge: Cambridge University Press).

    Johnston, J., 1998, “Strategy, Planning, Leadership, and the Financial Management Improvement Plan: The Australian Public Service 1983-1996,” Public Productivity & Management Review, Vol. 21(4) (June), pp. 352-69.

    Leibenstein, H., 1966, “Allocative Efficiency vs. ‘X-Efficiency’,” American Economic Review, Vol. 56 (June), pp. 392-415.

    Mackay, K., 2004, Two Generations of Performance Evaluation and Management System in Australia, ECD Working Paper Series No. 11 (Washington: World Bank).

    OECD (Organization for Economic Cooperation and Development), 1998, Best Practice Guidelines for Evaluation, PUMA Policy Brief No. 5 (Paris: OECD).

    OECD, 2005, Performance Information in the Budget Process: Results of OECD 2005 Questionnaire, GOV/PGC/SBO(2005)6 (Paris: OECD).

    Pradhan, S., 1996, Evaluating Public Expenditure: A Framework for Public Expenditure Reviews, World Bank Discussion Paper 323 (Washington: World Bank).

    Rainey, H.G., 1993, “Toward a Theory of Goal Ambiguity in Public Organizations,” pp. 121-66 in Research in Public Administration, Vol. 2, ed. J.L. Perry (Greenwich, Conn.: JAI Press).

    Sen, A., 1992, Inequality Reexamined (Oxford: Clarendon Press).

    Tankersley, W., and G. Grizzle, 1994, “Control Options for the Public Manager: An Analytic Model for Designing Appropriate Control Strategies,” Public Productivity and Management Review, Vol. 18(1), pp. 1-17.

    Toulemonde, J., and L. Rochaix, 1994, “Rational Decision-Making through Project Appraisal: A Presentation of French Attempts,” International Review of Administrative Sciences, Vol. 60, pp. 37-53.

    UNDP (United Nations Development Programme) Evaluation Office, 2002, Handbook on Monitoring and Evaluating for Results (New York: United Nations).

    USAID (United States Agency for International Development), 1983, The Logical Framework: Modifications Based on Experience, Program Methods and Evaluation Division, Bureau for Program and Policy Coordination (Washington: USAID).

    Weinstock, M., 2003, “Under the Microscope,” Government Executive, January, pp. 37-40.

    Wholey, J.S., 1978, Zero-Base Budgeting and Program Evaluation (Lexington, Mass.: Lexington Books).
