## Introduction

**4.1** To construct a perfectly accurate consumer price index (CPI), the price statistician would need to record the price of every variety of all the goods and services that are in the scope of the CPI. Because it is too costly and, in practice, impossible to regularly record all the prices of the universe in a timely manner, sampling techniques are used to select a subset of prices that eventually enter the index compilation. Consequently, a CPI compilation is based on samples.

**4.2** Sampling occurs on several different levels in the CPI:

Geographic (locations) and outlet samples: all places and outlets where a product is sold

Product samples: all goods and services available for purchase

Time: the subperiods of the index

**4.3** In practice, CPI sampling follows a multistage approach. The universe of products is structured by first selecting the items within the different categories of the expenditure classification. For each item of the CPI classification, one or more representative varieties can then be sampled. For the geographic and outlet samples, specific locations for price collection are selected first. In a second step, outlets are selected within the sampled locations. The specific varieties to be priced are ideally selected within the sampled outlets. Finally, time can be considered as another level of sampling as it must be decided at which moment during the reference month prices are observed.

**4.4** Within each of the different sampling stages, either probability or nonprobability sampling methods can be considered, possibly in combination with stratification. If the sampling frames required for probability sampling are not available, the price statistician either creates one or relies on nonprobability sampling techniques. For this reason, nonprobability sampling techniques are commonly used to draw samples for price collection in the CPI. However, the use of some form of probability sampling is generally the preferred option as it avoids the need for arbitrary decisions and ensures unbiased results. In addition, there are practical considerations when organizing the price collection in the field. In practice, different sampling approaches may be adopted for different parts of the CPI basket. It is thus essential that sampling procedures are clearly defined and well documented.

**4.5** When sampling designs are planned, the full universe of locations, outlets and outlet types, items, and varieties belonging to the scope of the CPI should be considered. All significant parts of that universe should be appropriately represented. An additional challenge is that representativity is not static but evolves over time. The samples that were initially designed for the price reference period may not be fully representative of the current period. Therefore, samples should be continuously monitored and updated as needed. Chapter 7 describes in more detail the challenges of a dynamic target universe.

**4.6** The sample design should support the publication of the detailed subindices that have been agreed for dissemination, such as regional indices or separate subindices for urban and rural areas. Sample designs should be efficient, maximizing sampling precision while minimizing the costs of fieldwork and processing. Even without any formal variance estimation (as described in paragraphs 4.83–4.94), sample allocation should be optimized by taking into account the weight and the magnitude of the price change variance of the subindices.

## Sampling Techniques

**4.7** In survey sampling theory,^{1} there is a distinction between the parameter and the estimator. In the context of a CPI, the parameter is the target price index number that is based on prices and quantities of the products that belong to the universe. The estimator is the price index that is actually compiled using the sampled data as input. The result of the estimator depends on the price index formula that may or may not use weights, and on the sampling scheme that has been adopted for selecting the varieties for which prices are collected. In practice, the parameter is unknown although simulations can be conducted, for example, using scanner data to study the performance of different sampling strategies.

**4.8** In assessing the quality of a sample estimator (that is, how well it estimates the parameter), two measures are often considered in the case of probability sampling. The first measure is the bias of the estimator, which is the difference between the universe parameter and the average of the estimator over all possible samples that could be drawn under the specified sample design. An estimator is unbiased if it has zero bias. The second measure is the variance of the estimator with respect to the sampling distribution.

**4.9** An estimator is considered accurate if both its bias and variance are small; that is, the estimator is on average very close to the parameter and does not vary significantly from its mean. The mean square error, defined as the sum of the variance and the squared bias, measures the accuracy of the estimator. In addition, samples should be efficient, so that the maximum sampling precision (and minimum variance) can be obtained for the minimum cost with regard to fieldwork and processing. If bias is a more relevant problem than sampling error, the deterioration in precision caused by small samples can be offset by spending sufficient resources to select and maintain representative samples.
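The decomposition described in paragraph 4.9 can be written compactly as follows. The symbols are assumptions introduced here for illustration: *I* denotes the population parameter and *Î* the sample estimator, with expectation and variance taken over the sampling distribution.

$$
\operatorname{MSE}(\hat{I}) = \mathrm{E}\big[(\hat{I} - I)^{2}\big] = \operatorname{Var}(\hat{I}) + \big(\mathrm{E}[\hat{I}] - I\big)^{2}
$$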

**4.10** There are two approaches that can be used to sample units: probability sampling techniques and nonprobability sampling techniques. Probability sampling techniques require that the probability of a unit being selected as part of the sample is known in advance and is strictly positive for each unit. Units can represent the locations and the outlets where households shop or the items and varieties that households buy. Nonprobability sampling techniques do not rely on sampling probabilities. In practice, a combination of probability sampling and nonprobability (or purposive) sampling is used at the various sampling stages of a CPI. These techniques are often applied together with stratification.

### Probability Sampling Techniques

**4.11** In probability sampling, a sample of *n* units from the universe of *N* units is selected by attaching a nonzero inclusion probability *π _{j}* to each unit *j*. The inclusion probabilities *π _{j}* for each unit in the sample are assumed to be strictly positive and known in advance. This requires the availability of a sampling frame, that is, a list of all the units eligible to be sampled. A frame may suffer from overcoverage if it includes units that are not in the universe or includes duplicates of units. It may have undercoverage if units in the universe are missing from the frame.

**4.12** In simple random sampling (SRS) and systematic sampling, each unit is sampled with equal probability and *π _{j} = n/N*. In SRS (without replacement), a random number is assigned to each unit in the frame and the *n* highest (or lowest) values are selected. In systematic sampling, the sampling units are selected at equal distances from each other in the frame, with random selection of only the first unit. These techniques are usually recommended in situations where the units are relatively homogeneous.
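As a minimal sketch of the two equal-probability designs in paragraph 4.12, the following Python functions draw an SRS by ranking random numbers and a systematic sample from a random start. The function names, the seeded `random.Random` generator, and the fractional step *N/n* are illustrative assumptions rather than a prescribed procedure.

```python
import random


def simple_random_sample(units, n, rng):
    """SRS without replacement: tag each unit with a random number, keep the n lowest."""
    tagged = [(rng.random(), unit) for unit in units]
    tagged.sort(key=lambda pair: pair[0])
    return [unit for _, unit in tagged[:n]]


def systematic_sample(units, n, rng):
    """Equal-probability systematic sampling: random start, then a fixed step of N/n."""
    step = len(units) / n
    start = rng.random() * step  # only the first unit is chosen at random
    last = len(units) - 1
    # min() guards against a floating-point result that lands on index N
    return [units[min(int(start + k * step), last)] for k in range(n)]


frame = list(range(20))      # hypothetical frame of 20 unit identifiers
rng = random.Random(0)       # seeded only to make the sketch reproducible
srs = simple_random_sample(frame, 5, rng)
systematic = systematic_sample(frame, 5, rng)
```

With a frame of 20 units and *n* = 5, the systematic sample always consists of units that are exactly four positions apart, whereas the SRS can contain any five distinct units.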

**4.13** In probability proportional to size (PPS) sampling, the inclusion probability *π _{j}* is proportional to the value of an auxiliary size variable *x _{j}*, that is, *π _{j} = n x _{j} / Σ _{i∈N} x _{i}*. Units for which the inclusion probability is greater than one are selected with certainty, while the inclusion probabilities for the remaining units are calculated after excluding the large ones. This technique is recommended if the units are of different sizes, so that larger units have a greater probability of selection than smaller units.
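The treatment of certainty units in paragraph 4.13 can be sketched as an iterative calculation. The function name and the example size vectors below are hypothetical. The sketch assumes strictly positive sizes and treats a probability of exactly one as certainty.

```python
def pps_inclusion_probabilities(sizes, n):
    """Inclusion probabilities proportional to size, capped at 1.

    Units whose raw probability reaches 1 are taken with certainty. The
    probabilities of the remaining units are then recomputed for the
    remaining sample slots, and the check is repeated until none exceeds 1.
    """
    probs = [0.0] * len(sizes)
    remaining = set(range(len(sizes)))
    slots = n
    while slots > 0 and remaining:
        total = sum(sizes[j] for j in remaining)
        certain = {j for j in remaining if slots * sizes[j] / total >= 1.0}
        if not certain:
            for j in remaining:
                probs[j] = slots * sizes[j] / total
            break
        for j in certain:
            probs[j] = 1.0
        remaining -= certain
        slots -= len(certain)
    return probs


# Hypothetical frame with one dominant unit: it is selected with certainty,
# and the two remaining sample slots are shared among the five small units.
skewed = pps_inclusion_probabilities([50, 10, 10, 10, 10, 10], 3)
# Hypothetical frame without any certainty unit.
even = pps_inclusion_probabilities([13, 2, 5, 9, 1, 25, 10, 6, 11, 8], 3)
```

In both cases the inclusion probabilities sum to the sample size, which is what makes the expected number of selected units equal to *n*.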

**4.14** In the context of a CPI, the auxiliary variable is typically the household expenditure on the goods and services covered by the CPI. In practice, this variable is often not available and alternative variables correlated to expenditure must be used, such as retail turnover, population of geographic areas, or number of outlets. The extent to which this assumption about the alternative variables holds true eventually determines the quality of the sampling design.

**4.15** In a CPI, the size of a sample is often fixed a priori. It is therefore impractical to let the sample size be randomly determined by the sampling procedure. Several techniques exist for drawing samples of a fixed size, for example, systematic sampling or simple random sampling.

**4.16** In systematic PPS sampling, the list of units is first ordered randomly and the cumulative total of the auxiliary variable *x _{j}* is calculated. Selection of units then takes place using interval sampling with the interval value calculated by dividing the cumulative total *Σ _{i∈N} x _{i}* by the sample size *n*. This is done by generating a random starting point between zero and the interval value to select the first unit. The second random number is generated by adding the interval value to the starting point and used to select the second unit. This process of adding the interval value to the previous random number, and selecting the corresponding units, is repeated until the requisite number of units has been sampled. If the size of a unit is larger than the interval value, then it is selected with certainty. These units are then removed from the sampling frame and the process of cumulating the size variable and compiling an interval value is repeated with the remaining units.

**4.17** Systematic sampling is best explained using an example. Table 4.1 shows how a sample of three outlets can be drawn from 10 outlets. In a first step, the outlets are listed in a random order (column A). Although turnover, or total gross sales of the establishment, would be the preferred selection variable, it is often not available in the sampling frame. An alternative might be the use of the number of employees as a proxy for turnover. The table includes the cumulative sizes and the inclusion intervals (columns C and D). Taking the cumulated (or total) size measure, which is 90 in this case, and dividing it by the sample size, 3, gives a sampling interval of 30. Next, a random number between 1 and 30 is chosen, say 25. The sample will then consist of the outlets whose inclusion intervals cover the numbers 25, 25 + 30 = 55 and 25 + 2 × 30 = 85 (column E).

**Table 4.1. Systematic Sample of 3 out of 10 Outlets, Based on PPS Sampling**

Outlet | Number of Employees (x) | Cumulative x | Inclusion Interval under PPS | Included when Starting Point is 25 |
---|---|---|---|---|
(a) | (b) | (c) | (d) | (e) |
Outlet 1 | 13 | 13 | 0–13 | |
Outlet 2 | 2 | 15 | 14–15 | |
Outlet 3 | 5 | 20 | 16–20 | |
Outlet 4 | 9 | 29 | 21–29 | X |
Outlet 5 | 1 | 30 | 30 | |
Outlet 6 | 25 | 55 | 31–55 | X |
Outlet 7 | 10 | 65 | 56–65 | |
Outlet 8 | 6 | 71 | 66–71 | |
Outlet 9 | 11 | 82 | 72–82 | |
Outlet 10 | 8 | 90 | 83–90 | X |
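The selection in Table 4.1 can be reproduced with a short Python sketch. The function name and its `start` parameter are hypothetical. The sketch assumes that the frame is already in random order and that no unit is larger than the sampling interval, so the removal of certainty units is not handled here.

```python
import random


def systematic_pps_sample(sizes, n, start=None, rng=None):
    """Return the positions selected by systematic PPS sampling.

    sizes is assumed to be in random order already. A unit is selected when
    a selection point falls inside its inclusion interval, that is, above the
    previous cumulative total and at or below its own cumulative total.
    """
    interval = sum(sizes) / n
    if start is None:
        rng = rng or random.Random()
        start = rng.random() * interval
    points = [start + k * interval for k in range(n)]
    selected, cumulative, next_point = [], 0.0, 0
    for position, size in enumerate(sizes):
        cumulative += size
        while next_point < n and points[next_point] <= cumulative:
            selected.append(position)
            next_point += 1
    return selected


# Number of employees for outlets 1 to 10, in the order of Table 4.1.
EMPLOYEES = [13, 2, 5, 9, 1, 25, 10, 6, 11, 8]

positions = systematic_pps_sample(EMPLOYEES, 3, start=25)
outlets = [p + 1 for p in positions]  # -> [4, 6, 10]
```

The selection points 25, 55, and 85 fall in the inclusion intervals of outlets 4, 6, and 10, as marked in column (e) of the table.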

**4.18** PPS sampling has the advantage of selecting the sample in proportion to the relevant variable. It yields a sample that reflects the heterogeneity of the population, and the sample does not need to be rebalanced by reweighting. By contrast, if each unit is given an equal chance of selection, reweighting may be necessary. For example, SRS would give a large outlet the same chance of selection as a small independent shop, despite the enormous differences in turnover, and so reweighting would be needed.

**4.19** Assuming that the prices used to compile the index have been obtained using a specific probability sampling design, it can then be shown how common price indices are estimators for certain population price indices (see Balk [2005] for details). If the size variable in PPS sampling corresponds to the expenditure on the variety in the base period, then the sample Jevons price index is an approximately unbiased estimator for the population geometric price index. Similarly, if the size variable corresponds to the quantities in the base period, then the sample Dutot price index is an approximately unbiased estimator for the population Laspeyres price index. These results suggest that varieties should be sampled based on expenditures if using a Jevons index whereas for a Dutot index, varieties should be sampled based on sold quantities. Finally, the Carli index is an unbiased estimator for the population Laspeyres price index if the size variable corresponds to the expenditures in the base period (the different index formulas are discussed in Chapters 1 and 8).
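For reference, the three unweighted elementary indices named in paragraph 4.19 can be sketched in Python using their usual definitions: the Jevons index as the geometric mean of price relatives, the Dutot index as the ratio of arithmetic mean prices, and the Carli index as the arithmetic mean of price relatives. The function names and the two-variety price vectors are hypothetical.

```python
import math


def jevons(p0, p1):
    """Geometric mean of the price relatives p1/p0."""
    return math.exp(sum(math.log(b / a) for a, b in zip(p0, p1)) / len(p0))


def dutot(p0, p1):
    """Ratio of the arithmetic mean price in the current and base periods."""
    return (sum(p1) / len(p1)) / (sum(p0) / len(p0))


def carli(p0, p1):
    """Arithmetic mean of the price relatives p1/p0."""
    return sum(b / a for a, b in zip(p0, p1)) / len(p0)


base_prices = [2.0, 4.0]     # hypothetical base-period prices
current_prices = [4.0, 4.0]  # hypothetical current-period prices

j = jevons(base_prices, current_prices)
d = dutot(base_prices, current_prices)
c = carli(base_prices, current_prices)
```

On the same matched prices the three formulas generally give different results, which is one reason why the pairing of the sampling design with the index formula matters.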

**4.20** These estimators are “approximately” unbiased in the sense that the bias tends to zero if the sample size increases. It can be shown that the finite sample bias of the Jevons index is always positive. Under SRS, the sample Jevons price index is on average larger than the population Jevons price index. For the Dutot index, the sign of the finite sample bias is undetermined, although its magnitude is often negligible. However, for the Jevons index the bias may be significant if sample sizes are very small. Thus, when using the geometric mean, very small sample sizes should be avoided (see Bradley [2007]). Increasing the sample sizes will not only reduce finite sample bias, but also reduce the variance caused by sampling. As a rule of thumb and unless price change variance is very low, at least eight to ten observations for each elementary aggregate should be selected for the compilation of the Jevons index.
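The positive finite sample bias of the Jevons index under SRS can be illustrated by enumerating every possible sample from a very small population. The five price relatives below are hypothetical, as are the function names. The sketch only illustrates the direction of the bias and how it shrinks as the sample grows.

```python
import math
from itertools import combinations


def jevons_from_relatives(relatives):
    """Geometric mean of a collection of price relatives."""
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))


def expected_sample_jevons(relatives, n):
    """Average of the sample Jevons index over all SRS samples of size n."""
    samples = list(combinations(relatives, n))
    return sum(jevons_from_relatives(s) for s in samples) / len(samples)


population = [0.8, 1.0, 1.1, 1.5, 2.0]  # hypothetical price relatives
target = jevons_from_relatives(population)

# Expected value of the sample Jevons index for each possible sample size.
expected = {n: expected_sample_jevons(population, n) for n in range(1, 6)}
bias = {n: expected[n] - target for n in expected}
```

In this enumeration the bias is positive for every sample size below the population size and declines as *n* increases, vanishing when the whole population is observed.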

### Nonprobability Sampling Techniques

**4.21** In many circumstances, detailed sampling frames are not available, and nonprobability sampling techniques are applied, because they do not necessarily rely on sampling frames. Moreover, such techniques allow, to a certain extent, more control over the samples than random approaches. For example, the costs and the feasibility of price collection play a key role when designing samples for a CPI, especially given that the sample has to be priced continuously over the next periods. Examples of nonprobability sampling techniques that are often applied in a CPI context include cutoff sampling, quota sampling, and the representative item method.

#### Cutoff Sampling

**4.22** Cutoff sampling refers to the practice of choosing the *n* largest sampling units with certainty and giving the remaining units a zero chance of inclusion. To apply cutoff sampling, a sampling frame is required that contains an appropriate variable measuring the size of the unit. Moreover, the quality of that variable has to be judged when performing cutoff sampling. The term “cutoff” refers to the threshold value between the included and the excluded units. Unlike with probability sampling, there is no randomness involved. All units above the threshold are selected with certainty, whereas those below the threshold are excluded.

**4.23** The disadvantage of cutoff sampling is that it does not produce unbiased estimators, since the small units may display price movements that systematically differ from those of the larger units. Any estimator resulting from cutoff sampling has zero variance because there is only one sample that can be drawn for a given threshold. With regard to mean square error, cutoff sampling might be a good choice if the variance reduction more than offsets the introduction of a small bias.

**4.24** Cutoff sampling can be considered when the size variable is highly skewed. The goal is to select the largest units, which are few in number, and to disregard the many small units. This design is also useful when the “below cutoff section” of the universe is considered insignificant and difficult to measure. A CPI practice considered as cutoff sampling occurs when the price collector selects the most sold product in an outlet, within a centrally defined specification. In this case, the sample size is one (in each outlet).

**4.25** Table 4.2 shows the three largest outlets that have been selected using cutoff sampling, with the size of the outlet being measured by its number of employees (column B). This cutoff sampling technique allows covering 54 percent of the total number of employees (column D). If the cutoff rule is defined to cover at least 70 percent of the total number of employees, two additional outlets must be selected. Instead of fixing the sample size, the cutoff rule can be defined with regard to the cumulative size variable for a certain threshold. Contrary to PPS sampling, an outlet with a small number of employees will systematically be excluded in cutoff sampling. At the same time, this technique ensures that all the largest units are always covered.

**Table 4.2. Cutoff Sample of 3 out of 10 Outlets**

Outlet | Number of Employees (x) | Cumulative x | Cumulative x (percent) | Included in Cutoff Sampling When n = 3 |
---|---|---|---|---|
(a) | (b) | (c) | (d) | (e) |
Outlet 6 | 25 | 25 | 28 | X |
Outlet 1 | 13 | 38 | 42 | X |
Outlet 9 | 11 | 49 | 54 | X |
Outlet 7 | 10 | 59 | 66 | |
Outlet 4 | 9 | 68 | 76 | |
Outlet 10 | 8 | 76 | 84 | |
Outlet 8 | 6 | 82 | 91 | |
Outlet 3 | 5 | 87 | 97 | |
Outlet 2 | 2 | 89 | 99 | |
Outlet 5 | 1 | 90 | 100 | |
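Both cutoff rules discussed in paragraph 4.25, a fixed sample size and a minimum coverage of the size variable, can be sketched in Python using the employee counts of Table 4.2. The function names are hypothetical. Ties between equally sized units are resolved by frame order, which is an assumption of this sketch.

```python
def cutoff_by_count(sizes, n):
    """Return the identifiers of the n largest units."""
    ranked = sorted(sizes, key=sizes.get, reverse=True)
    return ranked[:n]


def cutoff_by_coverage(sizes, min_share):
    """Return the largest units needed to cover at least min_share of total size."""
    total = sum(sizes.values())
    chosen, covered = [], 0
    for unit in sorted(sizes, key=sizes.get, reverse=True):
        if covered >= min_share * total:
            break
        chosen.append(unit)
        covered += sizes[unit]
    return chosen


# Outlet number -> number of employees, as in Table 4.2.
EMPLOYEES = {1: 13, 2: 2, 3: 5, 4: 9, 5: 1, 6: 25, 7: 10, 8: 6, 9: 11, 10: 8}

top_three = cutoff_by_count(EMPLOYEES, 3)             # -> [6, 1, 9]
coverage = sum(EMPLOYEES[u] for u in top_three) / sum(EMPLOYEES.values())
seventy_percent = cutoff_by_coverage(EMPLOYEES, 0.70)  # -> [6, 1, 9, 7, 4]
```

The three largest outlets cover 49 of the 90 employees (about 54 percent), and reaching 70 percent requires adding outlets 7 and 4, in line with the table.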

#### Quota Sampling

**4.26** Quota sampling is another nonprobability sampling technique used in the CPI. Many product groups, even rather small ones, are quite heterogeneous in nature, and the price varies according to many subgroups or characteristics. There may be different price movements within such a product group, and a procedure to represent the group by one or a few tightly specified item types may then carry a substantial risk of bias.

**4.27** In quota sampling, the sample selection is made using judgmental procedures with respect to known and relevant characteristics when choosing products to price or the type of outlet to select. The sample is drawn to have the same proportions as the total population or universe to ensure that the sample is representative. Consequently, the sample is self-weighted. Quota sampling is a stratified sampling method with a sample allocation that is proportional to the stratum weights and where the sampling within a stratum is conducted in a judgmental way.

**4.28** The following example illustrates the concept of quota sampling. If there is a characteristic that allows dividing a product group into three product types (for example, Red Delicious, Granny Smith, and Honeycrisp apples), representing 30, 20, and 50 percent of household expenditure, respectively, then the sampled prices should be allocated in the same proportions. If there are 10 prices to be collected, a quota sampling approach recommends that three prices are collected for the first product type, two for the second, and five for the third. If the Dutot index is used, then quantities should be used instead of expenditure for defining the importance of the strata.
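The proportional allocation in this example can be sketched in Python. The function name is hypothetical, and the largest-remainder rounding used when the shares do not divide evenly is an assumption of this sketch, since the text does not prescribe a rounding rule.

```python
def quota_allocation(shares, total_prices):
    """Allocate total_prices across strata in proportion to their shares.

    Fractional allocations are rounded down, and the leftover prices go to
    the strata with the largest remainders (an assumed rounding rule).
    """
    total_share = sum(shares.values())
    exact = {k: total_prices * s / total_share for k, s in shares.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total_prices - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


# Expenditure shares in percent, from the apple example.
APPLE_SHARES = {"Red Delicious": 30, "Granny Smith": 20, "Honeycrisp": 50}

ten_prices = quota_allocation(APPLE_SHARES, 10)
# -> {"Red Delicious": 3, "Granny Smith": 2, "Honeycrisp": 5}
seven_prices = quota_allocation(APPLE_SHARES, 7)  # shares no longer divide evenly
```

With 10 prices the allocation is exact. With 7 prices the rounding rule decides which product type receives the extra observation, so the sample is only approximately self-weighted.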

#### Representative Item Method

**4.29** In the representative item method, one or several item specifications are defined to represent the item. For example, spaghetti may be a representative item for “Pasta.” In this example, the price changes for pasta will be measured using the sampled price changes for spaghetti. According to this method, only varieties matching the specification are priced and no varieties falling outside the specifications will enter the index. The sampling of representative items for a CPI is usually purposive or judgmental, because of the lack of adequate sampling frames that formally list all the possible product types.

**4.30** If the representative item method is used for sampling products, the number of items must be sufficiently large to properly represent the diversity of products that can be found in a given product category. Prices that belong to the same representative item may be relatively homogeneous, both with regard to price level and price change. What matters is the variance in price changes between the different representative items that could be selected. Sampling variance can be reduced by specifying more representative items in those product categories where the price change variance between the representative items is large.

### Stratification

**4.31** A common sampling technique used in the CPI is to divide, in advance, the universe into disjoint homogeneous subpopulations or strata. An independent sample of appropriate size is then selected for each stratum using any of the probability or nonprobability sampling techniques previously described in this chapter. Stratified systematic random sampling is a sampling technique in which systematic random sampling is conducted independently within each of the predefined strata.

**4.32** Stratification can be applied to all of the different samples relevant for the CPI. In practice, the CPI is typically stratified by geography (for example, region, city, rural, or urban), by type of outlet or by outlet, and by item/product. Samples are set up within each geographic area, for each of the different outlet types, and for each item/product within outlets. For example, suppose that the CPI classification contains the item refrigerators that can be stratified by region. In a given region, refrigerators are primarily sold in larger specialized shop chains and in several small independent shops. In addition, refrigerators can be classified according to their capacity. Such a stratification structure is illustrated in Table 4.3. Within each cell, prices can be sampled using a two-stage approach. In the first step, specialized chains and independent shops that sell this type of goods are selected within the different regions. In the selected outlets, specific refrigerator models that meet the product-type specifications are then identified for continuous pricing.

**Table 4.3. Stratification by Region, Outlet Type, and Product Type**

Product Type | Region 1: Specialized Chains | Region 1: Independent Shops | Region 2: Specialized Chains | Region 2: Independent Shops |
---|---|---|---|---|
Refrigerators, small capacity | | | | |
Refrigerators, large capacity | | | | |

**4.33** Formally, if a subcomponent of the CPI is composed of several strata and the subindex for stratum *k* is labeled *I _{k}*, then the subcomponent’s estimate is obtained using the strata’s weights and indices:

$$
I = \frac{\sum_{k} w_{k} I_{k}}{\sum_{k} w_{k}}
$$

where *w _{k}* is the weight of stratum *k*. The weight of the stratum corresponds to the expenditure of that stratum, and not only to the (smaller) expenditure of the locations, outlets, or varieties that have been sampled within that stratum.
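The weighted aggregation of stratum indices described in paragraph 4.33 can be sketched as follows. The function name, the stratum labels, and the numbers are hypothetical. The weights are taken to be stratum expenditures and are normalized inside the function.

```python
def aggregate_strata(indices, weights):
    """Weighted arithmetic mean of stratum indices.

    weights holds the expenditure of each whole stratum, not just the
    expenditure of the units that happen to be sampled within it.
    """
    total = sum(weights[k] for k in indices)
    return sum(weights[k] * indices[k] for k in indices) / total


# Hypothetical stratum indices and stratum expenditures.
stratum_indices = {"region_1": 100.0, "region_2": 110.0}
stratum_expenditure = {"region_1": 3.0, "region_2": 1.0}

subcomponent = aggregate_strata(stratum_indices, stratum_expenditure)  # -> 102.5
```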

**4.34** Ideally, the stratification should be designed to minimize the sampling error. The strata should be constructed so that the variance of the price changes within strata is low, while the variance between the strata can remain large. A low price change variance within strata also has the advantage that results obtained using different price indices are likely to be similar. Consequently, the choice of the elementary aggregate price index will have a smaller impact on the final results.

**4.35** Stratified sampling avoids selection bias by ensuring that the universe is appropriately represented, as it controls for the fact that prices must be collected for each specified stratum. Without stratification, certain parts of the universe may be completely ignored. In addition, the use of explicit weights, if available, guarantees that the prices are weighted according to the importance of each stratum. In price statistics, stratification thus helps reduce the bias of the estimates.

**4.36** Stratification is a useful strategy to make the sample more efficient. The allocation of the number of prices across the strata is further discussed later in this chapter (paragraphs 4.83–4.94). Furthermore, the availability of detailed subindices can be convenient for meeting specific publication needs. The impact of stratification on sample sizes using different elementary aggregation formulas has been analyzed by De Gregorio (2012).

**4.37** Although the universe can exhaustively be divided into a set of strata *K*, there are circumstances where only a sample of these strata *S(K)* is used in practice for index compilation. Only prices belonging to the sampled strata *S(K)* will be collected. The strata can be selected purposively. One strategy could be to select only the most important strata, which is equivalent to running a cutoff sampling procedure. The bias that results from this process depends on the weight of the omitted strata and on the difference between the price change of the omitted strata and the sampled prices. If the weight of the omitted strata is small or their price change is similar to the price change of the sampled prices, then the bias may be small.
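The dependence of the bias on the two factors named in paragraph 4.37 can be made explicit with a short derivation. The notation is an assumption introduced here: *w_O* is the weight share of the omitted strata, *I_O* their (unobserved) index, *I_S* the index of the sampled strata, taken as measured without error, and *Î* the compiled index.

$$
I = (1 - w_{O})\, I_{S} + w_{O}\, I_{O}, \qquad \hat{I} = I_{S} \quad\Longrightarrow\quad \hat{I} - I = w_{O}\,\big(I_{S} - I_{O}\big)
$$

The bias is therefore the product of the omitted weight share and the gap between the two price changes, so it is small whenever either factor is small.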

**4.38** If the strata composing the index structure are selected using probability sampling, then the aggregation of the stratum indices should be adjusted using sampling weights. In particular, if expenditure is used as the size variable in PPS sampling, the higher-level estimate is compiled as the simple arithmetic average of the sampled stratum indices. In the example shown in Table 4.1, three outlets were selected using PPS sampling. If outlets 4, 6, and 10 each represent a stratum in the index structure, then the weight associated with each stratum does not correspond to the actual weight of that outlet. Instead, an unbiased estimate is obtained by equally weighting the three stratum indices:

$$
I = \frac{1}{3}\left(I_{4} + I_{6} + I_{10}\right)
$$

**4.39** A challenge arises in systematic PPS sampling when the size of a stratum is larger than the interval value. In such a situation, a distinction must be made between the larger strata that are selected with certainty (“take-all” strata) and the other strata that are sampled after the larger units have been removed (the “take-some” strata). The higher-level index is then obtained as a weighted average of two subindices. The first subindex covers all the certainty strata, while the second subindex contains a sample of some of the lower weighted strata. In the first subindex, the actual weights of the certainty strata must be used in the aggregation, as these strata only represent themselves. In the second subindex, the remaining sampled strata are aggregated with equal weights.
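The two-part aggregation described in paragraph 4.39 can be sketched in Python. The function name and all numbers are hypothetical. The sketch assumes that the two subindices are combined using the total weight of the certainty strata and the total weight of all remaining strata (sampled or not), which the text does not state explicitly.

```python
def two_part_index(certainty, sampled, remainder_weight):
    """Combine take-all and take-some strata into a higher-level index.

    certainty: list of (weight, index) pairs for strata selected with certainty.
    sampled: list of indices for the PPS-sampled strata, aggregated equally.
    remainder_weight: total weight of all non-certainty strata (assumed to be
        the weight represented by the take-some subindex).
    """
    certainty_weight = sum(w for w, _ in certainty)
    take_all = sum(w * i for w, i in certainty) / certainty_weight
    take_some = sum(sampled) / len(sampled)
    total = certainty_weight + remainder_weight
    return (certainty_weight * take_all + remainder_weight * take_some) / total


# Hypothetical example: one certainty stratum carrying 40 percent of the weight
# and two sampled strata representing the remaining 60 percent.
higher_level = two_part_index(
    certainty=[(40.0, 105.0)],
    sampled=[100.0, 102.0],
    remainder_weight=60.0,
)
```

Here the take-all subindex is 105.0, the take-some subindex is 101.0, and the higher-level index is their weighted average with weights 40 and 60.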

## Sampling Stages in a Consumer Price Index

**4.40** In practice, national statistical offices often adopt four levels of sampling in a CPI: locations, items within different sections of the expenditure classification, outlets within locations, and item varieties.^{2} These four steps are discussed in the following paragraphs. The product level is related to the outlet level. At some point, the sample of items must be matched to the sample of outlets to decide which items are priced in which outlets. Time is an additional sampling level that must be considered. If timing is relevant, it may be necessary to incorporate timing elements in the outlet or item specifications.

### Sampling of Locations

**4.41** The objective of the first sampling stage is to select the locations where prices are going to be collected. A location is an area where outlets that sell products to households can be found. Ideally, a large part of the CPI product basket should be available to purchase within each location. To obtain a representative sample of outlets, it may be necessary to create locations by combining, for example, out-of-town shopping areas that mainly sell nonfood items with a neighboring urban area in which food is available. The sampling of locations can be omitted if outlets are directly sampled independently of where they are located.

**4.42** To define the list of possible locations, a decision must be made on the geographical scope of the price collection areas. For example, price collection can be limited to the capital city only or to the main cities of a country or of a region. Some countries restrict price collection to urban areas because of the challenges and costs of collecting prices in rural areas. Alternatively, it can be decided to select locations belonging to both urban and rural areas. Under the broadest geographical scope, locations must be sampled to be representative of the national territory as a whole (as further discussed in Chapter 2).

**4.43** The sampling of locations is typically conducted using regional stratification. Location selection can then be done separately within each region. For instance, the strata used for sampling can be aligned with the administrative subdivisions of a country and take into account the distinction between urban and rural areas. In principle, stratification works best if the locations of a stratum are homogeneous. In that sense, the territory could also be partitioned by grouping together the locations that are situated close to each other or that have similar sociodemographic characteristics. If no regional stratification is used, locations are sampled directly at the national level.

**4.44** Within each region, some locations may be selected with certainty, in which case the selection process is not random. This can be the case for the main cities. Locations can also be selected randomly, using, for example, systematic PPS sampling. To apply this kind of technique, a variable that measures the size of the locations must be available and the number of sampled locations within a region must be fixed in advance. As the information on expenditure made by households in outlets situated in a given location is rarely available, proxy measures must be used. The number of households living in a location can be used as a size variable, although the residence of a household does not necessarily coincide with the place of purchase. Population size can be obtained from population registers or from a recent census. The number of locations to be selected in each region can be determined as the proportion of national expenditure, income, or a corresponding proxy measure, such as regional gross domestic product, in that region, multiplied by the total number of locations to be visited nationally.

**4.45** There are operational implications for setting up a new price collection in the sampled locations, and the process described previously may need to be modified for practical considerations. For example, because data collection costs are important, it may not be efficient to send a price collector to a small isolated location where many of the items in the CPI basket are not available for pricing. Location resampling might be needed, although it is likely to be conducted on a less frequent basis than outlet or product resampling. Sample coordination techniques can be used to adjust the initial selection probabilities to increase the likelihood of an overlap between the old and the new sample of locations.

### Sampling of Items

**4.46** Cutoff sampling can be applied to select the items that comprise the CPI basket. Expenditure data on items are typically obtained from a household budget survey (see Chapter 3 for more information on data sources). However, not all items for which expenditure data are available might be included in the CPI basket. There may be low-expenditure items for which it would not be an efficient use of limited resources to collect prices (although they are within the scope of the CPI). Thresholds can be defined to identify the low-expenditure items that will not be included in the CPI sample.

**4.47** In practice, an expenditure share is estimated for all the items that could be included in the basket. If this share is below a given threshold, the item could be excluded. For example, it might be decided to exclude from the index items with an expenditure share lower than 0.1 percent for food items and 0.2 percent for nonfood items. A lower minimum threshold for the food items might be set because the prices of these products tend to display a higher variability and because prices for food products are usually less costly to collect. If an item is excluded, its expenditure may be redistributed to another expenditure group with similar content and price development.
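As an illustration, the cutoff rule can be written as a short function. This is a sketch only: the thresholds are the example values from the paragraph above, while the item names, shares, and the rule that spreads an excluded share proportionally over the retained items of the same group are assumptions made for the example.

```python
from collections import defaultdict

# Example thresholds from the text, in percent of total expenditure.
THRESHOLDS = {"food": 0.1, "nonfood": 0.2}


def cutoff_sample(items, thresholds=THRESHOLDS):
    """Return {item name: adjusted share} for items at or above the cutoff.

    items: list of (name, category, group, share_percent) tuples.
    The share of each excluded item is spread proportionally over the
    retained items of the same group (an assumed redistribution rule).
    If every item of a group is excluded, its share is dropped here and
    would need to be reassigned manually.
    """
    kept = [it for it in items if it[3] >= thresholds[it[1]]]
    dropped_by_group = defaultdict(float)
    for name, category, group, share in items:
        if share < thresholds[category]:
            dropped_by_group[group] += share
    kept_by_group = defaultdict(float)
    for name, category, group, share in kept:
        kept_by_group[group] += share

    result = {}
    for name, category, group, share in kept:
        factor = 1 + dropped_by_group[group] / kept_by_group[group]
        result[name] = share * factor
    return result


# Hypothetical expenditure shares (percent).
items = [
    ("rice", "food", "cereals", 1.50),
    ("quinoa", "food", "cereals", 0.05),
    ("shoes", "nonfood", "footwear", 0.90),
    ("shoelaces", "nonfood", "footwear", 0.15),
]
basket = cutoff_sample(items)
```

In this hypothetical case, "shoelaces" (0.15 percent) would pass the food threshold but falls below the nonfood threshold, so it is excluded and its share is added to "shoes".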

**4.48** The cutoff process can be conducted at the national level or separately within a region. The list of items used in each region may differ. This approach ensures that the regional baskets are representative of the consumption structure within each region. Manual adjustments may also be needed so that an item that falls below the threshold can nevertheless be included in the basket. For example, it may not be so costly to collect a few additional prices if these are collected together with prices for more important items. The additional costs may be acceptable given the advantage of achieving wider product coverage.

**4.49** The descriptions (or detailed specifications) of the items that are included in the CPI basket can be general or more detailed. This largely depends on how detailed expenditure data were recorded in the household budget survey. If information is available from other sources, a finer stratification can be developed. In any case, the universe of varieties that fall within the scope of an item is often quite large. Consequently, additional sampling is required at this level for the selection of the specific varieties to be priced over time.

**4.50** To provide effective control over the sample of prices collected in the field, a list of product types (items) can be drawn up by the national statistical office using specifications. This is what is referred to as the representative item method (as described in 4.29–4.30). The product group is thus represented by a sample of specified product types (items). The selection of representative items at a central level ensures that they are representative of consumer behavior, rather than of one shopping location. Consistent with the principle of a fixed basket, the representative items are usually reviewed periodically, at the same time as the weights are updated and chain linking is carried out. The disadvantage of this method is that the nationally specified items may be unavailable in some locations and may not reflect regional tastes and preferences.

**4.51** A representative variety can be defined using either loose or tight specifications. For example, it may be decided to collect prices for “Tea,” which is a loose specification, as it encompasses many different tea varieties that could be selected to be priced. Alternatively, the representative variety may be specified tightly, as a specific variety of tea defined by a brand, flavor, and package size. The choice between loose and tight specifications has implications for the choice of the elementary aggregate formula (see Chapter 8 for more information).

**4.52** If specifications are too tight, the price collectors may have difficulty in finding the varieties, and this will lead to fewer price quotes and a deficit in the sample. There is also a risk that tight specifications will miss important parts of the market, and consequently, results can be biased. For these reasons, the representative variety method is more suitable for homogeneous product groups. In heterogeneous groups, it is more likely that important segments of the item universe, with different price movements, will be left out.

**4.53** Loose specifications give the price collector some freedom in choosing locally popular items and varieties and in adjusting the sample to match local conditions. This will normally lead to greater representativity of the sample, as it will reflect regional tastes and preferences. However, this approach may require training for price collectors on how to select within the outlet a variety to be priced that matches the loose specifications. This requires a more formal process for selecting the varieties within an outlet. Even with loose specifications, the variety selected in the outlet must be described in sufficient detail by the price collector to ensure that the same variety is priced over time and that, in case of a replacement, a correct quality adjustment can be made.

### Sampling of Outlets

**4.54** Different data sources can be used as sampling frames for selecting outlets. In the CPI, an outlet should be defined broadly to include any retail channel that sells goods or services to households, including different types of physical shops, online shops, or the offices of providers of utilities even when a physical shop that customers can visit is not available. Outlets to be visited in the field may be selected only in locations that were previously identified as price collection centers.

**4.55** Business registers can be a starting point for selecting outlets. Business registers are databases that list the business units resident in a given area and contain information such as location, type of business activity, turnover, and employment. Business registers should be updated regularly for new businesses and for closures. One limitation of business registers is that, in some cases, data are available only at the level of establishment or enterprise, and the information at the local level is not always available or regularly updated. Provided that some size measure is included in the database, such as number of employees, business registers can be used for selecting a PPS sample. Similarly, tax data such as value-added tax records may be used as a data source that has a wide coverage and contains timely information on outlets. Tax records typically refer to a legal unit that can be composed of one or several outlets. With tax data, the selection of random samples can usually be based on gross sales as the measure of size.

**4.56** Administrative records kept by local government, business associations, or marketplace managers are another potential data source for outlet sample selection. These records can be used to create sampling frames and might be particularly useful for sampling local markets. Depending on the availability of information on size, these sources might provide a frame for PPS sampling. Business telephone directories usually contain less information, such as the business name, address, and activity, and do not include any size information. Therefore, telephone directories are useful for simple random sampling or systematic sampling, but not for PPS sampling unless additional information is made available, for example, by visiting the outlets.

**4.57** Information on outlets can also be obtained from surveys designed for this purpose. For example, the objective of point-of-purchase surveys is to collect information on the place where purchases are made by households and on the amount of these purchases. In some cases, the household budget survey includes information on the outlets or type of outlet from where the households purchased the goods and services recorded. Surveys that include this information could be used as a source of relevant data on outlets for sampling frames. However, as these surveys are based on a sample, they are unlikely to provide an exhaustive list of all outlets and the results obtained may become quickly out of date.

**4.58** If the data sources described previously are not available, it is necessary to enumerate the outlets within a selected location to provide supplementary information for the sampling frame or even to construct a sampling frame. This enumeration can be made by price collectors and their supervisors, visiting each location and noting details (for example, name, type of outlet, and items sold) of all retail outlets found. This can be costly, and the price statistician will need to compare the costs and benefits of a more representative sample. To reduce costs, the number of outlets enumerated per location may be limited by enumerating only a subdivision of a location. The ideal size measure of an outlet is turnover. However, a proxy may be used when an estimate of turnover is not available. For instance, the approximate net retail floor space estimated by the outlet enumerators may provide an adequate alternative. Additionally, it is important that the assortment of goods and services sold by the outlet is recorded, to allow matching the outlets with the CPI product basket.

**4.59** If none of the formal sampling techniques are feasible, outlets must be selected in a more judgmental way. For example, field staff can identify the largest retail and service outlets where households typically shop for a given type of product. Additional guidance may be provided to the price collector by specifying the type of outlet or by defining the area where outlets are to be selected, to ensure that the overall sample of outlets remains representative of where households shop. This sampling method often involves visits by the price collector to the outlet to gather the necessary information on the outlet’s characteristics before deciding which outlets are included in the sample and before the first price collection is conducted. For example, if prices for children’s clothes are to be collected, a price collector may visit a clothing shop initially selected from a central list to verify and check whether it sells clothes for children and not just for adults.

**4.60** The sampling strategy for outlets typically relies on stratification by outlet type. Within each stratum, outlets are sampled randomly or purposively if no appropriate sampling frame is available. For example, the number of outlets to be sampled in each stratum could be proportional to the weight of that stratum. In any case, all significant outlet types should be covered. If relevant, the internet can also be considered as an outlet type or stratum from which specific websites are selected.
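The proportional allocation mentioned in this paragraph can be sketched as follows. The function name, the outlet types, and their weights are hypothetical. The rule that reserves at least one outlet per stratum before sharing out the rest is one possible way (an assumption, not a prescribed method) to ensure that all significant outlet types are covered.

```python
import math


def proportional_allocation(total, weights, minimum=1):
    """Split `total` sampled outlets across strata in proportion to weight.

    weights: dict mapping stratum name -> weight (e.g. market share).
    Each stratum first receives `minimum` outlets; the rest is shared out
    proportionally, with largest-remainder rounding to whole outlets.
    """
    names = list(weights)
    if total < minimum * len(names):
        raise ValueError("total is too small to cover every stratum")
    weight_sum = sum(weights.values())
    rest = total - minimum * len(names)
    exact = {k: rest * weights[k] / weight_sum for k in names}
    alloc = {k: minimum + math.floor(exact[k]) for k in names}

    # Hand out the units lost to rounding, largest fractional part first.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(names, key=lambda k: exact[k] - math.floor(exact[k]),
                          reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical outlet-type weights.
outlet_weights = {"supermarkets": 0.55, "specialist shops": 0.25,
                  "markets": 0.15, "online": 0.05}
allocation = proportional_allocation(20, outlet_weights)
```

Reserving a minimum per stratum departs slightly from strict proportionality, which may be acceptable when coverage of small outlet types matters more than exact proportions.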

**4.61** As similar price trends can be expected in the outlets that belong to the same chain, it may be useful to also stratify by retail chain. For example, the sample design could include a first stratum that contains all the major retail chains as substrata and a second stratum that covers the remaining independent shops. In this case, outlets can be selected separately for each retail chain and from the remaining universe of independent shops. The index should be designed so that the retail chains and other outlet types are properly represented according to their market share.

### Sampling of Varieties

**4.62** In a final step, the selection of specific varieties within outlets is usually made by the price collector in a purposive way. The objective is to select the most sold variety that fits the item specifications. This process essentially corresponds to a type of cutoff sampling. In addition to being a high-volume seller, the variety is also expected to be available over time so that the same variety can easily be priced during the following periods.

**4.63** A compromise must sometimes be made between representativity and comparability. On the one hand, it is important to select varieties with high sales, as this will ensure that the samples are representative of what households are buying. On the other hand, the future availability of the variety must be considered to improve comparability over time by minimizing the need for replacements. However, there is a risk of bias if the choice of varieties is mainly driven by the convenience to collect. Chapter 5 discusses the price collection in more detail.

**4.64** With broad item specifications, many products available in an outlet can be selected for inclusion in the CPI sample. A more formal approach may be adopted by designing a process that approximates PPS sampling to the sales of each product.^{3} To apply this technique, all possible products or product types (items) must first be enumerated, and selection probabilities are determined for these products based on their sales, for example, using the information on sales obtained directly from the respondents. Alternatively, shelf space can be used to estimate sales proportions in some cases. Even with no information on sales, random sampling can still be conducted by simply assuming that all products have the same selection probability.
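A minimal sketch of such a selection within an outlet is given below. The function name `select_variety` and the example varieties and sales figures are hypothetical. The size measure could be sales reported by the respondent or shelf space. If no size information is available, all enumerated varieties are given the same selection probability.

```python
import random


def select_variety(sizes, rng=None):
    """Pick one variety with probability proportional to its size measure.

    sizes: dict mapping variety -> size measure (sales or shelf space),
           or None when no information is available.
    If any size is missing, equal selection probabilities are assumed.
    """
    rng = rng or random.Random()
    varieties = list(sizes)
    weights = [sizes[v] for v in varieties]
    if any(w is None for w in weights):
        weights = [1.0] * len(varieties)
    return rng.choices(varieties, weights=weights, k=1)[0]


# Hypothetical varieties matching a loose "Tea" specification, with
# approximate monthly sales as reported by the outlet.
tea_sales = {"black tea, 100 bags": 600.0,
             "green tea, 50 bags": 300.0,
             "herbal tea, 20 bags": 100.0}
chosen = select_variety(tea_sales, rng=random.Random(42))
```

Using a seeded generator, as in the usage line, makes the draw reproducible and documentable, which can help when the selection has to be justified later.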

**4.65** If the data for the sales measure are taken during a very short period, this may coincide with a special promotion or sale with temporarily reduced prices. If a variety with a temporarily reduced price is given a larger probability of selection, and if this price is likely to increase more than average, an upward bias may result. It is thus essential that the sampling takes place before the first price collection or that sales values from an earlier period are used. Generally, annual sales values would be the best measure.

**4.66** Administrative data sources are sometimes available with detailed expenditure data by product or by product group. Such data sources can help build detailed stratifications and serve as a sampling frame for selecting the individual varieties to be priced.

**4.67** If no explicit weights are used in the elementary aggregate compilation, the sample will be implicitly weighted by the number of observations. For example, if prices of the same two varieties are collected in five outlets, then each outlet and variety implicitly have the same weight. The elementary aggregate is often the lowest level for which weights are available. However, efforts can be made to improve the accuracy of this detailed subindex using implicit or explicit weights. The sample can be designed so that the proportions within an elementary aggregate represent those of the universe with respect to the outlet type or outlet, if these aspects are not explicitly weighted in the CPI aggregation structure. Similarly, the sample should be well balanced with respect to product type (item) or item characteristics. For example, if local beer has an approximate market share of 80 percent and imported beer of 20 percent, these proportions can be reflected in the elementary aggregate. Chapter 8 discusses the use of weights within elementary aggregates in more detail.

### Sampling in Time

**4.68** The timing of price collection is chosen purposively in most cases. The main principle is that the prices of each item should be collected each month at the same time, during the same week or the same day of the month. If there is some price variation within a day, it is important that prices are collected always at the same time of the day. Price collection can be limited to a certain week of the month, for example, during the middle of the month, or it can be spread over several weeks of the month according to some pattern, for example, in different weeks in different regions or for different product groups.

**4.69** If prices are known to fluctuate considerably during a month, then an increased price collection frequency should be considered. In this case, it can make sense to collect prices once a week or sometimes once a day instead of once a month, as practical. This could be the case for energy products or for fresh food.

**4.70** Some products have a time-dependent component that should be incorporated into the product specifications. For example, for airfares or other transport services, the price can be highly dependent on the day of the week, the time of the flight, and how long in advance the ticket is purchased. Such timing elements should be held constant to ensure the comparability of the collected prices over time. Moreover, the specifications including their time-dependent components should be defined to be representative of consumer behavior.

## Central Price Collection

**4.71** For some products, prices can be collected centrally by CPI staff. These include those prices obtained from a specific data source or by a survey conducted centrally. The same general principles apply as with price collection in the field in the sense that outlets or respondents must be chosen and specific varieties for regular pricing must be selected. However, there is no operational need to select specific price collection areas in advance, and a broader geographical coverage of price collection can often be achieved. Moreover, it is possible at the head office to design specific sampling strategies adapted to the product or to the data sources at hand. Some examples of central price collection are discussed in the following text.

### Scanner Data

**4.72** When using scanner data, the product and the outlet sampling levels are closely related. A two-stage sampling approach can be adopted where outlets are sampled first, possibly belonging to the same retail chain or outlet type, and then varieties are selected within the sampled outlets. If price levels are similar between outlets or if data are not available by outlet, varieties can also be sampled at a more aggregated level, for instance at the level of an entire retail chain. Eventually, the compilation methods can be adapted to use all the available data on both prices and quantities, and no sampling is thus involved. Even if scanner data cover “all” transactions, these data do not necessarily cover the entire universe. In principle, the remaining part of the universe should also be covered if it is significant, if it is expected to have different price developments, and if there are sufficient resources to maintain separate price indices not derived from scanner data. Weights must then be used at some level of the index hierarchy so that each part is appropriately represented. Chapter 10 discusses the use of scanner data in detail.

### Tariffs

**4.73** For many tariffs, there is often a national pricing strategy and the regional dimension discussed previously becomes irrelevant. There may only be a limited number of providers (sometimes only one), and it is straightforward to select either all of them or at least the most important ones, and these can then be explicitly weighted in the index. For tariffs, it is often possible to obtain additional information on products directly from the providers, authorities, regulatory bodies, or other administrative data sources. Consumer profiles are sometimes defined (sampled) to price the tariff structure. Chapter 11 discusses the different price measurement approaches for tariffs.

### Rents

**4.74** Examples of sampling frames used for rental dwellings are registers of dwellings, lists of addresses, or registers of tenants and landlords. Information may also be available in the microdata of a census of buildings and dwellings. Stratified sampling would be the preferred option to select rents. Typical stratification variables that correlate with rental change are the location, type, and size of the dwelling, the type of rent (social/fixed/regulated rent or market rent), or the type of landlord. Ideally, each stratum should be weighted according to its total rental expenditure and random samples should be drawn within each stratum. If no sampling frame is available and it is too costly to build one, two-stage cluster sampling can be a practical option. The country or the region has first to be divided into detailed geographic subdivisions (“clusters”) from which a sample can be drawn. The in-scope rental dwellings of the selected clusters are then enumerated so that a random sample of rents in each selected cluster can eventually be obtained. An alternative approach is to collect rental data directly from real estate agencies. Chapter 11 examines the treatment of housing in a CPI.

### Chains with National Pricing Strategies

**4.75** Prices can be collected in a single outlet of a larger chain if it is believed that the same price applies in all the other outlets of the same chain across the country. This approach may however be subject to some caveats. First, there is no guarantee that prices are always the same in all the outlets of the same chain. For example, sales or promotions could be outlet-specific. Second, it must be ensured that the varieties selected in one outlet are representative of the consumption habits in the outlets located in the other regions. In any case, the prices must be appropriately weighted in the CPI. If a price quote from a large retail chain is combined with prices from other outlets within an elementary aggregate, an appropriate representation of the retail chain can be achieved through implicit weighting by replicating the price of the retail chain. Alternatively, specific elementary aggregates weighted according to the market shares of the different outlets could be constructed.
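The implicit weighting by replication can be illustrated with a small calculation. The sketch assumes that the elementary aggregate is compiled as an unweighted geometric mean of price relatives (the Jevons formula, chosen here only for illustration) and uses hypothetical price relatives and market shares.

```python
import math


def jevons(relatives):
    """Unweighted geometric mean of price relatives."""
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))


def replicate(quotes):
    """Expand (price_relative, copies) pairs into a flat list of relatives."""
    return [relative for relative, copies in quotes for _ in range(copies)]


# Hypothetical elementary aggregate: one chain with about 60 percent of
# the market and two independent shops with about 20 percent each.
# The chain quote is entered three times, each shop quote once.
quotes = [(1.10, 3), (1.00, 1), (1.00, 1)]
index_replicated = jevons(replicate(quotes))

# Equivalent explicitly weighted geometric mean with weights 0.6, 0.2, 0.2.
index_weighted = math.exp(0.6 * math.log(1.10)
                          + 0.2 * math.log(1.00)
                          + 0.2 * math.log(1.00))
```

Both calculations give the same result, which shows that replicating a quote is equivalent to giving it a weight proportional to the number of copies.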

### Internet

**4.76** For prices collected on the internet, a distinction must be made between prices that are merely collected online, so that there is no need for a price collector to visit an outlet, and prices that genuinely represent e-commerce transactions (online or web-based outlets with no physical location). In the latter case, one could still adopt the same sampling strategy as for price collection in the field, by first selecting websites and then selecting specific varieties to be priced. The selected websites are typically organized according to different product categories that can serve as strata for product sampling. Instead of collecting prices for only a few varieties, web scraping techniques can be used to obtain significantly more data in an automated way (see Annex 5.6). Prices collected on the internet can sometimes be highly volatile over time with some websites applying dynamic or personalized pricing strategies. This implies that prices should be collected more often. The number of price quotes collected on a website can be larger than the number of prices collected in the physical outlets because of lower price collection costs, a wider product coverage, or an increased price collection frequency. Chapter 11 discusses the treatment of internet purchases.

## Sample Maintenance

**4.77** The samples are drawn at a point in time and based on the information available at that moment. However, the situation in the market is seldom static. Products and outlets gain or lose importance over time. New products constantly enter the market while other products may no longer be available. New outlets can open in a certain location and previously popular outlets close. This means that the sample that was set up at an earlier period may not be representative of household expenditure at the current period.

**4.78** Over time, the quality of the sample can deteriorate. To avoid this deterioration, it is recommended that samples be continuously reviewed and adjusted as needed (see Chapter 7 for more information on maintaining the sample). If resources do not permit a systematic full review of products and outlets, a rolling review can be set up, where every year a different part of the basket is analyzed. The focus should be on products and outlet locations that are likely to be affected by major developments.

**4.79** In practice, feedback received from field staff can be considered to maintain the ongoing representativity of the samples. Price collectors and their supervisors should be encouraged to report if the item specifications used for price collection are no longer in line with the products sold. Through their local knowledge, the staff in charge of price collection may identify new outlets that are becoming popular and should be included in the CPI samples. An increasing number of missing prices can also be an indication that the outlets or the product specifications are no longer appropriate, and action must be taken to update the samples. Some national statistical offices have dedicated product specialists who monitor the markets of complex product groups and can assist with sampling and quality adjustments.

**4.80** Replacements are one way of preserving the representativity of the samples over time. In some cases, price collectors find that an outlet at which they have been collecting prices closes or no longer stocks one of the products being priced. Disappearing outlets and varieties within a specific outlet are handled by replacing on a one-to-one basis when the outlet closes or the variety ceases to be sold. A more proactive approach can also be adopted by anticipating the expected disappearance of an outlet or a variety so that the number of forced replacements is kept to a minimum.

**4.81** The criteria for selecting replacements will vary. An outlet can be replaced on a like-for-like basis by another outlet of the same type, in the same or similar location, selling the same or similar range of items and varieties. In replacing a variety, two strategies are usually adopted. If the initial selection rule was “most sold” or “probability proportional to (sales) size” then the replacement may follow the same rule with the advantage of maintaining the representativity of the sample. However, the replacement variety may no longer be comparable to the variety that it replaces. In this case, quality-adjustment methods must be applied to ensure the index measures pure price changes (see Chapter 6 for more information on quality adjustments). There is a risk of bias if the differences in quality are not properly adjusted. Alternatively, the replacement may be made based on the variety most similar to the one which disappeared, thereby reducing the need for quality adjustment. When replacements are performed, there is an inherent tension between representativity and comparability.

**4.82** The dynamic nature of the target universe can also be addressed through sample rotation involving a full or a partial resampling. Sample rotation is a technique where the length of time that outlets and varieties are included in the price survey sample is limited. In a given period (for example, every year or every two years), a defined proportion of the sample (for example, 20 percent or 25 percent) is rotated out and replaced with a new sample of outlets and varieties. The process of resampling can be based on any of the methods and sampling frames that are used for initial sample selection. Resampling in the CPI can apply to outlets or even locations. Technically, resampling involves an overlap period where the first period of the new sample overlaps with the last period of the old sample.
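A partial rotation scheme could be organized along the following lines. This is a sketch under stated assumptions: the sample is split into rotation panels of equal size, one panel is replaced per rotation period, replacements are drawn by simple random sampling, and units already in the sample are not eligible for immediate reselection. The function name and the outlet identifiers are hypothetical.

```python
import random


def rotate_sample(sample, frame, period, n_panels=4, rng=None):
    """Replace one rotation panel of the sample with fresh units.

    sample: dict mapping unit -> panel number (0 .. n_panels - 1).
    frame:  list of all eligible units in the sampling frame.
    Returns (new_sample, leaving, entering). During the overlap period,
    both the leaving and the entering units are priced.
    """
    rng = rng or random.Random()
    panel = period % n_panels
    leaving = [u for u, p in sample.items() if p == panel]
    staying = {u: p for u, p in sample.items() if p != panel}

    # Assumed rule: units currently in the sample cannot be reselected.
    # rng.sample raises ValueError if the frame has too few fresh units.
    eligible = [u for u in frame if u not in sample]
    entering = rng.sample(eligible, len(leaving))

    new_sample = dict(staying)
    new_sample.update({u: panel for u in entering})
    return new_sample, leaving, entering


# Hypothetical frame of 20 outlets and a sample of 8 in four panels,
# so that 25 percent of the sample is rotated in each period.
frame = [f"outlet_{i}" for i in range(20)]
sample = {f"outlet_{i}": i % 4 for i in range(8)}
new_sample, leaving, entering = rotate_sample(
    sample, frame, period=1, rng=random.Random(0))
```

With four panels, each outlet stays in the sample for four rotation periods, and the sample size remains constant from one period to the next.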

## Variance Estimation and Optimal Allocation

**4.83** A CPI is a statistic with a complex sampling design that involves different stages and that often relies on nonprobability sampling methods. Therefore, estimating the variance of a CPI is not a trivial task. Strictly speaking, variance estimation is only applicable in case of probability sampling, as the sampling mechanism is known. To the extent that samples in a CPI are not probability-based, variance estimates must rely on models in which some type of random sampling is assumed. Different approaches for variance estimation in a CPI have been developed, taking into account national circumstances. It is useful to know the sampling error of a CPI as this can inform the users about the level of precision of the published indices. The interpretation of such error margins depends on the specific approach adopted for variance estimation.

**4.84** A very general form of a CPI is

$$I = \sum_{k} w_k I_k$$

where $k$ denotes products, $w_k$ is the expenditure share of product $k$, and $I_k$ is the product price index. If the estimation of each product index was independent, the variance would be

$$V(I) = \sum_{k} w_k^2 \, V(I_k)$$

where $V(I_k)$ denotes the variance of the product index $k$. However, product indices are not statistically independent. For instance, the price change of different products collected in the same outlet can be related. The sampling errors of the product indices are therefore correlated, which means that this expression probably underestimates the total sampling errors to some extent.
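The effect of correlated sampling errors can be illustrated numerically. The sketch below computes the variance of a weighted index first under the independence assumption (squared weights times the product-index variances) and then with a full covariance matrix. All weights, variances, and covariances are hypothetical.

```python
def variance_independent(weights, variances):
    """Variance of the weighted index if product indices are independent."""
    return sum(w ** 2 * v for w, v in zip(weights, variances))


def variance_with_covariance(weights, cov):
    """Variance of the weighted index given a full covariance matrix."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))


# Hypothetical two-product index.
weights = [0.6, 0.4]
variances = [4.0, 9.0]
# Covariance of 3.0 corresponds to a correlation of 0.5 between the
# two product indices (standard deviations 2 and 3).
cov = [[4.0, 3.0],
       [3.0, 9.0]]

v_indep = variance_independent(weights, variances)
v_full = variance_with_covariance(weights, cov)
```

In this hypothetical case, ignoring the positive covariance gives a variance of about 2.88 instead of about 4.32, which illustrates why the independence-based expression tends to understate the sampling error.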

**4.85** From an operational point of view, variance estimation is a useful tool for deciding on the optimal allocation of sample prices. Many resources are spent on price collection to produce a CPI. It is therefore worthwhile to devote some effort to allocating these resources in the most efficient way. A better allocation of prices can often minimize the sampling error and consequently improve the quality of the CPI without increasing the resources associated with price collection activities. If a variance and a cost function are known, an optimization program can be defined to minimize the variance given the budget available for price collection. Alternatively, costs can be minimized under the constraint that a minimal variance must be achieved.

**4.86** The general approach to distribute the number of observations across strata was established by Neyman. If each stratum has a known weight $w_k$ and samples are drawn randomly and independently in each stratum, then the standard deviation of the individual price changes (not the price levels) within a stratum can be denoted by $s_k$. In addition, let $c_k$ be the cost to collect one unit in stratum $k$. If the total sample size is $n$, the following allocation of the number of observations $n_k$ to stratum $k$ is optimal:

$$n_k = n \, \frac{w_k s_k / \sqrt{c_k}}{\sum_{j} w_j s_j / \sqrt{c_j}}$$

**4.87** This expression formally shows that the number of observations in a stratum should be proportional to the weight of that stratum times the standard deviation of the price changes and inversely proportional to the square root of the unit cost. The expression can be simplified by assuming that costs are identical in all strata. The underlying idea is that more price observations are needed for those strata with a higher weight and a significant variance in price changes. Additionally, it is sufficient to allocate fewer prices to the strata with a smaller weight and more similar price changes.
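The allocation rule described above can be sketched in a few lines. The function name and the example weights are hypothetical. The standard deviations are expressed as rough factors of 1 (low), 1.5 (medium), and 2 (high), and unit costs are assumed identical across strata unless specified. The result is left unrounded; rounding to whole prices and applying a minimum sample size per stratum would be separate steps.

```python
import math


def neyman_allocation(n, weights, sds, costs=None):
    """Allocate n observations across strata.

    Each stratum receives a share proportional to
    weight * standard deviation / sqrt(unit cost).
    If costs is None, identical unit costs are assumed.
    Returns unrounded sample sizes that sum to n.
    """
    if costs is None:
        costs = [1.0] * len(weights)
    scores = [w * s / math.sqrt(c) for w, s, c in zip(weights, sds, costs)]
    total = sum(scores)
    return [n * score / total for score in scores]


# Hypothetical strata: weights and judgmental standard-deviation factors.
weights = [0.40, 0.30, 0.20, 0.10]
sd_factors = [1.0, 2.0, 1.5, 1.0]
allocation = neyman_allocation(100, weights, sd_factors)

# With equal standard deviations and costs, the rule reduces to
# allocation proportional to the weights.
proportional = neyman_allocation(100, [0.5, 0.3, 0.2], [1.0, 1.0, 1.0])
```

In this hypothetical example, the second stratum receives the largest share of the sample even though its weight is lower than that of the first stratum, because its price changes are assumed to be more variable.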

**4.88** In practice, there are challenges in systematically applying this type of allocation. While stratum weights are in principle available, the estimation of the standard deviation of price changes may be more problematic. One disadvantage of this approach is that it depends on the target price change. For instance, the prices can be compared to the previous month or to the same month of the previous year. The optimal allocation can be different if it is derived for monthly price changes or for annual price changes. Moreover, the standard deviation of price changes of a stratum is not necessarily constant over time.

**4.89** Data sources, such as previous CPI surveys, can be used to estimate the standard deviations of the price changes of each stratum. However, there is a risk that an estimation based only on past observed prices is biased. Consider the situation in which a stratum is represented by a unique item with very tight specifications: the prices collected in the different outlets may all change at a similar rate, and this will consequently lead to a low sample price change variance. The true population price change variance may be much larger than the estimate because of the unobserved price change behavior of the varieties that belong to the stratum but are not included in the CPI sample.

**4.90** Costs also need to be considered when defining sample sizes. It may be more time-consuming and complex, and hence more costly, to collect prices for some goods and services than for others. For instance, for products where quality adjustment plays an important role, the price collector may be asked to record many technical characteristics in addition to the price. Moreover, some outlets have larger assortments, meaning that more prices can be collected in one outlet, which reduces the cost per collected price. In practice, the prices could be classified into categories, such as “easy to collect” or “difficult to collect.” Such ordinal information could be used to derive an optimal allocation. A more advanced quantification of costs would be based on estimates of the average time per collected price by product or product group, while considering the travel time from one outlet to another.

**4.91** In Table 4.4, a sample of 100 prices needs to be distributed over five items. A first strategy consists of collecting the same number of prices for each item (option 1). An alternative approach links the number of prices to the weight, collecting more prices in the strata that have a larger weight (option 2). This reduces the sampling error of the subindices that, because of their weight, have more impact on the higher-level aggregates. In addition, judgments can be made on the magnitude of the variance: sample sizes should be increased (or decreased) for strata with an expected higher (or lower) price change variance. As an approximation, the Neyman allocation formula is applied by assuming factors of 1 (low), 1.5 (medium), and 2 (high) for the standard deviation. The result of this allocation can be found in the last column (option 3). Because the price change variance is expected to be low for item 3 (as indicated in column (c)), the sample size for this item is decreased relative to option 2, whereas the number of prices to be collected for the other items increases. Of the three options, the third aims at reducing the sampling variance while keeping the total number of prices collected unchanged. Finally, the sample size per stratum should not drop below a fixed threshold, to avoid formula bias or to guarantee sufficient precision if stratum indices are published separately. Totals are shown in bold.

**Table 4.4. Different Allocation Strategies**

| Item | Weight (percent) | Price Change Variance | Number of Prices: Option 1 | Number of Prices: Option 2 | Number of Prices: Option 3 |
|---|---|---|---|---|---|
| (a) | (b) | (c) | (d) | (e) | (f) |
| Item 1 | 10 | Medium | 20 | 10 percent × 100 = 10 | 11 |
| Item 2 | 40 | Medium | 20 | 40 | 43 |
| Item 3 | 25 | Low | 20 | 25 | 18 |
| Item 4 | 20 | Medium | 20 | 20 | 21 |
| Item 5 | 5 | High | 20 | 5 | 7 |
| **Total** | **100** | | **100** | **100** | **100** |
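The arithmetic behind the three options in Table 4.4 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names `allocate` and `round_to_total` are hypothetical, unit costs are assumed identical unless supplied, and largest-remainder rounding is an assumption made here so that the rounded sample sizes always sum to the total.

```python
from math import floor, sqrt


def round_to_total(shares, total):
    """Round fractional sample sizes so they sum exactly to `total`
    (largest-remainder rounding, an assumption made for this sketch)."""
    sizes = [floor(s) for s in shares]
    shortfall = total - sum(sizes)
    # Hand the leftover prices to the strata with the largest fractional parts.
    by_remainder = sorted(range(len(shares)),
                          key=lambda k: shares[k] - sizes[k], reverse=True)
    for k in by_remainder[:shortfall]:
        sizes[k] += 1
    return sizes


def allocate(weights, std_factors, total, unit_costs=None):
    """Allocate `total` prices proportionally to w_k * s_k / sqrt(c_k)."""
    if unit_costs is None:
        unit_costs = [1.0] * len(weights)  # identical costs in all strata
    scores = [w * s / sqrt(c)
              for w, s, c in zip(weights, std_factors, unit_costs)]
    scale = total / sum(scores)
    return round_to_total([score * scale for score in scores], total)


weights = [10, 40, 25, 20, 5]                              # column (b)
factor = {"Low": 1.0, "Medium": 1.5, "High": 2.0}          # factors from the text
variance = ["Medium", "Medium", "Low", "Medium", "High"]   # column (c)

option_1 = [100 // len(weights)] * len(weights)            # equal allocation
option_2 = allocate(weights, [1.0] * len(weights), 100)    # proportional to weight
option_3 = allocate(weights, [factor[v] for v in variance], 100)  # weight x std. dev.

print(option_1)
print(option_2)
print(option_3)
```

Run as written, the three lists reproduce columns (d), (e), and (f) of the table. Passing a `unit_costs` list, for example with higher values for prices that are "difficult to collect," would shift observations toward the cheaper strata.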

**4.92** To design optimal samples, it must be well understood how the different sampling stages contribute to the CPI variance. For example, consider a two-stage sampling design, where outlets are sampled first and then varieties are selected within the sampled outlets. The sampling variance of the price index can be decomposed into a sum of two terms that are linked to these two stages. The first term relates to the variance of the price changes between the outlets, and the second term relates to the variance of the price changes of the varieties that are available within an outlet.

**4.93** If the price changes between the outlets are rather homogeneous but the price changes of the different varieties that are sold in an outlet are more diverse, then the preferred strategy is to select a smaller number of outlets and a larger sample of varieties within each outlet. In addition, such a strategy is cost-efficient as it is often less costly to sample an additional price in an already selected outlet than to increase the outlet sample. For example, in some countries, car prices do not vary much across outlets for the same model and specification, but there are many models with different prices. One car retailer could easily provide prices for several models during the same collection period.
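The trade-off between outlets and varieties can be illustrated with a deliberately simplified variance model. The sketch below assumes a textbook two-stage form in which the variance is a between-outlet term divided by the number of outlets plus a within-outlet term divided by the total number of prices, with equal numbers of varieties per outlet and no finite-population correction. The function names, variances, and costs are all hypothetical figures chosen for illustration.

```python
def two_stage_variance(between_var, within_var, n_outlets, n_varieties):
    """Simplified variance of a mean price change under two-stage sampling
    (equal numbers of varieties per outlet, no finite-population correction)."""
    outlet_term = between_var / n_outlets
    variety_term = within_var / (n_outlets * n_varieties)
    return outlet_term + variety_term


def collection_cost(n_outlets, n_varieties, cost_per_outlet, cost_per_price):
    """Total cost: a fixed cost per visited outlet plus a cost per price."""
    return n_outlets * (cost_per_outlet + n_varieties * cost_per_price)


# Hypothetical figures: homogeneous outlets, diverse varieties within outlets.
between_var, within_var = 0.5, 16.0
cost_per_outlet, cost_per_price = 10.0, 1.0

# Each design is (number of outlets, varieties priced per outlet).
designs = {"many outlets, few varieties": (40, 3),
           "few outlets, many varieties": (20, 16)}

results = {}
for name, (m, k) in designs.items():
    cost = collection_cost(m, k, cost_per_outlet, cost_per_price)
    var = two_stage_variance(between_var, within_var, m, k)
    results[name] = (cost, var)
    print(f"{name}: cost={cost:.0f}, variance={var:.4f}")
```

Both designs cost the same under these assumed figures, but the design with fewer outlets and more varieties per outlet yields the lower variance, because the savings on outlet visits pay for many more price observations where the dispersion actually lies.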

**4.94** The time dimension must also be considered for price collection. If prices vary a lot over time, it is preferable to collect more prices for the same product during a month. For example, petrol prices normally show little variation among outlets on the same day, whereas they can vary considerably within a month. It would thus make sense to have a relatively small sample of outlets but to observe prices several times in the same month. The time dimension is also highly relevant for internet purchases. As it is probably less costly to consult websites than to visit physical outlets, it can be feasible to improve the temporal coverage of prices collected on the internet.

## Key Recommendations

When designing samples for the CPI, the index compiler must consider the different sampling stages relevant for a CPI: geographic and outlet, products, and time. All significant parts of that universe should be appropriately represented unless there are excessive costs or estimation problems involved in doing so.

The CPI aggregation structure is organized according to a product classification, and possibly stratified by region and by outlet type.

Within each stratum, the collected prices are typically the result of multistage sampling designs. Price collection locations and items are defined beforehand. Outlets that sell the items of interest must then be chosen in the selected locations. The specific varieties to be priced may only be selected when the price collector visits the outlet.

Either probability or nonprobability sampling techniques can be used. Probability sampling is the preferred option if detailed sampling frames that include data on size (for example, total sales) are available.

Samples should be reasonably optimized, based on at least a rudimentary analysis of sampling variance. Different options can be considered:

The best method would be to multiply each weight by a measure of price change dispersion in the group. Variance and cost considerations together call for allocations where relatively many products are measured per outlet and relatively few outlets are contained in the sample.

The second-best method would be to set the sample sizes approximately proportional to the weights of the product groups.

^{1} For a full treatment of the subject, please refer to one of the many textbooks available, for example, Särndal and others (1992) or Cochran (1977).

^{2} The order of the sampling stages can differ. For instance, varieties could be selected centrally and then outlets could be selected to price these varieties. Alternatively, outlets could be selected first, and the choice of the specific varieties to be priced is made only when the price collector visits the outlet.