Administrative Records for Survey Methodology. Group of authors
data at the individual level, the matter of linkage errors due to imperfect proxy identifiers can be relevant in many other situations.
In a frame that is constructed by combining multiple population datasets, one can often find several related classification variables. Identification errors (Zhang 2012) arise if the classification variables, or the relationships between them, are derived incorrectly from the input datasets. For instance, the variable address is central for population and household statistics. Multiple addresses can be collected by combining the Population Register with resident address, the Post Register with postal address, the Higher-Education Student Register with term-time address, the various Utility datasets with occupant address, etc. Each person may be assigned a unique de jure address based on all these sources, in a way that is judged to be most appropriate, which would then yield a proxy variable for the de facto address that is of interest in many socio-economic statistics.
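One simple way to picture the assignment of a single address from several sources is a fixed priority rule. The sketch below is only an illustration: the source names, the ranking in `SOURCE_PRIORITY`, and the function `assign_address` are all hypothetical, and the text does not say that any statistical office uses this particular rule.

```python
# Hypothetical ranking of sources, from most to least trusted for the
# de jure address. This ordering is an assumption made for illustration.
SOURCE_PRIORITY = [
    "population_register",
    "student_register",
    "utility",
    "post_register",
]


def assign_address(addresses_by_source, priority=SOURCE_PRIORITY):
    """Return (source, address) from the highest-ranked source that has one.

    addresses_by_source maps a source name to an address string (or None).
    Returns (None, None) if no source provides an address.
    """
    for source in priority:
        address = addresses_by_source.get(source)
        if address:
            return source, address
    return None, None


# A person with no Population Register address but two other candidates.
person = {"post_register": "PO Box 1", "utility": "2 Elm St"}
chosen_source, chosen_address = assign_address(person)
```

Whatever rule is chosen, the resulting address remains a proxy: it may still differ from the de facto address, which is exactly the identification error discussed above.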
The economic activity classification, e.g. NACE in Europe, is a well-known example in business statistics. The NACE code in the Business Register is generally a proxy of the target “pure” economic classification that has its root in the System of National Accounts. Several issues contribute to this fact, such as inconsistent operational rules of the Business Register, misreporting, lack of updating, etc. It is common in sample surveys to observe that, for some units, the NACE code based on the updated survey returns differs from the existing one in the Business Register. Such domain classification error is a kind of identification error. See, e.g. Brion and Gros (2015) for an example of how the matter is dealt with in the French Structural Business Surveys, and Van Delden, Scholtus, and Burger (2016) for an analysis of the NACE-classification errors in the Dutch context.
For survey data, the statistical unit can be identified in fieldwork. Based on register data, however, it is sometimes necessary to construct a proxy for the statistical unit of interest, in which case unit errors may be unavoidable even if all the input data are error-free. For instance, consider the register-based household. Provided all dwellings (or addresses) in the Population Register are correct, one may define a dwelling household to consist of all the persons who de jure share the same dwelling. We do not consider such a dwelling household to be a constructed statistical unit, precisely because it can be obtained directly from error-free input data. Saying that the input data are perfect in this sense is another way of saying that there are no identification errors. An example of a constructed unit in this context is the living household, which does not have to include everyone registered at the same dwelling, nor is it limited to them. A constructed living household is in error if two persons in different living households are placed in the same constructed living household, or if two persons in the same living household are placed in different constructed living households.
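The two kinds of error in a constructed living household can be checked pair by pair. The following is a minimal sketch, assuming that the true living households are known for comparison (for example from an audit sample, which is an assumption here, not something the text provides). The function name `pairwise_unit_errors` and the data layout are hypothetical.

```python
from itertools import combinations


def pairwise_unit_errors(true_hh, constructed_hh):
    """Compare two household assignments person pair by person pair.

    Both arguments map a person id to a household id (hypothetical layout).
    Returns two lists of person pairs:
      wrongly_merged: different true households, same constructed household
      wrongly_split:  same true household, different constructed households
    """
    wrongly_merged, wrongly_split = [], []
    for p, q in combinations(sorted(true_hh), 2):
        same_true = true_hh[p] == true_hh[q]
        same_constructed = constructed_hh[p] == constructed_hh[q]
        if same_constructed and not same_true:
            wrongly_merged.append((p, q))
        elif same_true and not same_constructed:
            wrongly_split.append((p, q))
    return wrongly_merged, wrongly_split


# Toy data: a and b truly live together; c and d do not.
true_hh = {"a": 1, "b": 1, "c": 2, "d": 3}
constructed_hh = {"a": 10, "b": 11, "c": 12, "d": 12}
merged, split = pairwise_unit_errors(true_hh, constructed_hh)
```

In this toy case the pair (c, d) is wrongly merged and the pair (a, b) is wrongly split, so both error types occur even though every person is assigned to exactly one constructed household.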
Whether or not the unit is constructed, unit errors can arise from a lack of data as well as from errors in the data. Zhang (2011) devises a mathematical representation of unit error. It is assumed that each statistical unit of interest can consist of one or several so-called base units, but never cuts across a base unit. For example, the person can be the base unit for the household. The mapping from the set of base units to the set of statistical units can then be specified in terms of an allocation matrix, where each element takes the value 1 or 0 depending on whether or not the corresponding base unit (arranged by column) belongs to the statistical unit (arranged by row). In the case where a base unit can be assigned to one and only one statistical unit, as when a person can belong to only one household, every column sum of the allocation matrix is equal to 1. Zhang (2011) develops a unit error theory for household statistics. Although unit error is clearly one of the most fundamental difficulties in business statistics, a statistical theory has so far been lacking. This may be partly due to the prominence of the identification error mentioned above. Another important reason may simply be the lack of a commonly acknowledged choice of base unit in business statistics.
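The allocation matrix can be made concrete with a small example. The sketch below follows the description above (rows are statistical units, columns are base units), but the function name `allocation_matrix`, the identifiers, and the toy income values are hypothetical.

```python
def allocation_matrix(assignment, base_units, stat_units):
    """Build the 0/1 allocation matrix as a list of rows.

    Rows correspond to statistical units (households) and columns to
    base units (persons). assignment maps each base unit to one unit.
    """
    return [
        [1 if assignment[b] == s else 0 for b in base_units]
        for s in stat_units
    ]


persons = ["p1", "p2", "p3", "p4"]
households = ["h1", "h2"]
assignment = {"p1": "h1", "p2": "h1", "p3": "h1", "p4": "h2"}

A = allocation_matrix(assignment, persons, households)

# Each person belongs to exactly one household, so every column sums to 1.
column_sums = [sum(row[j] for row in A) for j in range(len(persons))]

# Row sums give household sizes.
household_sizes = [sum(row) for row in A]

# Multiplying A by a person-level vector gives household totals.
income = [10, 20, 0, 5]  # toy person-level values
household_income = [sum(a * y for a, y in zip(row, income)) for row in A]
```

A unit error then corresponds to an allocation matrix that differs from the true one, and the effect on household-level statistics can be traced through products such as `household_income`.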
1.2.2 Measurement
Consider now the measurement side in Figure 1.1. Relevance error refers to the discrepancy between the target measure that may be a theoretical construct and the measure that is achievable based on the available data. In a widespread scenario for combining register and survey data, the survey variable is treated as the target measure and the register proxy an auxiliary variable, which can be used either to adjust the survey sampling weights or to build a prediction model of the survey variable.
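As one illustration of using a register proxy to adjust survey weights, the sketch below applies post-stratification: design weights are rescaled within each register class so that they sum to the known register count for that class. Post-stratification is chosen here only as a familiar example of such an adjustment; the function name, the class labels, and all numbers are hypothetical.

```python
def poststratified_total(sample, register_counts):
    """Estimate a population total with post-stratified weights.

    sample: list of (register_class, design_weight, survey_value) tuples.
    register_counts: known population count per register class.
    """
    # Sum of design weights within each register class.
    weight_sum = {}
    for g, w, _ in sample:
        weight_sum[g] = weight_sum.get(g, 0.0) + w

    total = 0.0
    for g, w, y in sample:
        # Rescale so that adjusted weights in class g sum to the register count.
        adjusted_w = w * register_counts[g] / weight_sum[g]
        total += adjusted_w * y
    return total


# Survey value: 1 if employed according to the survey, 0 otherwise.
sample = [
    ("employed", 10.0, 1),
    ("employed", 10.0, 1),
    ("employed", 10.0, 0),
    ("not_employed", 10.0, 0),
    ("not_employed", 10.0, 1),
]
register_counts = {"employed": 45, "not_employed": 15}
estimate = poststratified_total(sample, register_counts)
```

Note that the survey value, not the register class, is what gets totalled: the register proxy only shapes the weights, in line with its role as an auxiliary variable.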
Sometimes, however, all the available measures entail relevance error, regardless of the source of the data, and there does not exist a way in which they can be combined to derive the target measure directly. For instance, Meijer, Rohwedder, and Wansbeek (2012) adopt such a viewpoint and study earnings data from register and survey sources using a mixture model approach, whereas Pavlopoulos and Vermunt (2015) apply latent class models to analyze income-based labor market mobility. It is also possible to formulate an adjusted measure as the solution of an appropriately defined constrained optimization problem, without explicitly introducing a model that spells out the relationship between the true measure and the observed proxy measures. For instance, Mushkudiani, Daalmans, and Pannekoek (2014) apply such an approach to Census aggregated tables and to the turnover variable from different sources.
Mapping error due to reclassification of input register data is highly common, since a register proxy variable often arises by means of reclassification. For instance, inferring the mother tongue from the birth country is a reclassification of the input variable birth country to the outcome variable mother tongue. As another example, to classify someone receiving unemployment benefit as unemployed is to reclassify the input variable benefit or not to the outcome variable unemployed or not. Such examples are numerous.
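The unemployment example can be written out as a small reclassification rule, together with a count of how often the rule disagrees with the target classification. This is a minimal sketch: the function names and the toy data are hypothetical, and the target values are assumed to be known (for instance from a linked survey) purely for the sake of illustration.

```python
def reclassify(benefit_flags):
    """Map the input variable 'benefit or not' to 'unemployed or not'."""
    return ["unemployed" if flag else "not unemployed" for flag in benefit_flags]


def mapping_error_rate(proxy, target):
    """Share of units whose reclassified value differs from the target."""
    mismatches = sum(1 for p, t in zip(proxy, target) if p != t)
    return mismatches / len(target)


# Toy data: the second person receives benefit but is not unemployed,
# the third is unemployed but receives no benefit.
benefit = [True, True, False, False, False]
target = [
    "unemployed",
    "not unemployed",
    "unemployed",
    "not unemployed",
    "not unemployed",
]

proxy = reclassify(benefit)
error_rate = mapping_error_rate(proxy, target)
```

Here two of the five units are misclassified, even though the input variable itself contains no mistakes: the error comes from the mapping.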
It is worth noting that mapping error may be caused by delays or mistakes in the administrative sources, even where reclassification poses no conceptual difficulty. Register data may be progressive in the sense that the observations for a particular reference time point may differ depending on when they are compiled. Following Zhang and Fosen (2012) and Zhang and Pritchard (2013), let $t$ be the reference time point of interest and $t + d$ the measurement time point, for $d \geq 0$. Let $U(t)$ and $y(t)$ be the target population and value at $t$, respectively. For a unit $i$, let $I_i(t; t + d) = 1$ if it is to be included in the target population and 0 otherwise, based on the register data available at $t + d$, and let $y_i(t; t + d)$ be the observed value for $t$ at $t + d$. The data are said to be progressive if, for $d \neq d' > 0$, one can have $I_i(t; t + d) \neq I_i(t; t + d')$ and $y_i(t; t + d) \neq y_i(t; t + d')$. Progressiveness is a distinct feature of register data compared to survey data.
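Progressiveness can be illustrated with a register whose events carry both an effective time and a recording time. The sketch below computes the inclusion indicator defined above from only those events that have been recorded by the measurement time point. The event layout, the function name `included`, and the time values are hypothetical.

```python
def included(events, unit, t, d):
    """Inclusion indicator I_i(t; t + d) for one unit.

    events: list of (unit, effective_time, recorded_time, in_population).
    Only events effective no later than t and recorded no later than t + d
    are used; the latest such event by effective time decides the status.
    """
    known = [
        e for e in events
        if e[0] == unit and e[1] <= t and e[2] <= t + d
    ]
    status = 0  # not in the population if nothing is known about the unit
    for _, _, _, flag in sorted(known, key=lambda e: (e[1], e[2])):
        status = flag
    return status


events = [
    ("i", 5, 5, 1),   # immigration effective at time 5, recorded at time 5
    ("i", 9, 12, 0),  # emigration effective at time 9, recorded late at time 12
]

at_reference_time = included(events, "i", t=10, d=0)
three_periods_later = included(events, "i", t=10, d=3)
```

With these toy events the unit is counted in the population for reference time 10 when the data are compiled at time 10, but not when they are compiled at time 13, because the emigration is registered with a delay.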
The observed proxy measures may need to be adjusted in order to satisfy micro- as well as macro-level constraints, so as to resolve incompatibility across the data sources. For instance, register data from corporate tax returns may be used to impute the missing items in the Structural Business Survey. If this results in numerical inconsistency with the items observed in the survey, then imputation or adjustment of some of the items will be necessary in order to produce a clean and coherent dataset. See e.g. Pannekoek, Shlomo, and De Waal (2013) and Pannekoek and Zhang (2015) for relevant instances.
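A micro-level adjustment of this kind can be sketched as a least-squares problem: change the imputed items as little as possible, subject to a linear constraint that ties them to an observed item. The example below uses the closed-form solution for a single equality constraint. It is a plain least-squares adjustment chosen for illustration, not the procedure of the cited papers, and the function name and numbers are hypothetical.

```python
def adjust_to_constraint(x0, a, b):
    """Least-squares adjustment of x0 to satisfy sum(a_j * x_j) == b.

    Minimises the sum of squared changes (x_j - x0_j)^2 subject to the
    single linear constraint; the solution is x0 + a * (b - a.x0) / (a.a).
    """
    residual = b - sum(aj * xj for aj, xj in zip(a, x0))
    scale = residual / sum(aj * aj for aj in a)
    return [xj + aj * scale for aj, xj in zip(a, x0)]


# Three turnover components imputed from tax data (toy values) and a
# total turnover of 100 observed in the survey.
imputed_components = [40.0, 35.0, 16.0]
coefficients = [1.0, 1.0, 1.0]  # the components must add up to the total
observed_total = 100.0

adjusted = adjust_to_constraint(imputed_components, coefficients, observed_total)
```

In this toy case the components sum to 91, so each is raised by 3 to reach the observed total. In practice one would typically weight the changes by the reliability of each item and handle several constraints at once.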
Imposing macro-level survey estimates as benchmarks, when micro-adjusting a register proxy variable, can be regarded as a means to achieve statistical relevance at the level where the unbiased benchmarks are introduced (Zhang and Giusti 2016), though one is unable to remove the relevance bias at the micro-level. The Norwegian register-based employment status provides an example of such uses of proxy variables. Initially, the register proxy variable is rule-processed based on several input administrative registers, covering employee benefit, self-employment, tax, military or civilian service, leave of absence, etc. This results in the tripartition of the target population: (I) the compatible part, where the register data are compatible across the sources and allow for unequivocal reclassification accordingly, (II) the resolved part, where reclassification can be determined after making room for administrative regulations and progressiveness of the data, (III) the unsolved part, where register data are either lacking or incompatible,