Managing Data Quality. Tim King
a fast journey), whilst the same weather is bad for other people (the construction company trying to erect a new offshore wind farm). Similarly, if someone states that they have poor quality data, this can be difficult to interpret without a better way of describing the nature of the data; as such, it is useful to use appropriate characteristics to measure data quality.
These considerations lead to the need for more detail on the ‘fitness for purpose’ of data and data characteristics.
Fitness for purpose
In quality management, the term ‘quality’ is an assessment of whether an item or activity conforms with the requirements for it.
For example, a metal shaft used in the assembly of a machine is specified to have a diameter of 12.2 mm ± 0.015 mm, along with many other requirements (e.g. length, material, surface finish, etc.). If one of these shafts were measured with a diameter of 12.196 mm, it would be deemed to have passed the quality test of assessing diameter, because the measurement lies within the permitted range of 12.185 mm to 12.215 mm. The shaft is a physical item that cannot easily serve another purpose.
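The diameter test in the shaft example can be sketched in a few lines of Python. This is only an illustration: the function name `within_tolerance` and the second, failing measurement of 12.220 mm are assumptions introduced here, not taken from the text.

```python
from decimal import Decimal


def within_tolerance(measured, nominal, tolerance):
    """Return True if measured lies within nominal ± tolerance (inclusive)."""
    return abs(measured - nominal) <= tolerance


# Nominal diameter and tolerance from the shaft example, in millimetres.
# Decimal is used so that values at the limits are not affected by
# binary floating-point rounding.
NOMINAL_MM = Decimal("12.2")
TOLERANCE_MM = Decimal("0.015")

# The measurement from the example: 0.004 mm below nominal, so it passes.
print(within_tolerance(Decimal("12.196"), NOMINAL_MM, TOLERANCE_MM))  # True

# A hypothetical measurement 0.020 mm above nominal, so it fails.
print(within_tolerance(Decimal("12.220"), NOMINAL_MM, TOLERANCE_MM))  # False
```

The check covers only the diameter requirement; a real inspection would apply a similar test to each of the other specified requirements.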
for example, contains formatting information to ensure that the information is correctly displayed. There will, however, be little consistency between different documents (or messages), nor will it be easy to identify issues within the body of a document from a data quality perspective.
Sentiment analysis tools can be used to infer the general mood of a collection of messages based on identified key words and phrases. This, though, is not the same as assessing the quality of the data. From a data quality management perspective, the approaches defined in this book can easily be applied to the metadata of semi-structured data, but understanding the quality of the ‘body’ of documents and messages will be more challenging.
In contrast, data in an enterprise context will often support multiple business processes. In such circumstances, an item of data will have to comply with multiple requirements simultaneously in order to be viewed as good quality data. For instance, the moment when an asset is formally commissioned needs to be known to the nearest year for long-term planning purposes, to the nearest week for maintenance planning purposes and to the nearest day for work management activities.
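As a rough sketch of how one item of data can satisfy some purposes but not others, the commissioning-date example can be expressed in Python. The names `Precision`, `REQUIREMENTS`, `unmet_purposes` and `fit_for_all_purposes` are hypothetical, and the sketch assumes that a date recorded to a finer precision also satisfies any coarser requirement.

```python
from enum import IntEnum


class Precision(IntEnum):
    """Precision of a recorded date; a higher value means a finer precision."""
    YEAR = 1
    WEEK = 2
    DAY = 3


# Required precision for each purpose, taken from the commissioning-date example.
REQUIREMENTS = {
    "long-term planning": Precision.YEAR,
    "maintenance planning": Precision.WEEK,
    "work management": Precision.DAY,
}


def unmet_purposes(recorded):
    """List the purposes that need a finer precision than the recorded one."""
    return [purpose for purpose, needed in REQUIREMENTS.items() if recorded < needed]


def fit_for_all_purposes(recorded):
    """Return True only if the recorded precision meets every requirement."""
    return not unmet_purposes(recorded)


# A date known only to the nearest week fails one of the three purposes.
print(unmet_purposes(Precision.WEEK))        # ['work management']
print(fit_for_all_purposes(Precision.DAY))   # True
```

Under these assumptions, the same recorded value is good quality for long-term and maintenance planning yet poor quality for work management, which is why the strictest applicable requirement determines whether the data are fit for every purpose.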
So, given that fitness for purpose is specified by a set of applicable requirements, the key consideration becomes identifying which characteristics of data are covered by those requirements.
Data characteristics
There have been various attempts to specify all the relevant quality characteristics of data but, in fact, none of these attempts covers a complete set of characteristics. Part of the problem is that different specialists describe data requirements from different perspectives.
The end user is mainly concerned with the ultimate effect of the data, so, for example, accuracy and completeness are key considerations.
The data modeller wants to know which attributes are mandatory for each entity (i.e. must contain a value in each data set) and which are optional.
The database administrator thinks about a data set as the tables and columns in the database. For each table, the administrator needs to know, for example, which columns are foreign keys and which column in which table contains the target of the foreign key.
These perspectives are brought together by ISO 8000-8, which builds on fundamental computer science to create a definitive overall framework for the characteristics and requirements of data. This framework identifies the three types of data quality as being:
syntactic (i.e. the correct format for the data);
semantic (i.e. the consistent common interpretation of the data);
pragmatic (i.e. the data will be useful to intended recipients).
These three types can appear to be abstract, so a more popular approach is to work with data quality dimensions. Again, many different lists exist of such dimensions and none is perfect, but we find this one most useful (DAMA UK 2013):
accuracy;
completeness;
consistency;
validity;
timeliness;
uniqueness.
Table 1.1, using children’s toy bricks, illustrates how to use these data quality dimensions to identify appropriate requirements for data.
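Table 1.1 An example data set
ID  | Type    | Length | Width | Height | Colour | Studs  | Purchase Date | Cost
----+---------+--------+-------+--------+--------+--------+---------------+------
010 | Wood    | 59.5   | 29.0  | 29.0   | Yellow | -      |               |
012 | Wood    | 59.5   | 28.9  | 28.9   |        | N/A    | 01-09-2001    | £8.42
014 | Plastic | 79.8   | 31.8  | 9.6    | Black  | 10 × 4 |               |
015 | Plastic | 31.8   | 15.8  | 11.4   | Blue   | 4 × 2  | 12-23-91      | £2
044 | Plastic | 47.8   | 7.8   | 9.6    | Grey   | 6 × 1  | 27/4/14       | £7.12
045 | Wood    | 60.0   | 29.5  | 28.6   | Yellow |        | 15/7/15       | £4.21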
Accuracy: Whether the data reflect the real object they represent. For example, looking at the records in Table 1.1, by inspecting the real object (the bricks) we can confirm that brick 045 is a yellow wooden block with the dimensions L 60 × W 29.5 × H 28.6. If the real object turns out to be a green brick or to have different dimensions from those in the table, then the data are inaccurate.
Completeness: Whether all relevant items are recorded and all their attributes are populated. For example, the attributes for brick 010 are not complete. Similarly, if the toy box contains a brick 017, the list of bricks is not complete.
Consistency: Whether an entity recorded in more than one data store is comparable across data stores. For example, brick 012 has a purchase date of 01-09-2001, but in the purchasing system the transaction date is 04-12-2001. If