Official Google Cloud Certified Professional Data Engineer Study Guide. Dan Sullivan
Читать онлайн книгу.The data in the first row would be represented in a document database using a structure something like the following:
{ ’first_name’: ’Michael’, ’last_name’: ’Johnson’. ’street’_address’: ’334 Bay Rd’, ’city’: ’Santa Fe’, ’postal_code’: ’87501’ }
Since most rows in a table of names and addresses will have the same attributes, it is not necessary to use a data structure like a document structure. Consider the case of a product catalog that lists both appliances and furniture. Here is an example of how a dishwasher and a chair might be represented:
{ {’id’: ’123456’, ’product_type’: ’dishwasher’, ’length’: ’24 in’, ’width’: ’34 in’, ’weight’: ’175 lbs’, ’power’: ’1800 watts’ } {’id’:’987654’, ’product_type’: ’chair’, ’weight’: ’15 kg’, ’style’: ’modern’, ’color’: ’brown’ } }
In addition to document databases, wide-column databases, such as Bigtable and Cassandra, are also used with datasets with varying attributes.
Data Access Patterns
Data is accessed in different ways for different use cases. Some time-series data points may be read immediately after they are written, but they are not likely to be read once they are more than a day old. Customer order data may be read repeatedly as an order is processed. Archived data may be accessed less than once a year. Four metrics to consider about data access are as follows:
How much data is retrieved in a read operation?
How much data is written in an insert operation?
How often is data written?
How often is data read?
Some read and write operations apply to small amounts of data. Reading or writing a single piece of telemetry data is an example. Writing an e-commerce transaction may also entail a small amount of data. A database storing telemetry data from thousands of sensors that push data every five seconds will be writing large volumes, whereas an online transaction processing database for a small online retailer will also write small individual units of data but at a much smaller rate. These will require different kinds of databases. The telemetry data, for example, is better suited to Bigtable, with its low-latency writes, and the retailer transaction data is a good use case for Cloud SQL, with support for sufficient I/O operations to handle relational database loads.
Cloud Storage supports ingesting large volumes of data in bulk using tools such as the Cloud Transfer Service and Transfer Appliance. (Cloud Storage also supports streaming transfers, but bulk reads and writes are more common.) Data in Cloud Storage is read at the object or the file level. You typically don’t, for example, seek a particular block within a file as you can when storing a file on a filesystem.
It is common to read large volumes of data in BigQuery as well; however, in that case we often read a small number of columns across a large number of rows. BigQuery optimizes for these kinds of reads by using a columnar storage format known as Capacitor. Capacitor is designed to store semi-structured data with nested and repeated fields.
Data access patterns can help identify the best storage technology for a use case by highlighting key features needed to support those access patterns.
Security Requirements
Different storage systems will have different levels of access controls. Cloud Storage, for example, can have access controls at the bucket and the object level. If someone has access to a file in Cloud Storage, they will have access to all the data in that file. If some users have access only to a subset of a dataset, then the data could be stored in a relational database and a view could be created that includes only the data that the user is allowed to access.
Encrypting data at rest is an important requirement for many use cases; fortunately, all Google Cloud storage services encrypt data at rest.
When choosing a storage technology, the ability to control access to data is a key consideration.
Types of Structure: Structured, Semi-Structured, and Unstructured
For the purposes of choosing a storage technology, it is helpful to consider how data is structured. There are three widely recognized categories:
Structured
Semi-structured
Unstructured
These categories are particularly helpful when choosing a database.
Structured Data Structured data has a fixed set of attributes that can be modeled in a table of rows and columns.
Semi-Structured Data Semi-structured data has attributes like structured data, but the set of attributes can vary from one instance to another. For example, a product description of an appliance might include length, width, height, weight, and power consumption. A chair in the same catalog might have length, width, height, color, and style as attributes. Semi-structured data may be organized using arrays or sets of key-value pairs.
Unstructured Data Unstructured data does not fit into a tabular structure. Images and audio files are good examples of unstructured data. In between these two extremes lies semi-structured data, which has characteristics of both structured and unstructured.
Structured: Transactional vs. Analytical
Structured data can be represented in tables of columns and rows, where columns are attributes and rows are records or entities. Table 1.1 showed an example of structured data. Structured data may be oriented to transactional processes or analytical use cases.
Transactional structured data is often operated on one row at a time. For example, a business application may look up a customer’s account information from a customer table when displaying data about the customer’s shipping address. Multiple columns from a single row will be used, so it is efficient to store all row attributes together in a data block. Retrieving a single data block will retrieve all the needed data. This is a common pattern in transactional databases such as Cloud SQL and Cloud Spanner, which use row-oriented storage.
Now consider a data warehousing example. A business analyst is working with a sales data mart and wants to understand how sales last month compare to the same period last year. The data mart has one row for each product on each date, which include the following attributes in addition to product and date: number of units sold, total revenue for units sold, average unit price, average marginal revenue, and total marginal revenue. The analyst is only interested in the monthly sums of total revenue for units sold for each product. In this case, the analyst would query many rows and only three columns. Instead of retrieving the full row for all rows selected, it is more efficient to retrieve only the date, product, and total revenue of units sold columns. This is a common pattern in analytical applications and the reason why BigQuery uses a column-oriented storage mechanism.
Semi-Structured: Fully Indexed vs. Row Key Access
Semi-structured data, as noted earlier, does not follow a fixed tabular format and instead stores schema attributes along with the data. In the case of document databases, this allows developers to add attributes as needed