Data Lakes For Dummies. Alan R. Simon

Читать онлайн книгу.

Data Lakes For Dummies - Alan R. Simon


Скачать книгу
on structured data. Structured data is commonly stored in a relational database such as Microsoft SQL Server, MySQL, or Oracle Database.

Schematic illustration of the different types of data in data lake.

      FIGURE 1-3: Different types of data in your data lake.

       A data type, such as INTEGER, DECIMAL, CHARACTER, DATE, DATETIME, or something similar

       The size of the field, either explicitly declared (for example, how many characters a CHARACTER column will contain) or implicitly declared (the system-defined maximum number for an INTEGER or how a DATE column is structured)

       Any specific rules that apply to a data column or field, such as the permissible range of values (for example, a customer’s age must be between 18 and 130) or a list of allowable values (for example, an employee’s current status can only be FULL-TIME, PART-TIME, TERMINATED, or RETIRED)

       Any additional constraints, such as primary and foreign key designations, or referential integrity (rules that specify consistency for certain columns across multiple database tables)

      Unstructured data: A picture may be worth ten million words

      Unstructured data is, by definition, data that lacks a formally defined structure. Images (such as JPEGs), audio (such as MP3s), and videos (such as MP4s or MOVs) are common forms of unstructured data.

      Semi-structured data: Stuck in the middle of the lake

      Semi-structured data sort of falls in between structured and unstructured data. Examples include a blog post, a social media post, text messages, an email message, or a message from Slack or Microsoft Teams. Leaving aside any embedded or attached images or videos for a moment, all these examples consist of a long string of letters, numbers, and special characters. However, there’s no particular structure assigned to most of these text strings other than perhaps a couple of lines of heading information. The body of an email may be very short — only a line or two — while another email can go on for many long paragraphs.

      In your data lake, you need to have all these types of data sitting side by side. Why? Because you’ll be running analytics against the data lake that may need more than one form of data. For example, you receive and then analyze a detailed report of sales by department in a large department store during the past month.

      Then, after noticing a few anomalies in the sales numbers, you pull up in-store surveillance video to analyze traffic versus sales to better understand how many customers may be looking at merchandise but deciding not to make a purchase. You can even combine structured data from scanners with your unstructured video data as part of your analysis.

      If you had to go to different data storage environments for your sales results (structured data) and then the video surveillance (unstructured data), your overall analysis is dramatically slowed down, especially if you need to integrate and cross-reference different types of data. With a data lake, all this data is sitting side by side, ready to be delivered for analysis and decision-making.

      

In their earliest days, relational databases only stored structured data. Later, they were extended with capabilities to store structured and unstructured data. Binary large objects (BLOBs) were a common way to store images and even video in a relational database. However, even an object-extended relational database doesn’t make a good platform for a data lake when compared with modern data services such as Amazon S3 or Microsoft ADLS.

      Different water, different data

      A common misconception is that you store “all your data” in your data lake. Actually, you store all or most of your analytic data in a data lake. Analytic data is, as you may suspect from the name, data that you’re using for analytics. In contrast, you use operational data to run your business.

      But you also want to analyze that data, right? You want to understand which products are selling the best and where. You want to understand which customers are spending the most. You have dozens or even hundreds of questions you want to ask about your customers and their purchasing activity.

Here’s the catch: You need to make copies of your operational data for the deep analysis that you need to undertake; and the copies of that operational data are what goes into the data lake (see Figure 1-4).

Schematic illustration of the source applications feeding data into your data lake.

      FIGURE 1-4: Source applications feeding data into your data lake.

      Wait a minute! Why in the world do you need to copy data into your data lake? Why can’t you just analyze the data right where it is, in the source applications and their databases?

      Consequently, you need to make copies of the operational data that you want for analytical purposes and store that data in your data lake. Think of the data inside your data lake as (in used-car terminology) previously owned data that has been refurbished and is now ready for a brand-new owner.

      But if you can’t adequately do complex analytics directly from source applications and their databases, what about this idea: Run your applications off your data lake instead! This way, you can avoid having to copy your data, right? Unfortunately, that idea won’t work, at least with today’s technology.

      

Operational applications almost always use a relational database, which manages concurrency control among their users and applications. In simple terms, hundreds or even thousands of users can add new data and make changes to a relational database without interfering with each other’s work and corrupting the database. A data lake, however, is built on storage technology that is optimized for retrieving data for analysis and doesn’t support concurrency control for update operations.

      Many vendors are working on new technology that will allow you to build a data lake for operational, as well as analytical purposes. This technology is still a bit down the road from full operational viability. For the time being, you’ll build a data lake by copying data from many different source applications.

      Refilling the data lake

      What exactly does “copying


Скачать книгу