A lot of different technologies have emerged in the wake of the cloud computing revolution, many of which are of interest to those trying to leverage big data. The next sections examine a few of these new technologies.
Using serverless computing to execute data science
When we talk about serverless computing, the term serverless is quite misleading because the computing does indeed take place on a server. Serverless computing really refers to computing that's executed in a cloud environment rather than on your desktop or on-premises at your company. The physical host server exists, but it's 100 percent supported by the cloud computing provider retained by you or your company.
One great tragedy of modern-day data science is the amount of time data scientists spend on non-mission-critical tasks like data collection, data cleaning and reformatting, data operations, and data integration. By most estimates, only 10 percent of a data scientist's time is spent on predictive model building; the rest is spent preparing the data and the data infrastructure for that mission-critical task they've been retained to complete. Serverless computing has been a game-changer for the data science industry because it reduces the time data scientists must spend preparing data and infrastructure before they can get to work on their predictive models.
Earlier in this chapter, I talk a bit about SaaS. Serverless computing offers something similar, known as Function as a Service (FaaS): a containerized cloud computing service that makes it much faster and simpler to execute code and predictive functions directly in a cloud environment, without the need to set up complicated infrastructure around that code. With serverless computing, your data science model runs directly within its container, as a sort of stand-alone function, and your cloud service provider handles all the provisioning and adjustments the infrastructure needs in order to support your functions.
Examples of popular serverless computing solutions are AWS Lambda, Google Cloud Functions, and Azure Functions.
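To make this concrete, here's a minimal sketch, in Python, of what a predictive function might look like when deployed on AWS Lambda. The handler signature (a function that accepts event and context arguments) is Lambda's standard Python convention; the model file name, the feature payload, and the scikit-learn-style model are hypothetical stand-ins for whatever you'd bundle with your own deployment package.

import json
import pickle

# Load the trained model once, when the container starts, so that
# repeated invocations of the function can reuse it. "model.pkl" is
# a hypothetical file bundled with the deployment package.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def lambda_handler(event, context):
    # Lambda passes the request payload in as the event dictionary.
    features = event["features"]  # for example, [5.1, 3.5, 1.4, 0.2]
    prediction = model.predict([features])[0]
    # Return an HTTP-style response that the caller can consume.
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": str(prediction)}),
    }

Notice that nothing here provisions or manages a server: you upload the function, and the provider runs it on demand.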
Containerizing predictive applications within Kubernetes
Kubernetes is an open-source software suite that manages, orchestrates, and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. One particularly attractive feature of Kubernetes is that you can run it on clusters that sit on-premises, in the cloud, or in a hybrid cloud environment.
Kubernetes' chief focus is helping software developers build and scale apps quickly. Though it provides a fault-tolerant, extensible environment for deploying and scaling predictive applications in the cloud, it also requires quite a bit of data engineering expertise to set those applications up correctly.
A system is fault tolerant if it is built to continue successful operations despite the failure of one or more of its subcomponents. This requires redundancy in computing nodes. A system is described as extensible if it is flexible enough to be extended or shrunk in size without disrupting its operations.
To overcome this obstacle, the Kubernetes community released Kubeflow, a machine learning toolkit that makes it simple for data scientists to deploy predictive models directly within Kubernetes containers, without the need for outside data engineering support.
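As a rough illustration, here's a minimal sketch of a Kubeflow pipeline written with the kfp Python SDK (version 2 style). The component body, the pipeline name, and the learning-rate parameter are hypothetical; the point is that you describe your workflow as ordinary Python, and the toolkit handles containerizing and orchestrating the steps on the cluster.

from kfp import compiler, dsl

@dsl.component
def train_model(learning_rate: float) -> str:
    # A hypothetical training step; a real component would fit a
    # model and write it out to shared storage.
    print(f"Training with learning rate {learning_rate}")
    return "model-v1"

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    # Each component runs in its own container on the Kubernetes cluster.
    train_model(learning_rate=learning_rate)

# Compile the pipeline into a spec that Kubeflow Pipelines can execute.
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")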
Sizing up popular cloud-warehouse solutions
You have a number of products to choose from when it comes to cloud-warehouse solutions. The following list looks at the most popular options:
Amazon Redshift: A popular big data warehousing service that runs atop data sitting within the Amazon Cloud, Redshift is most notable for the incredible speed at which it handles data analytics and business intelligence workloads. Because it runs on the AWS platform, Redshift's fully managed data warehousing service has the capacity to support petabyte-scale cloud storage requirements. If your company is already using other AWS services, like Amazon EMR, Amazon Athena, or Amazon Kinesis, Redshift is the natural choice because it integrates nicely with your existing technology. Redshift offers both pay-as-you-go and on-demand pricing structures that you can explore further on its website: https://aws.amazon.com/redshift
Parallel processing refers to a powerful framework in which data is processed very quickly because the work is distributed across multiple nodes in a system, allowing many tasks to run simultaneously.
Snowflake: This SaaS solution provides powerful, parallel-processing analytics capabilities for both structured and semistructured data stored in the cloud on Snowflake’s servers. Snowflake provides the ultimate 3-in-1 with its cost-effective big data storage, analytical processing capabilities, and all the built-in cloud services you might need. Snowflake integrates well with analytics tools like Tableau and Qlik, as well as with traditional big data technologies like Apache Spark, Pentaho, and Apache Kafka, but it wouldn’t make sense if you’re already relying mostly on Amazon services. Pricing for the Snowflake service is based on the amount of data you store as well as on the execution time for compute resources you consume on the platform.
Google BigQuery: Touted as a serverless data warehouse solution, BigQuery is a relatively cost-effective option for generating analytics from big data sources stored in the Google Cloud. Similar to Snowflake and Redshift, BigQuery provides fully managed cloud services that make it fast and simple for data scientists and analytics professionals to use the tool without assistance from in-house data engineers, and analytics can be generated on petabyte-scale data. BigQuery integrates with Google Data Studio, Power BI, Looker, and Tableau for ease of use when it comes to post-analysis data storytelling. Pricing is based on the amount of data you store as well as on the compute resources you consume, as measured by the amount of data your queries scan on the platform. For a feel of how little setup a BigQuery query requires, see the sketch that follows this list.
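To give you a sense of what "fully managed" means in practice, here's a minimal sketch that runs a query with Google's google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical, and the snippet assumes you've already authenticated to Google Cloud; notice that there's no cluster to size or server to provision.

from google.cloud import bigquery

# The client picks up your credentials and project from the environment.
client = bigquery.Client()

# A hypothetical aggregation over a hypothetical events table.
query = """
    SELECT user_id, COUNT(*) AS event_count
    FROM `my-project.analytics.events`
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
"""

# BigQuery provisions and scales the compute behind this call on its own.
for row in client.query(query).result():
    print(row.user_id, row.event_count)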
Introducing NoSQL databases
A traditional RDBMS isn’t equipped to handle big data demands. That’s because it’s designed to handle only relational datasets constructed of data that’s stored in clean rows and columns and thus is capable of being queried via SQL. RDBMSs are incapable of handling unstructured and semistructured data. Moreover, RDBMSs simply lack the processing and handling capabilities that are needed for meeting big data volume-and-velocity requirements.
This is where NoSQL comes in. NoSQL databases are nonrelational, distributed database systems designed to rise to the challenges of storing and processing big data, and they can run on-premises or in a cloud environment. NoSQL databases step beyond the traditional relational database architecture to offer a much more scalable, efficient solution. They facilitate non-SQL querying of nonrelational, schema-free data, whether semistructured or unstructured. In this way, NoSQL databases are able to handle the structured, semistructured, and unstructured data sources that are common in big data systems.
A key-value pair is a pair of data items, represented by a key and a value. The key is a data item that acts as the record identifier and the value is the data that’s identified (and retrieved) by its respective key.
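In Python terms, a key-value store behaves much like a dictionary. Here's a minimal sketch of the idea, with hypothetical keys and values:

# Keys identify records; values are the data retrieved by those keys.
store = {
    "user:1001": {"name": "Ada", "plan": "pro"},
    "user:1002": {"name": "Grace", "plan": "free"},
}

# Retrieval is a direct lookup by key: no joins, no fixed schema.
print(store["user:1001"])  # {'name': 'Ada', 'plan': 'pro'}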
NoSQL offers four categories of nonrelational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these types of data structures, it provides efficient storage and retrieval functionality for most types of nonrelational data. This adaptability and efficiency make NoSQL an increasingly popular choice for handling big data and for overcoming the processing challenges that come along with it.
NoSQL databases like Apache Cassandra and MongoDB are used for data storage and real-time processing.
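As a quick taste of how a document database handles schema-free data, here's a minimal sketch using MongoDB's pymongo driver. The connection string, database, and collection names are hypothetical, and the snippet assumes a MongoDB server is reachable at that address.

from pymongo import MongoClient

# Connect to a hypothetical local MongoDB server.
client = MongoClient("mongodb://localhost:27017")
collection = client["analytics"]["clickstream"]

# Documents in the same collection don't need identical fields,
# because there's no fixed schema to define or migrate.
collection.insert_one({"user": "u1001", "action": "click", "page": "/home"})
collection.insert_one({"user": "u1002", "action": "search", "query": "nosql"})

# Query by any field; no table definition is required.
for doc in collection.find({"user": "u1001"}):
    print(doc)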