Frank Kane's Taming Big Data with Apache Spark and Python

Frank Kane's Taming Big Data with Apache Spark and Python
Author: Frank Kane
Publisher: Packt Publishing Ltd
Total Pages: 289
Release: 2017-06-30
Genre: Computers
ISBN: 1787288307

Frank Kane's hands-on Spark training course, based on his bestselling Taming Big Data with Apache Spark and Python video, now available in a book. Understand and analyze large data sets using Spark on a single system or on a cluster. About This Book Understand how Spark can be distributed across computing clusters Develop and run Spark jobs efficiently using Python A hands-on tutorial by Frank Kane with over 15 real-world examples teaching you Big Data processing with Spark Who This Book Is For If you are a data scientist or data analyst who wants to learn Big Data processing using Apache Spark and Python, this book is for you. If you have some programming experience in Python, and want to learn how to process large amounts of data using Apache Spark, Frank Kane's Taming Big Data with Apache Spark and Python will also help you. What You Will Learn Find out how you can identify Big Data problems as Spark problems Install and run Apache Spark on your computer or on a cluster Analyze large data sets across many CPUs using Spark's Resilient Distributed Datasets Implement machine learning on Spark using the MLlib library Process continuous streams of data in real time using the Spark streaming module Perform complex network analysis using Spark's GraphX library Use Amazon's Elastic MapReduce service to run your Spark jobs on a cluster In Detail Frank Kane's Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you'll soon move on to analyzing large data sets using Spark RDD, and developing and running effective Spark jobs quickly using Python. Apache Spark has emerged as the next big thing in the Big Data domain – quickly rising from an ascending technology to an established superstar in just a matter of years. Spark allows you to quickly extract actionable insights from large amounts of data, on a real-time basis, making it an essential tool in many modern businesses. Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease. Style and approach Frank Kane's Taming Big Data with Apache Spark and Python is a hands-on tutorial with over 15 real-world examples carefully explained by Frank in a step-by-step manner. The examples vary in complexity, and you can move through them at your own pace.

Learning Spark

Learning Spark
Author: Holden Karau
Publisher: "O'Reilly Media, Inc."
Total Pages: 276
Release: 2015-01-28
Genre: Computers
ISBN: 144935906X

This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. You'll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.--

Apache Spark Essential Training

Apache Spark Essential Training
Author:
Publisher:
Total Pages:
Release: 2017
Genre:
ISBN:

Get up to speed with Spark, and discover how to leverage this powerful platform to efficiently and effectively work with big data.

Learning Spark

Learning Spark
Author: Jules S. Damji
Publisher: O'Reilly Media
Total Pages: 400
Release: 2020-07-16
Genre: Computers
ISBN: 1492050016

Data is bigger, arrives faster, and comes in a variety of formats—and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you’ll be able to: Learn Python, SQL, Scala, or Java high-level Structured APIs Understand Spark operations and SQL Engine Inspect, tune, and debug Spark operations with Spark configurations and Spark UI Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka Perform analytics on batch and streaming data using Structured Streaming Build reliable data pipelines with open source Delta Lake and Spark Develop machine learning pipelines with MLlib and productionize models using MLflow

Essential PySpark for Scalable Data Analytics

Essential PySpark for Scalable Data Analytics
Author: Sreeram Nudurupati
Publisher: Packt Publishing Ltd
Total Pages: 322
Release: 2021-10-29
Genre: Data mining
ISBN: 1800563094

Get started with distributed computing using PySpark, a single unified framework to solve end-to-end data analytics at scale Key FeaturesDiscover how to convert huge amounts of raw data into meaningful and actionable insightsUse Spark's unified analytics engine for end-to-end analytics, from data preparation to predictive analyticsPerform data ingestion, cleansing, and integration for ML, data analytics, and data visualizationBook Description Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework. Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas. By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems. What you will learnUnderstand the role of distributed computing in the world of big dataGain an appreciation for Apache Spark as the de facto go-to for big data processingScale out your data analytics process using Apache SparkBuild data pipelines using data lakes, and perform data visualization with PySpark and Spark SQLLeverage the cloud to build truly scalable and real-time data analytics applicationsExplore the applications of data science and scalable machine learning with PySparkIntegrate your clean and curated data with BI and SQL analysis toolsWho this book is for This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who are already using data analytics to explore distributed and scalable data analytics. Basic to intermediate knowledge of the disciplines of data engineering, data science, and SQL analytics is expected. General proficiency in using any programming language, especially Python, and working knowledge of performing data analytics using frameworks such as pandas and SQL will help you to get the most out of this book.

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse
Author: Manoj Kukreja
Publisher: Packt Publishing Ltd
Total Pages: 480
Release: 2021-10-22
Genre: Computers
ISBN: 1801074321

Understand the complexities of modern-day data engineering platforms and explore strategies to deal with them with the help of use case scenarios led by an industry expert in big data Key FeaturesBecome well-versed with the core concepts of Apache Spark and Delta Lake for building data platformsLearn how to ingest, process, and analyze data that can be later used for training machine learning modelsUnderstand how to operationalize data models in production using curated dataBook Description In the world of ever-changing data and schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on. Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You'll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you've explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you'll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you'll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way. By the end of this data engineering book, you'll know how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks. What you will learnDiscover the challenges you may face in the data engineering worldAdd ACID transactions to Apache Spark using Delta LakeUnderstand effective design strategies to build enterprise-grade data lakesExplore architectural and design patterns for building efficient data ingestion pipelinesOrchestrate a data pipeline for preprocessing data using Apache Spark and Delta Lake APIsAutomate deployment and monitoring of data pipelines in productionGet to grips with securing, monitoring, and managing data pipelines models efficientlyWho this book is for This book is for aspiring data engineers and data analysts who are new to the world of data engineering and are looking for a practical guide to building scalable data platforms. If you already work with PySpark and want to use Delta Lake for data engineering, you'll find this book useful. Basic knowledge of Python, Spark, and SQL is expected.

Databricks Certified Associate Developer for Apache Spark Using Python

Databricks Certified Associate Developer for Apache Spark Using Python
Author: Saba Shah
Publisher: Packt Publishing Ltd
Total Pages: 274
Release: 2024-06-14
Genre: Computers
ISBN: 1804616206

Learn the concepts and exercises needed to get certified as a Databricks Associate Developer for Apache Spark 3.0 and validate your skills as a Spark expert with an industry-recognized credential Key Features Understand the fundamentals of Apache Spark to help you design robust and fast Spark applications Delve into various data manipulation components for each phase of your data engineering project Prepare for the certification exam with sample questions and mock exams, and get closer to your goal Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionWith extensive data being collected every second, computing power cannot keep up with this pace of rapid growth. To make use of all the data, Spark has become a de facto standard for big data processing. Migrating data processing to Spark will not only help you save resources that will allow you to focus on your business, but also enable you to modernize your workloads by leveraging the capabilities of Spark and the modern technology stack for creating new business opportunities. This book is a comprehensive guide that lets you explore the core components of Apache Spark, its architecture, and its optimization. You’ll become familiar with the Spark dataframe API and its components needed for data manipulation. Next, you’ll find out what Spark streaming is and why it’s important for modern data stacks, before learning about machine learning in Spark and its different use cases. What’s more, you’ll discover sample questions at the end of each section along with two mock exams to help you prepare for the certification exam. By the end of this book, you’ll know what to expect in the exam and how to pass it with enough understanding of Spark and its tools. You’ll also be able to apply this knowledge in a real-world setting and take your skillset to the next level.What you will learn Create and manipulate SQL queries in Spark Build complex Spark functions using Spark UDFs Architect big data apps with Spark fundamentals for optimal design Apply techniques to manipulate and optimize big data applications Build real-time or near-real-time applications using Spark Streaming Work with Apache Spark for machine learning applications Who this book is for This book is for you if you’re a professional looking to venture into the world of big data and data engineering, a data professional who wants to endorse your knowledge of Spark, or a student. Although working knowledge of Python is required, no prior Spark knowledge is needed. Additionally, experience with Pyspark will be beneficial.