Data Lakehouse in Action

Data Lakehouse in Action
Author: Pradeep Menon
Publisher: Packt Publishing Ltd
Total Pages: 206
Release: 2022-03-17
Genre: Computers
ISBN: 1801815100

Propose a new scalable data architecture paradigm, Data Lakehouse, that addresses the limitations of current data architecture patterns Key FeaturesUnderstand how data is ingested, stored, served, governed, and secured for enabling data analyticsExplore a practical way to implement Data Lakehouse using cloud computing platforms like AzureCombine multiple architectural patterns based on an organization's needs and maturity levelBook Description The Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing data architecture in the right way to ensure your organization's success. The first part of the book discusses the different data architectural patterns used in the past and the need for a new architectural paradigm, as well as the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part deep dives into the different layers of Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The book's third part focuses on the practical implementation of the Data Lakehouse architecture in a cloud computing platform. It focuses on various ways to combine the Data Lakehouse pattern to realize macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level. The frameworks introduced will be practical and organizations can readily benefit from their application. By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner. What you will learnUnderstand the evolution of the Data Architecture patterns for analyticsBecome well versed in the Data Lakehouse pattern and how it enables data analyticsFocus on methods to ingest, process, store, and govern data in a Data Lakehouse architectureLearn techniques to serve data and perform analytics in a Data Lakehouse architectureCover methods to secure the data in a Data Lakehouse architectureImplement Data Lakehouse in a cloud computing platform such as AzureCombine Data Lakehouse in a macro-architecture pattern such as Data MeshWho this book is for This book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well-versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.

Databricks ML in Action

Databricks ML in Action
Author: Stephanie Rivera
Publisher: Packt Publishing Ltd
Total Pages: 280
Release: 2024-05-17
Genre: Computers
ISBN: 1800564007

Get to grips with autogenerating code, deploying ML algorithms, and leveraging various ML lifecycle features on the Databricks Platform, guided by best practices and reusable code for you to try, alter, and build on Key Features Build machine learning solutions faster than peers only using documentation Enhance or refine your expertise with tribal knowledge and concise explanations Follow along with code projects provided in GitHub to accelerate your projects Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionDiscover what makes the Databricks Data Intelligence Platform the go-to choice for top-tier machine learning solutions. Written by a team of industry experts at Databricks with decades of combined experience in big data, machine learning, and data science, Databricks ML in Action presents cloud-agnostic, end-to-end examples with hands-on illustrations of executing data science, machine learning, and generative AI projects on the Databricks Platform. You’ll develop expertise in Databricks' managed MLflow, Vector Search, AutoML, Unity Catalog, and Model Serving as you learn to apply them practically in everyday workflows. This Databricks book not only offers detailed code explanations but also facilitates seamless code importation for practical use. You’ll discover how to leverage the open-source Databricks platform to enhance learning, boost skills, and elevate productivity with supplemental resources. By the end of this book, you'll have mastered the use of Databricks for data science, machine learning, and generative AI, enabling you to deliver outstanding data products.What you will learn Set up a workspace for a data team planning to perform data science Monitor data quality and detect drift Use autogenerated code for ML modeling and data exploration Operationalize ML with feature engineering client, AutoML, VectorSearch, Delta Live Tables, AutoLoader, and Workflows Integrate open-source and third-party applications, such as OpenAI's ChatGPT, into your AI projects Communicate insights through Databricks SQL dashboards and Delta Sharing Explore data and models through the Databricks marketplace Who this book is for This book is for machine learning engineers, data scientists, and technical managers seeking hands-on expertise in implementing and leveraging the Databricks Data Intelligence Platform and its Lakehouse architecture to create data products.

Building the Data Lakehouse

Building the Data Lakehouse
Author: Bill Inmon
Publisher: Technics Publications
Total Pages: 256
Release: 2021-10
Genre:
ISBN: 9781634629669

The data lakehouse is the next generation of the data warehouse and data lake, designed to meet today's complex and ever-changing analytics, machine learning, and data science requirements. Learn about the features and architecture of the data lakehouse, along with its powerful analytical infrastructure. Appreciate how the universal common connector blends structured, textual, analog, and IoT data. Maintain the lakehouse for future generations through Data Lakehouse Housekeeping and Data Future-proofing. Know how to incorporate the lakehouse into an existing data governance strategy. Incorporate data catalogs, data lineage tools, and open source software into your architecture to ensure your data scientists, analysts, and end users live happily ever after.

Spark in Action

Spark in Action
Author: Jean-Georges Perrin
Publisher: Simon and Schuster
Total Pages: 574
Release: 2020-05-12
Genre: Computers
ISBN: 1638351309

Summary The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop. Foreword by Rob Thomas. About the technology Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem. About the book Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms. What's inside Writing Spark applications in Java Spark application architecture Ingestion through files, databases, streaming, and Elasticsearch Querying distributed datasets with Spark SQL About the reader This book does not assume previous experience with Spark, Scala, or Hadoop. About the author Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years. Table of Contents PART 1 - THE THEORY CRIPPLED BY AWESOME EXAMPLES 1 So, what is Spark, anyway? 2 Architecture and flow 3 The majestic role of the dataframe 4 Fundamentally lazy 5 Building a simple app for deployment 6 Deploying your simple app PART 2 - INGESTION 7 Ingestion from files 8 Ingestion from databases 9 Advanced ingestion: finding data sources and building your own 10 Ingestion through structured streaming PART 3 - TRANSFORMING YOUR DATA 11 Working with SQL 12 Transforming your data 13 Transforming entire documents 14 Extending transformations with user-defined functions 15 Aggregating your data PART 4 - GOING FURTHER 16 Cache and checkpoint: Enhancing Spark’s performances 17 Exporting data and building full data pipelines 18 Exploring deployment

The Enterprise Big Data Lake

The Enterprise Big Data Lake
Author: Alex Gorelik
Publisher: "O'Reilly Media, Inc."
Total Pages: 224
Release: 2019-02-21
Genre: Computers
ISBN: 1491931507

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book. Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries. Get a succinct introduction to data warehousing, big data, and data science Learn various paths enterprises take to build a data lake Explore how to build a self-service model and best practices for providing analysts access to the data Use different methods for architecting your data lake Discover ways to implement a data lake from experts in different industries

Apache Pulsar in Action

Apache Pulsar in Action
Author: David Kjerrumgaard
Publisher: Simon and Schuster
Total Pages: 398
Release: 2021-12-14
Genre: Computers
ISBN: 1617296880

Distributed applications demand reliable, high-performance messaging. The Apache Pulsar server-to-server messaging system provides a secure, stable platform without the need for a stream processing engine like Spark. Contributed by Yahoo to the Apache Foundation, Pulsar is mature and battle-tested, handling millions of messages per second for over three years at Yahoo. Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar, delivering extreme levels of speed and durability. about the technology Pulsar is a streaming messaging system designed for high performance server-to-server messaging. Built and tested under intense conditions at Yahoo, Pulsar has been proven in production and can handle millions of messages per second. Now free and open-source, Pulsar''s unique architecture helps solve some of the challenges of modern development. Pulsar avoids latency in streaming data transmission, making it a powerful tool for IoT Edge analytics. Its unified messaging model improves the performance of microservices architecture, and its tiered storage capabilities allow for larger volumes of data to be handled without fear of data loss. Pulsar''s flexible API interface works with Java, C++, Python, and Go, making it easy to incorporate Pulsar into your stack. about the book Apache Pulsar in Action is a hands-on guide to building scalable streaming messaging systems for distributed applications and microservices systems. You''ll start with Pulsar''s fundamentals, each illustrated by real-world examples, as you get to grips with Pulsar''s unique architecture. Pulsar contributor David Kjerrumgaard teaches the skills you need to deploy a Pulsar server, ingest data from third-party systems, and deploy lightweight computing logic with simple functions. You''ll learn to employ Pulsar''s seamless scalability through relatable case studies, including an IOT analytics application that can be deployed within a resource constrained environment and a microservices application based on Pulsar functions. At the end of this practical book, you''ll be ready to fully take advantage of Pulsar to create high-traffic message-driven applications. what''s inside Publish from Apache Pulsar into third-party data repositories and platforms Design and develop Apache Pulsar functions Perform interactive SQL queries against data stored in Apache Pulsar Examples of Pulsar-based microservices that you can download and try yourself about the reader Written for experienced Java developers. No prior knowledge of Pulsar is needed. about the author David Kjerrumgaard is the Director of Solution Architecture at Streamlio, and a contributor to the Apache Pulsar and Apache NiFi projects.

Data Lake for Enterprises

Data Lake for Enterprises
Author: Tomcy John
Publisher: Packt Publishing Ltd
Total Pages: 585
Release: 2017-05-31
Genre: Computers
ISBN: 1787282651

A practical guide to implementing your enterprise data lake using Lambda Architecture as the base About This Book Build a full-fledged data lake for your organization with popular big data technologies using the Lambda architecture as the base Delve into the big data technologies required to meet modern day business strategies A highly practical guide to implementing enterprise data lakes with lots of examples and real-world use-cases Who This Book Is For Java developers and architects who would like to implement a data lake for their enterprise will find this book useful. If you want to get hands-on experience with the Lambda Architecture and big data technologies by implementing a practical solution using these technologies, this book will also help you. What You Will Learn Build an enterprise-level data lake using the relevant big data technologies Understand the core of the Lambda architecture and how to apply it in an enterprise Learn the technical details around Sqoop and its functionalities Integrate Kafka with Hadoop components to acquire enterprise data Use flume with streaming technologies for stream-based processing Understand stream- based processing with reference to Apache Spark Streaming Incorporate Hadoop components and know the advantages they provide for enterprise data lakes Build fast, streaming, and high-performance applications using ElasticSearch Make your data ingestion process consistent across various data formats with configurability Process your data to derive intelligence using machine learning algorithms In Detail The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake. Style and approach The book takes a pragmatic approach, showing ways to leverage big data technologies and lambda architecture to build an enterprise-level data lake.

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse
Author: Manoj Kukreja
Publisher: Packt Publishing Ltd
Total Pages: 480
Release: 2021-10-22
Genre: Computers
ISBN: 1801074321

Understand the complexities of modern-day data engineering platforms and explore strategies to deal with them with the help of use case scenarios led by an industry expert in big data Key FeaturesBecome well-versed with the core concepts of Apache Spark and Delta Lake for building data platformsLearn how to ingest, process, and analyze data that can be later used for training machine learning modelsUnderstand how to operationalize data models in production using curated dataBook Description In the world of ever-changing data and schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on. Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You'll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you've explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you'll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you'll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way. By the end of this data engineering book, you'll know how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks. What you will learnDiscover the challenges you may face in the data engineering worldAdd ACID transactions to Apache Spark using Delta LakeUnderstand effective design strategies to build enterprise-grade data lakesExplore architectural and design patterns for building efficient data ingestion pipelinesOrchestrate a data pipeline for preprocessing data using Apache Spark and Delta Lake APIsAutomate deployment and monitoring of data pipelines in productionGet to grips with securing, monitoring, and managing data pipelines models efficientlyWho this book is for This book is for aspiring data engineers and data analysts who are new to the world of data engineering and are looking for a practical guide to building scalable data platforms. If you already work with PySpark and want to use Delta Lake for data engineering, you'll find this book useful. Basic knowledge of Python, Spark, and SQL is expected.

Data Mesh in Action

Data Mesh in Action
Author: Jacek Majchrzak
Publisher: Simon and Schuster
Total Pages: 326
Release: 2023-03-21
Genre: Computers
ISBN: 1638351848

Revolutionize the way your organization approaches data with a data mesh! This new decentralized architecture outpaces monolithic lakes and warehouses and can work for a company of any size. In Data Mesh in Action you will learn how to: Implement a data mesh in your organization Turn data into a data product Move from your current data architecture to a data mesh Identify data domains, and decompose an organization into smaller, manageable domains Set up the central governance and local governance levels over data Balance responsibilities between the two levels of governance Establish a platform that allows efficient connection of distributed data products and automated governance Data Mesh in Action reveals how this groundbreaking architecture looks for both small startups and large enterprises. You won’t need any new technology—this book shows you how to start implementing a data mesh with flexible processes and organizational change. You’ll explore both an extended case study and multiple real-world examples. As you go, you’ll be expertly guided through discussions around Socio-Technical Architecture and Domain-Driven Design with the goal of building a sleek data-as-a-product system. Plus, dozens of workshop techniques for both in-person and remote meetings help you onboard colleagues and drive a successful transition. About the technology Business increasingly relies on efficiently storing and accessing large volumes of data. The data mesh is a new way to decentralize data management that radically improves security and discoverability. A well-designed data mesh simplifies self-service data consumption and reduces the bottlenecks created by monolithic data architectures. About the book Data Mesh in Action teaches you pragmatic ways to decentralize your data and organize it into an effective data mesh. You’ll start by building a minimum viable data product, which you’ll expand into a self-service data platform, chapter-by-chapter. You’ll love the book’s unique “sliders” that adjust the mesh to meet your specific needs. You’ll also learn processes and leadership techniques that will change the way you and your colleagues think about data. What's inside Decompose an organization into manageable domains Turn data into a data product Set up central and local governance levels Build a fit-for-purpose data platform Improve management, initiation, and support techniques About the reader For data professionals. Requires no specific programming stack or data platform. About the author Jacek Majchrzak is a hands-on lead data architect. Dr. Sven Balnojan manages data products and teams. Dr. Marian Siwiak is a data scientist and a management consultant for IT, scientific, and technical projects. Table of Contents PART 1 FOUNDATIONS 1 The what and why of the data mesh 2 Is a data mesh right for you? 3 Kickstart your data mesh MVP in a month PART 2 THE FOUR PRINCIPLES IN PRACTICE 4 Domain ownership 5 Data as a product 6 Federated computational governance 7 The self-serve data platform PART 3 INFRASTRUCTURE AND TECHNICAL ARCHITECTURE 8 Comparing self-serve data platforms 9 Solution architecture design

Data Mesh

Data Mesh
Author: Zhamak Dehghani
Publisher: "O'Reilly Media, Inc."
Total Pages: 387
Release: 2022-03-08
Genre: Computers
ISBN: 1492092363

Many enterprises are investing in a next-generation data lake, hoping to democratize data at scale to provide business insights and ultimately make automated intelligent decisions. In this practical book, author Zhamak Dehghani reveals that, despite the time, money, and effort poured into them, data warehouses and data lakes fail when applied at the scale and speed of today's organizations. A distributed data mesh is a better choice. Dehghani guides architects, technical leaders, and decision makers on their journey from monolithic big data architecture to a sociotechnical paradigm that draws from modern distributed architecture. A data mesh considers domains as a first-class concern, applies platform thinking to create self-serve data infrastructure, treats data as a product, and introduces a federated and computational model of data governance. This book shows you why and how. Examine the current data landscape from the perspective of business and organizational needs, environmental challenges, and existing architectures Analyze the landscape's underlying characteristics and failure modes Get a complete introduction to data mesh principles and its constituents Learn how to design a data mesh architecture Move beyond a monolithic data lake to a distributed data mesh.