Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net, and IBM's developerWorks, and has spoken at several conferences, including ApacheCon 2008 on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.
Rather than run through all possible scenarios, this pragmatic operations guide calls out what works, as demonstrated in critical deployments.
- Get a high-level overview of HDFS and MapReduce: why they exist and how they work
- Plan a Hadoop deployment, from hardware and OS selection to network requirements
- Learn setup and configuration details with a list of critical properties
- Manage resources by sharing a cluster across multiple groups
- Get a runbook of the most common cluster maintenance tasks
- Monitor Hadoop clusters, and learn troubleshooting with the help of real-world war stories
- Use basic tools and techniques to handle backup and catastrophic failure
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
- Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
- Customize data formats and storage options, from files to external databases
- Load and extract data from tables, and use queries, grouping, filtering, joining, and other conventional query methods
- Gain best practices for creating user-defined functions (UDFs)
- Learn Hive patterns you should use and anti-patterns you should avoid
- Integrate Hive with other data processing programs
- Use storage handlers for NoSQL databases and other datastores
- Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce
To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing