Hadoop and Spark
$1500 USD
About this course
Big data has revolutionized technology and shows no signs of slowing down. Most organizations now rely on it, which has created a large skills gap. Big data skills can be the bridge between your present career and your dream position. If you are looking to make the transition into big data, the time is now.
Hadoop is an ecosystem of tools and can be learned by professionals from different backgrounds. It is highly recommended for software developers and architects, project managers, ETL and data warehousing professionals, DBAs and other database professionals, and analytics and business intelligence professionals.
Number of jobs in the U.S. – 29,000+ (Source: LinkedIn)
U.S. National Average salary – $124,518/year (Source: ZipRecruiter)
This comprehensive course covers Hadoop extensively, starting with an introduction to the basics, followed by complete hands-on training on the Linux OS. It then covers HDFS, MapReduce, SQL, Scala, Spark programming, Kafka, and Spark on AWS in depth.
Prerequisite – None required.
Introduction to Big data & Hadoop (1.5 hours)
• What is Big data?
• Sources of Big data
• Categories of Big data
• Characteristics of Big data
• Use-cases of Big data
• Traditional RDBMS vs. Hadoop
• What is Hadoop?
• History of Hadoop
• Understanding Hadoop Architecture
• Fundamentals of HDFS (Blocks, NameNode, DataNode, Secondary NameNode)
• Block Placement & Rack Awareness
• HDFS Read/Write
• Drawbacks of Hadoop 1.x
• Introduction to Hadoop 2.x
• High Availability
Linux (Complete Hands-on) (1 hour)
• Making/creating directories
• Removing/deleting directories
• Print working directory
• Change directory
• Manual pages
• Help
• Vi editor
• Creating empty files
• Creating file contents
• Copying files
• Renaming files
• Removing files
• Moving files
• Listing files and directories
• Displaying file contents
HDFS (1 hour)
• Understanding Hadoop configuration files
• Hadoop Components: HDFS, MapReduce
• Overview of Hadoop Processes
• Overview of Hadoop Distributed File System
• The building blocks of Hadoop
• Hands-On Exercise: Using HDFS commands
MapReduce (1.5 hours)
• MapReduce 1 (MRv1)
  o MapReduce Introduction
  o How MapReduce works
  o Communication between JobTracker and TaskTracker
  o Anatomy of a MapReduce Job Submission
• MapReduce 2 (YARN)
  o Limitations of the Current Architecture
  o YARN Architecture
  o NodeManager & ResourceManager
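To give a feel for how a MapReduce job works, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word count. It is plain Python, not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Run the three phases in sequence; a real cluster runs them in parallel across nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big skills", "data skills gap"]))
```

On a real cluster the map and reduce functions run as separate tasks on different nodes, and the shuffle step moves data across the network; this sketch only shows the data flow.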
SQL (Complete Hands-on) (5 hours)
• DDL Commands
  o Create DB
  o Create table
  o Alter table
  o Drop table
  o Truncate table
  o Rename table
• DML Commands
  o Insert command
  o Update command
  o Delete command
• SQL Constraints
  o NOT NULL
  o UNIQUE
  o PRIMARY KEY
  o FOREIGN KEY
  o CHECK
• Aggregate functions
  o AVG()
  o COUNT()
  o FIRST()
  o LAST()
  o MAX()
  o MIN()
  o SUM()
• Scalar functions
  o LOWER() / LCASE()
  o UPPER() / UCASE()
  o MID()
• Joins
  o Cross join
  o Inner join
  o Outer join
  o Left Outer join
  o Right Outer join
• Views
• Indexes
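As a small taste of several of these topics together, the sketch below runs DDL, constraints, DML, aggregate functions, and a left outer join against an in-memory SQLite database from Python. SQLite is an assumption made so the example is self-contained; the course may use a different database, and the `department` and `employee` tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces FOREIGN KEY constraints when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# DDL: create two tables with NOT NULL, UNIQUE, PRIMARY KEY, CHECK and FOREIGN KEY.
conn.execute("""
    CREATE TABLE department (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    )""")
conn.execute("""
    CREATE TABLE employee (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL CHECK (salary > 0),
        dept_id INTEGER,
        FOREIGN KEY (dept_id) REFERENCES department (id)
    )""")

# DML: insert, update and delete rows.
conn.executemany(
    "INSERT INTO department (id, name) VALUES (?, ?)",
    [(1, "Data"), (2, "Ops")],
)
conn.executemany(
    "INSERT INTO employee (id, name, salary, dept_id) VALUES (?, ?, ?, ?)",
    [(1, "Asha", 100.0, 1), (2, "Ben", 80.0, 1), (3, "Chen", 90.0, None)],
)
conn.execute("UPDATE employee SET salary = 120.0 WHERE name = 'Asha'")
conn.execute("DELETE FROM employee WHERE name = 'Chen'")

# Aggregates over a left outer join: departments with no employees still appear.
rows = conn.execute("""
    SELECT d.name, COUNT(e.id), AVG(e.salary)
    FROM department AS d
    LEFT OUTER JOIN employee AS e ON e.dept_id = d.id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

for row in rows:
    print(row)
```

The left outer join keeps the department that has no employees, so its count is zero and its average is NULL (`None` in Python); an inner join would drop that row.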
Scala (Complete Hands-on) (12 hours)
• Set up Java and JDK
• Install Scala with IntelliJ IDE
• Develop Hello World Program using Scala
• Introduction to Scala
• REPL Overview
• Declaring Variables
• Programming Constructs
• Code Blocks
• Scala Functions - Getting Started
• Scala Functions - Higher Order and Anonymous Functions
• Scala Functions - Operators
• Object Oriented Constructs - Getting Started
• Object Oriented Constructs - Objects
• Object Oriented Constructs - Classes
• Object Oriented Constructs - Companion Objects and Case Class
• Operators and Functions on Classes
• External Dependencies and Import
• Scala Collections - Getting Started
• Mutable and Immutable Collections
• Sequence (Seq) - Getting Started
• Linear Seq vs. Indexed Seq
• Scala Collections - Primitive Operations
• Scala Collections - Sorting Data
• Scala Collections - Grouping Data
• Scala Collections - Set
• Scala Collections - Map
• Tuples in Scala
• Development Cycle - Developing Source code
• Development Cycle - Compile source code to jar using SBT
• Development Cycle - Set up SBT on Windows
• Development Cycle - Compile changes and run jar with arguments
• Development Cycle - Set up IntelliJ with Scala
• Development Cycle - Develop Scala application using SBT in IntelliJ
Getting started with Spark (Complete Hands-on) (6 hours)
• What is Apache Spark & Why Spark?
• Spark History
• Unification in Spark
• Spark ecosystem vs. Hadoop
• Spark with Hadoop
• Introduction to Spark’s Python and Scala Shells
• Spark Standalone Cluster Architecture and its application flow
Programming with RDDs, DFs & DSs (Complete Hands-on) (12 hours)
• RDD Basics and characteristics, Creating RDDs
• RDD Operations
• Transformations
• Actions
• RDD Types
• Lazy Evaluation
• Persistence (Caching)
• Module: Advanced Spark programming
• Accumulators and Fault Tolerance
• Broadcast Variables
• Custom Partitioning
• Dealing with different file formats
• Hadoop Input and Output Formats
• Connecting to diverse Data Sources
• Module: Spark SQL
• Linking with Spark SQL
• Initializing Spark SQL
• DataFrames & Caching
• Case Classes, Inferred Schema
• Loading and Saving Data
• Apache Hive
• Data Sources/Parquet
• JSON
• Spark SQL User Defined Functions (UDFs)
Kafka & Spark Streaming (Complete Hands-on) (5 hours)
• Getting started with Kafka
• Understanding Kafka Producer and Consumer APIs
• Deep dive into producer and consumer APIs
• Ingesting Web Server logs into Kafka
• Getting started with Spark Streaming
• Getting started with HBase
• Integrating Kafka, Spark Streaming, and HBase
Spark on Amazon Web Services (AWS) (Complete Hands-on) (5 hours)
• Introduction
• Sign up for AWS account
• Set up Cygwin on Windows
• Quick Preview of Cygwin
• Understand Pricing
• Create first EC2 Instance
• Connecting to EC2 Instance
• Understanding EC2 dashboard left menu
• Different EC2 Instance states
• Describing EC2 Instance
• Using elastic IPs to connect to EC2 Instance
• Using security groups to provide security to EC2 Instance
• Understanding the concept of bastion server
• Terminating EC2 Instance and releasing all the resources
• Create security credentials for AWS account
• Setting up AWS CLI in Windows
• Creating S3 bucket
• Deleting root access keys
• Enable MFA for root account
• Introduction to IAM users and customizing sign-in link
• Create first IAM user
• Create group and add user
• Configure IAM password policy
• Understanding IAM best practices
• AWS managed policies and creating custom policies
• Assign policy to entities (user and/or group)
• Creating role for EC2 trusted entity with permissions on S3
• Assigning role to EC2 instance
• Introduction to EMR
• EMR concepts
• Prerequisites before setting up EMR cluster
• Setting up data sets
• Set up EMR with Spark cluster using quick options
• Connecting to EMR cluster
• Submitting Spark job on EMR cluster
• Validating the results
• Terminating EMR Cluster