Spark / R Programming for Data Scientists and Analysts
Course Details:
Spark is a highly optimized data-processing environment that runs on Hadoop YARN, with support for machine learning through MLlib and Mahout, as well as SQL, DataFrames, and streaming. In this course, you'll dive into the details of practical data science on the Spark platform, including real-world integration with other systems in modern data science environments.
Call (919) 283-1674 to get a class scheduled online or in your area!
Getting Started - Overview
- Our Data and our problem set
- Accessing the cluster, the data, and the tools
- The Continuous Workshop approach
- "Let's build a model together"
- Focus on analysis, exploration, data munging, algorithms
- Tooling and fundamentals as necessary to get the job done
Spark Introduction
- Data Science: The State of the Art
- Hadoop, Yarn, and Spark
- Architectural Overview
- MLlib Overview
- Accessing HDFS data
- Lab Focus
- Working with HDFS data
- Distributed vs. Local Run Modes
- Spark vs. Other tools (when is Spark the right tool for the job?)
- Spark vs. SAS
- Spark Languages (Java, R, Python, and Scala)
- Hello, Spark
Spark Overview
- Spark Core
- Spark SQL
- Spark and Hive
- Lab
- MLlib
- Spark Streaming
- Spark API
DataFrames
- DataFrames and Resilient Distributed Datasets (RDDs)
- Partitions
- Adding variables to a DataFrame
- DataFrame Types
- DataFrame Operations
- Dependent vs. Independent variables
- Map/Reduce with DataFrames
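The DataFrame topics above can be previewed with a minimal sparklyr sketch. This is illustrative only, assuming sparklyr and a local Spark installation are available; the table name and derived variable are invented for the example:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed locally)
sc <- spark_connect(master = "local")

# Copy a local R data frame into Spark as a distributed DataFrame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Add a derived variable; the computation is pushed down to Spark
mtcars_tbl <- mtcars_tbl %>%
  mutate(power_to_weight = hp / wt)

spark_disconnect(sc)
```

Under the hood, the dplyr verbs are translated to Spark SQL and executed against the DataFrame's partitions, rather than pulling the data into R.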
Spark SQL
- Spark SQL Overview
- Data stores: HDFS, Cassandra, HBase, Hive, and S3
- Table Definitions
- Queries
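As a sketch of the Spark SQL workflow covered here, a registered table can be queried with plain SQL through the DBI interface. This assumes a sparklyr connection to a local Spark instance; note that sparklyr replaces dots in column names (e.g. `Sepal.Width`) with underscores when copying data in:

```r
library(sparklyr)
library(DBI)

sc <- spark_connect(master = "local")

# Register the iris data as a Spark table named "iris"
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Query it with Spark SQL; columns are renamed Sepal.Width -> Sepal_Width
avg_widths <- dbGetQuery(sc, "
  SELECT Species, AVG(Sepal_Width) AS avg_sepal_width
  FROM iris
  GROUP BY Species
")

spark_disconnect(sc)
```

The same pattern applies to tables backed by HDFS, Hive, or other supported data stores once they are registered with the Spark session.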
Spark MLlib
- MLlib overview
- MLlib Algorithms Overview
- Classification Algorithms
- Regression Algorithms
- Lab Focus
- Brief Comparison to SAS
- Working with train/test splits and tuning regression models
- Decision Trees and forests
- Lab Focus
- Brief Comparison to SAS
- Stepwise approach to Decision Trees
- Working with Exit Criteria
- Recommendation with ALS
- Clustering Algorithms
- Lab Focus
- Key Clustering Algorithms
- Choosing Clustering Algorithms
- Working with key algorithms
- Machine Learning Pipelines
- Linear Algebra (SVD, PCA)
- Statistics in MLlib
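The regression and train/test-split topics above might look like the following sparklyr sketch. The split proportions, seed, and model formula are arbitrary choices for illustration, and a local Spark installation is assumed:

```r
library(sparklyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Split the data into training and test sets (proportions are illustrative)
partitions <- sdf_random_split(mtcars_tbl,
                               training = 0.75,
                               test     = 0.25,
                               seed     = 42)

# Fit a linear regression with Spark MLlib via the formula interface
fit <- ml_linear_regression(partitions$training, mpg ~ wt + hp)

# Inspect coefficients and fit statistics
summary(fit)

spark_disconnect(sc)
```

The same `ml_*` family of functions covers the classification, clustering, and recommendation (ALS) algorithms listed above, and they can be chained into ML Pipelines.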
Spark Streaming
- Streaming overview
Streaming with Kafka
- Kafka overview
- Kafka and Spark Streaming
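A minimal sketch of consuming a Kafka topic from Spark Structured Streaming via sparklyr is shown below. The broker address and topic name are placeholders, and the Spark session must have the Kafka integration package available; this is an assumption-laden outline, not a complete deployment:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Read a stream from Kafka; "kafka:9092" and "events" are placeholder
# broker and topic names for this sketch
stream <- stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "kafka:9092",
    subscribe = "events"
  )
)

# Write the stream to an in-memory table for interactive inspection
snapshot <- stream_write_memory(stream, name = "events_snapshot")
```

In practice, a production flow would write to a durable sink (HDFS, a database, or another Kafka topic) rather than memory.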
Data Flow with NiFi
- Apache NiFi overview
- NiFi data flows with Spark/R
Cluster Mode
- Standalone Cluster
- Masters and Workers
Spark - the Big Picture
- Spark in Real-Time and near-Real-Time Decision Support Systems
- Spark in the Enterprise
- Best Practices
*Please Note: Course Outline is subject to change without notice. Exact course outline will be provided at time of registration.
Join an engaging hands-on learning environment, where you’ll learn:
- The essentials of Spark architecture and applications
- How to execute Spark Programs
- How to create and manipulate RDDs (Resilient Distributed Datasets) and DataFrames
- How to integrate machine learning into Spark applications
- How to use Spark Streaming
This “skills-centric” course is approximately 50% hands-on lab and 50% lecture, with engaging instruction, demos, group discussions, labs, and project work. It is designed to train attendees in core R programming and data analytics skills, coupling the most current, effective techniques with sound industry practices. Throughout the course, students are led through a series of progressively advanced topics, where each topic consists of lecture, group discussion, comprehensive hands-on lab exercises, and lab review.
Before attending this course, you should have:
- Basic R programming experience
- Basic knowledge of Statistics and Probability
- Data Science background
This course is designed for Data Scientists and Data Analysts.