Big Data Analysis with Hadoop and RHadoop




Online Course: Big Data analysis with Hadoop and RHadoop

19 - 20 October 2022, 13:00 - 17:00 CEST


Course for academia, industry, public sector and general public
Organised by VSC Research Center (TU Wien) in cooperation with EuroCC Austria,
EuroCC Slovakia and EuroCC Slovenia
Language: English
Location: Zoom
Price: free


This training course will focus on the foundations of “Big Data” processing by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach and Rmpi. Although online, the course will be hands-on, allowing participants to work interactively on real data in the High Performance Computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.

The training event will consist of two 4-hour sessions on two consecutive days. The first day will focus on big data management and data analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and to the Hadoop distributed file system, and (ii) perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day will focus on big data management and analysis using R and RHadoop. We will first work within RStudio and write all scripts in R using several state-of-the-art libraries for parallel computations, such as parallel, doParallel, foreach and Rmpi, and libraries for working with Hadoop, such as rmr, rhdfs and rhbase. Finally, we will show how to run parallel Slurm jobs with R scripts.
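As a rough illustration of the MapReduce model covered on the first day, the sketch below counts words in plain Python. It is not taken from the course material: the function names (mapper, shuffle, reducer, word_count) and the sample input are assumptions made for this example, and everything runs in memory on one machine, whereas Hadoop distributes the same three steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Sort & shuffle step: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Reduce step: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines read from HDFS.
    sample = ["big data with hadoop", "big data with r"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps would typically live in separate scripts that read from standard input and write to standard output, with Hadoop handling the sort and shuffle in between.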

Agenda

19 October 2022 (Day 1)

13:00   Introduction
13:15   Introduction to Hadoop
           – Introduction to Big Data 
           – The Hadoop Distributed Computing Architecture
           – First hands-on exercise on the cluster
14:00   Break
14:15   HDFS
           – The Hadoop Distributed File System: 
              blocks, partitions, load balancing, replication/erasure coding, fault tolerance, data locality
           – Hands-on example: managing data on HDFS
15:00   Break
15:15   MapReduce (MR)
           – Explaining the MR computing model
           – Split/map/sort & shuffle/combine/reduce
           – Hands-on demos
16:00   Break
16:15   Hands-on exercise with MR
17:00   End

20 October 2022 (Day 2) 

13:00   Introduction to Day 2
13:15   Introduction to R
           – Connecting to RStudio web server at HPC@UL
           – Creating and running your own R scripts
           – Creating, retrieving, saving data files
           – Standard data management operations on data frames
           – Data management with dplyr, magrittr
14:00   Break
14:15   Advanced and Big Data management with R
           – Data manipulations with apply functions apply, lapply, sapply, vapply, tapply and mapply
           – Big Data management and analysis using one computing node with functions
              for efficient parallel loops parLapply, parSapply, mclapply and foreach-dopar
15:00   Break
15:15   Big Data management and analysis with Rmpi and RHadoop
           – Big Data management and analysis using many computing nodes and the Rmpi library
           – Preparing and storing big data in HDFS using the rhdfs library
           – Retrieving and managing big data in HDFS with the plyrmr and rhdfs libraries
16:00   Break
16:15   Big data analysis with RHadoop
           – Preparing map-reduce scripts to perform basic data analysis tasks
              (extreme values, counts, mean values, dispersions, visualisations) using the rhdfs library
17:00   Wrap-up
17:05   End

Prerequisites

For the first day: basic Linux shell commands, Python
For the second day: basic Linux shell commands and R

Hands-on labs

The participants will need a local machine to connect to the supercomputers at the University of Ljubljana and to the Vienna Scientific Cluster. Before the start of the course they will get training accounts on these supercomputers for running all examples.

Lecturers

Janez Povh (University of Ljubljana, Slovenia)
Lucia Absalon Bautista (University of Ljubljana, Slovenia)
Giovanna Roda (EuroCC Austria, BOKU, and TU Wien, Austria)
Liana Akobian (TU Wien, Austria)

Prices and eligibility

This course is offered free of charge via EuroCC Austria, EuroCC Slovakia, and EuroCC Slovenia, and it is open to participants from academia and industry from the Member States (MS) of the European Union (EU) and Associated/Other Countries to the Horizon 2020 programme.

Registration

Register via EuroCC Slovenia: https://indico.ijs.si/event/1522/ 

Course material

The slides will be available from the start of the course on the EuroCC Slovenia course page, and the exercises will be prepared for you on the HPC clusters.

Contact: eurocc@sling.si

 

