iBigData
Integrating Big Data into the Computing Curricula
This website contains the accompanying resources of the papers our team has published about the integration of Big Data technologies into the computing curricula.
Paper: SQL: From Traditional Databases to Big Data
“SQL: From Traditional Databases to Big Data”, Yasin N. Silva, Isadora Almeida, and Michell Queiroz, in proceedings of the 47th ACM Symposium on Computer Science Education (SIGCSE '16), Memphis, Tennessee, USA, 2016.
Download SIGCSE Presentation Slides (PDF)
Abstract. The Structured Query Language (SQL) is the main programming language designed to manage data stored in database systems. While SQL was initially used only with relational database management systems (RDBMS), its use has been significantly extended with the advent of new types of database systems. Specifically, SQL has been found to be a powerful query language in highly distributed and scalable systems that process Big Data, i.e., datasets with high volume, velocity and variety. While traditional relational databases represent now only a small fraction of the database systems landscape, most database courses that cover SQL consider only the use of SQL in the context of traditional relational systems. In this paper, we propose teaching SQL as a general language that can be used in a broad range of database systems from traditional RDBMSs to Big Data systems. This paper presents well-structured guidelines to introduce SQL in the context of new types of database systems including MapReduce, NoSQL and NewSQL. A key contribution of this paper is the description of an array of course resources, e.g., virtual machines, sample projects, and in-class exercises, to enable a hands-on experience with SQL across a broad set of modern database systems.
Resources
SQL in MapReduce Systems
- Hive: SQL Queries on Hadoop
- Sample script
- Link to Virtual Machine site (Cloudera)
- Meteorological station data generator (MStation2)
- Sample MStation2 dataset
- Spark: Integrating SQL and MapReduce
SQL in NoSQL Systems
- Impala: SQL to Query HBase Tables
- SlamData: SQL on MongoDB
- Sample script
- Links to download SlamData and MongoDB
- Income tax statistics dataset
SQL in NewSQL Systems
- Learning SQL with VoltDB
Paper: Integrating Big Data into the Computing Curricula
“Integrating Big Data into the Computing Curricula”, Yasin N. Silva, Suzanne W. Dietrich, Jason M. Reed, and Lisa M. Tsosie, in proceedings of the 45th ACM Symposium on Computer Science Education (SIGCSE '14), Atlanta, USA, 2014.
Download SIGCSE Presentation Slides (PDF)
Abstract. An important recent technological development in computer science is the availability of highly distributed and scalable systems to process Big Data, i.e., datasets with high volume, velocity and variety. Given the extensive and effective use of systems incorporating Big Data in many application scenarios, these systems have become a key component in the broad landscape of database systems. This fact creates the need to integrate the study of Big Data Management Systems as part of the computing curricula, particularly as part of database courses. This paper presents well-structured guidelines to perform this integration by describing the important types of Big Data systems and demonstrating how each type of system can be integrated into the curriculum. A key contribution of this paper is the description of a wide array of course resources, e.g., virtual machines, sample projects, and in-class exercises, and how these resources support the learning outcomes and enable a hands-on experience with Big Data technologies.
Resources
These are the PowerPoint slides prepared for the three Big Data learning units covered in the paper. Instructors are welcome to reuse and modify these slides.
- MapReduce (PowerPoint)
- NoSQL (PowerPoint)
- NewSQL (PowerPoint)
The following resources are available to instructors. To request them, please send an email to Yasin Silva (ysilva [at] asu [dot] edu).
- VMware Virtual Machine: This VM (openSUSE Linux, Hadoop, HBase) contains most of the resources described in the paper: sample datasets, the MStation data generator, and MapReduce and NoSQL in-class exercises with solutions.
- MapReduce Project: Additional questions for a MapReduce project assignment.
- End-of-class survey: The survey used to evaluate the extent to which the learning objectives were achieved.