COSC-282 Big Data Analytics - Fall 2015
Department of Computer Science
Georgetown University

Course Description:

Today, information retrieval and Web search technologies play a central role in information seeking and knowledge distribution across the globe. The growth of the Web and improvements in data creation, collection, and use have led to a tremendous increase in the amount and complexity of the data that a search engine needs to handle. "Big data" presents challenges to search engines from three perspectives: bigger data volume, higher data complexity, and faster data change rate. The increase in the magnitude and complexity of the data has become a major driver for new information retrieval algorithms and technologies that are scalable, highly interactive, and able to handle complex and dynamic information seeking tasks in the big data era.

In this class, we will focus on information retrieval algorithms and programming for Big Data. We will cover programming models that allow us to easily distribute computations across large computer clusters. In particular, we will teach Apache Spark, an open-source cluster computing framework that has quickly become the state of the art for big data programming. In contrast to Hadoop's MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. Spark provides clean APIs in Java, Scala, Python and R. This course will provide an introduction to Spark, focusing specifically on search engine design and programming with Spark and Scala, and on "thinking at scale". We will also cover other components in the Spark ecosystem, such as machine learning with MLlib.
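
As a rough illustration of the programming style described above, the sketch below counts words using a map step followed by a reduce-by-key step, the pattern that MapReduce and Spark both build on. It is plain, single-process Python and does not use Spark; the helper names flat_map, reduce_by_key and word_count are hypothetical and chosen only for this example. In a cluster framework, each step would be applied to partitions of the data spread across machines.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    """Count word occurrences with a map step followed by a reduce-by-key step."""
    # Map: split every line into lowercase words.
    words = flat_map(lambda line: line.lower().split(), lines)
    # Map: emit a (word, 1) pair for each word.
    pairs = [(word, 1) for word in words]
    # Reduce: sum the counts that share the same word.
    return reduce_by_key(lambda a, b: a + b, pairs)


if __name__ == "__main__":
    docs = ["big data search", "search engines handle big data"]
    print(word_count(docs))
```

The map and reduce functions here have no shared state, which is what lets a framework run them on different machines and combine the partial results afterwards.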

Prerequisites:
  • COSC-160 Data Structures
Time and Location:

Monday, Wednesday 11:00-12:15. Reiss 112

Instructor: Prof. Grace Hui Yang
TAs: Jiyun Luo, Hongkai Wu
Textbooks:

  • Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia. O'Reilly Media Press. 2014.

    Here is the Amazon link to this book.

  • Information Retrieval. Implementing and Evaluating Search Engines. Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack. The MIT Press. July 2010.

    Here is the Amazon link to this book.

    Other Readings:

    Selected papers or book chapters will be made available before lectures.

    Grading: Homeworks 70% (7% each), Quizzes 5%, Midterm exam 10%, Final exam 15%. Optional Homework 10%.
    Resources: Piazza link: https://piazza.com/georgetown/fall2015/cosc282/home
    Blackboard link: https://uis.georgetown.edu/services/blackboard
    Policies: Homework policy: All homeworks should be submitted through Blackboard. Homeworks are due at 11:59pm on the due date. Three late days in total are allowed without penalty for the entire semester. For instance, you may be late by 1 day for homework 1 and by 2 days for homework 2. Once the three late days are used, you will be penalized according to the policy below:
    • a penalty of 50% will be applied for homework submitted within the next 24 hours.
    • zero credit will be assigned after that.
    All homeworks (even with zero credit) must be turned in to pass the course.

    Integrity policy: All experimental results turned in must be true. No copying/cheating is allowed. Please check Georgetown's Honor system.

    Syllabus
    Date | Class | Readings | Slides | Notes
    9-2 | Introduction; Install Spark and Get Started; Scala Crash Course | Ch1; Ch2; Scala Tutorials | slides | Assignment 1 out. Due 9/9.
    9-7 | Labor Day. No class.
    9-9 | More Scala; Programming with RDD | Ch3 | slides | Assignment 2 out. Due 9/16.
    9-14 | Programming with RDD; Loading and Saving your data | Ch3; Ch5 | slides
    9-16 | Programming with RDD; Map | Ch3 | slides | Assignment 3 out. Due 9/23.
    9-21 | Programming with RDD; Transformation | Ch3 | slides
    9-23 | Programming with RDD; Actions | Ch3 | slides | Assignment 4 out. Due 10/7.
    9-28, 9-30 | No class.
    10-5 | Text Processing; Bag of Words; Lemmatization | IR Ch3 | slides
    10-7 | Advanced Spark Programming | Ch6 | slides | Assignment 5 out. Due 10/14.
    10-12 | No class.
    10-14 | Machine Learning with MLlib; Vectors; XML Parser | Ch11 | slides | Assignment 6 out. Due 10/21.
    10-19 | Query and Documents; Vector Space Model; KL Divergence | IR Ch2 | slides
    10-21 | Relevance and Novelty | IR Sec. 2.3 | slides | Assignment 7 out. Due 11/4.
    10-26 | Midterm review
    10-28 | Midterm exam
    11-2 | Web Search | IR Ch15 | slides
    11-4 | Link Graph; Link Analysis | Ch4; IR Ch15 | slides | Assignment 8 out. Due 11/11.
    11-9 | PageRank | IR Ch15; PageRank papers | slides
    11-11 | Key/Value Pairs | Ch4 | slides | Assignment 9 out. Due 11/25.
    11-16, 11-18 | No class.
    11-23 | Key/Value Pairs | Ch4 | slides
    11-25 | Web Knowledge Base; Topic-Oriented PageRank | IR Ch15 | slides | Assignment 10 out. Due 12/7. Assignment 11 out. Due 12/16.
    11-30 | Recommender Systems | IR Sec. 10.1 | slides
    12-2 | Dynamic Search | Papers | slides
    12-7 | Social Search | Papers | slides
    12-9 | Conclusion and Term Review | | slides | Assignment 10 due.
    12-14 | Study week starts.
    12-16 | Study week. Assignment 11 due.
    12-18 | Final exam, 9-11am
    Companies using Spark: company list