COSC 548 - Streaming Algorithms

Fall 2016

Georgetown University

Prof. Justin Thaler

Tuesday and Thursday, 9:30 to 10:45

St. Marys 120

Email: justin.thaler at georgetown dot edu

Course Overview

Description

The data streaming model captures settings in which there is so much data that one can only store a tiny fraction of it. It also captures settings where one can store the dataset, but cannot afford to look at the full input every time one wants to answer a question about the data. The goal of a streaming algorithm is to output a very small summary, or "sketch" of the data, such that one can still use the summary to (approximately) answer basic questions about the data. The streaming algorithm will ideally compute the summary in a single pass over the input, with each datum (i.e., stream update) being processed very quickly.

Why you should take this course. There is the obvious reason that the amount of data in the world is exploding. Hence, the stringent constraints of the data streaming model increasingly capture real-world situations. In addition, this course will teach basic methods and tools for designing and analyzing randomized algorithms. This toolset and way of thinking will be broadly useful even outside of streaming environments.

Who Can Take This Course

This course is listed as a graduate elective and therefore its primary audience is computer science masters and doctoral students. While there are no formal prerequisites for the course, familiarity with elementary probability theory and data structures is expected. General mathematical maturity, e.g. comfort with proofs, will also be assumed.

Reference Material

There is no course textbook. We will primarily be following Amit Chakrabarti's course notes, and occasionally Andrew McGregor's slides. I will post a reference for each lecture by the start of class.

Grading

Grades in the course will be based on three problem sets and a final project. The problem sets will count for 60% of the grade, while the final project will count for 40%.

This course focuses on the design and analysis of streaming algorithms from a theoretical perspective. While I will occasionally discuss implementation issues and practical efficiency in lecture, the problem sets will not require programming.

The final project can be either theory or implementation-based. Example course projects are: (1) Read a recent research paper related to a topic covered in the class and write a comprehensive, readable summary and your opinion of it; (2) Work on a research problem related to the course material and your own research, and write a paper following a conference-paper-like format that describes your findings; (3) Implement one or more algorithms covered in the course or from the research literature, and do a comprehensive performance evaluation of your implementation.

The course project can be done collaboratively in groups of up to 3 people. In fact, collaboration is encouraged. However, the more people in a group, the more substantial the project is expected to be.

Academic Honesty

Academic honesty is taken very seriously, in accordance with Georgetown's Honor System. For problem sets, you are welcome to work with others or use other resources, but when you actually write your solutions you must do so by yourself as if you are taking an exam. You must also explicitly list all collaborators with whom you worked and any references or online material you used, separately for each problem. Most importantly, you must never turn in something that you do not understand. I reserve the right to ask you to explain to me something that you turned in; if you cannot do so in a way that demonstrates your understanding, then you will receive no credit and possibly be reported to the Honor Council.

Lateness Policy for Problem Sets

Problem sets should be handed in by the end of lecture on the day they are due. That is, bring a physical copy of the problem set to class to give to me at the end of lecture. Problem sets not handed in by the end of lecture will be considered late, and 10% of the points deducted. If a problem set is late, you may email it to me rather than hand in a physical copy. Problem sets not handed in until the day after the deadline will have an additional 10% deducted. Beyond this point, problem sets will not be accepted.

Course Logistics

Office Hours

After class or by appointment. My office is currently Car Barn 312-B, but I will most likely move to an office in St. Mary's Hall in October.

Schedule

Below is the highly tentative schedule for the course. I will modify it as the semester progresses. Topics and readings will not become "official" until the date of lecture. Hence, I do not recommend reading ahead by more than one lecture, if at all.

Class NumberDateDescription Readings
Part 1. Insert-Only Streams
19/1Introduction to streaming algorithms; course overview; fingerprinting and the power of randomnessLecture 0 of Amit's notes. Also read Lee Rhodes' exposition of the benefits of using sketching algorithms in industrial systems. For fingerprinting, see Justin's handwritten notes
29/6Answering Point Queries and Finding Frequent Items in Insert-Only Streams Lecture 1 of Amit's notes. See also Justin's handwritten notes
39/8Tools from probability theory: Markov's Inequality, Chebyshev's Inequality, Chernoff Bounds, Inclusion-Exclusion Principle, Variance Reduction via Averaging. Homework 1 Released A handy reference on tools from probability theory is Grigory's slides. See also Justin's handwritten notes
49/13Sampling Algorithms for Insert-Only Streams: Approximate Medians, Reservoir sampling, Itemset Frequency Estimation. Andrew's slides. See also Justin's handwritten notes.
59/15AMS Sampling For Estimating High Frequency Moments. Lecture 5 of Amit's notes. See also Justin's handwritten notes.
69/20Estimating Distinct Elements in Insert-Only Streams, Part 1: Simple, supoptimal algorithm. Lecture 2 of Amit's notes. See also Justin's handwritten notes
79/22Estimating Distinct Elements in Insert-Only Streams Part 2: Hyperloglog, KMV, Adaptive Sampling For Adaptive Sampling, see Lecture 3 of Amit's notes. For other content, see Justin's handwritten notes
89/27Detour: Dictionary Data Structures: Chaining, Linear Probing, Cuckoo Hashing, The Power of Two Choices Justin's handwritten notes.
Part 2. Turnstile Streams: Linear Sketches
99/29 Answering Point Queries: Count-Min and Count Sketch. Homework 1 due. Lecture 4 of Amit's notes See also Justin's handwritten notes.
1010/4 Finishing Count Sketch Analysis. Estimating the Second Frequency Moment via the Tug-of-War Sketch. Homework 2 released. Lecture 6 of Amit's notes. For the Count Sketch, see also Justin's handwritten notes.
1110/6 Linear Sketches, Johnson-Lindenstrauss, Random Projections for Dimensionality Reduction, Estimating Small Frequency Moments via Stable Distributions Lecture 7 of Amit's notes See also Justin's handwritten notes
1210/11 Bloom Filters. Sparse Recovery via Invertible Bloom Lookup Tables. Peeling Algorithms. For Sparse Recovery algorithms, see these slides. For Bloom Filters, see the excellent Wikipedia article or Michael Mitzenmacher's lecture notes. Justin's handwritten notes may also be useful.
1310/13 Applications of Sparse Recovery Algorithms: Set Reconciliation, Biff Codes, L_0 Sampling in Turnstile Streams. Intro to Graph Streams. For everything except graph streams, see Justin's handwritten notes. For L_0 Sampling, see also Andrew's slides.
1410/18 Homework Review
1510/20 Streaming Graph Algorithms: Graph Connectivity and Spanners in the Insert-Only Model. Graph Connectivity in the Turnstile Model via L_0 Sampling. For an intro to graph streams and connectivity/spanners in the insert-only model, see Andrew's Lecture 2.1 slides. For Graph Connectivity in the turnstile model, see Andrew's Lecture 2.2 slides. See also Justin's handwritten notes.
Assorted Topics
1610/25 Lower bounds via communication complexity, Part 1 Homework 2 due. Homework 3 released. Lecture 15 in Amit's notes
1710/27 Lower bounds via communication complexity, Part 2 Lecture 16 in Amit's notes
1811/1 Lower bounds via communication complexity, Part 3 Lecture 17 in Amit's notes
1911/3 Stream Verification Part 1 These slides and these slides together cover much of the content.
2011/8 Problem Set Review No reading.
2111/10 Stream Verification Part 2 See these slides
2211/15 Exact Medians Using Multiple Passes in Insert-Only Streams Lecture 9 in Amit's notes
2311/17 Geometric streams, coresets, approximating Minimum Enclosing Ball Homework 3 due Lecture 11 in Amit's notes
2411/22 CR-Precis. Mergeable Summaries. For CR-Precis, see Slide 11 of Andrew's Slides. For mergeable summaries, see the original paper by Agarwal et al.
2511/29Clustering Lecture 12 in Amit's notes and Andrew's Slides
2612/1 Problem Set Review (probably) Reading TBD
2712/6 Other notions of sublinear algorithms: Property testing. Testing of Uniformity. Paul Beame's lecture notes

Problem Sets

Suggested Papers of Interest for the Final Project

Recent Papers on Frequent Items and Quantiles Algorithms Recent Papers on Graph Streams Recent Papers on Streaming Algorithms for Clustering Assorted Recent Papers Papers on Streaming Verification David Woodruff's DBLP page provides a long list of interesting papers, many on streaming algorithms and accessible via a quick Google Scholar search.