This course covers algorithmic techniques for solving different types of data science problems. Topics include algorithm design paradigms (divide-and-conquer, greedy, dynamic programming), graph algorithms (shortest paths, minimum spanning trees), computational intractability and NP-completeness, fundamental techniques for processing massive data sets (streaming and approximation algorithms), and basics of supervised learning. You will learn to design, analyze (via theorems and proofs), and implement fundamental data structures and algorithms. This course will also provide the algorithmic background essential for further study of computer science topics.
In more detail, from this course you will learn the basic language and tools for algorithm analysis, as well as several specific problems and general paradigms for algorithm design. We will focus on the theoretical and mathematical aspects in class and on the homework assignments. But because one gains a deeper understanding of algorithms from actually implementing them, the course will include a substantial programming component. Large programming assignments can be done (but do not have to be done!) in pairs. More details will be available when the first programming assignment is given.
We will be covering a great deal of material in this class. I expect the course to be challenging, both in terms of the workload and the difficulty of the material. You should be prepared to do a lot of work outside of class. The payoff will be that you will learn a lot of both useful and interesting things.
The formal prerequisites for this course are ANLY-555 and Python.
Some mathematical maturity also will be expected; students should have some idea of what constitutes a mathematical proof and how to write one. Some knowledge of basic probability will also be helpful.
There is no required course textbook. We will primarily be following lecture notes from various sources, all linked to on this web page. The schedule and lecture notes that we will be using are posted below (they are, however, likely to evolve over the course of the semester).
Grades in the course will be based on problem sets/quizzes (80%) and participation (20%).
I would like to emphasize the rules on working with others on homework assignments. For specified (programming-intensive) assignments, you may work in pairs. Each group may discuss the problem set with at most one other group, but looking at another group's code is strictly forbidden.
For any assignment with a programming component, you may not use code found online without my express permission. If you want to use code you find online, you must ask me first.
For problem sets, limited collaboration in planning and thinking through solutions to homework problems is allowed, but no collaboration is allowed in writing up solutions. That is, you are allowed to work with other students currently taking the course in discussing, brainstorming, and verbally walking through solutions to homework problems. But when you are through talking, you must write up your solutions independently and may not check them against each other. There may be no passing of homework papers between collaborators; nor is it permissible for one person simply to tell another the answer.
If you collaborate with any other students in the course in the planning and design of solutions to homework problems, then you should give their names on your homework papers.
Violation of these rules may be grounds for giving no credit for a homework paper and also for serious disciplinary action. Severe punishments will apply, so please do not violate these rules.
After class or by appointment.
Below is the tentative schedule for the course. I will modify it as the semester progresses.
Class Number | Date | Description | Readings |
Algorithm Design Paradigms | |||
1 | 1/26 | Overview of Algorithm Design, Grade-School Algorithm For Multiplication, Computing Fibonacci Numbers, Models of Computation, Induction, Big-Oh Notation | Mitzenmacher's lecture notes 1 and lecture notes 2. Also read Suresh Venkatasubramanian's musings on the relationship between algorithm design and machine learning. |
2 | 2/2 | Recurrence relations. Binary Search. Divide and Conquer Algorithms: Merge Sort, Fast Integer Multiplication. Strassen's Algorithm for Matrix Multiplication. | Mitzenmacher's lecture notes. |
3 | 2/9 | Dynamic Programming (String Reconstruction, Edit Distance). | Mitzenmacher's lecture notes. |
Graph Algorithms | |||
4 | 2/16 | Intro to Graphs, Adjacency Lists and Adjacency Arrays, Breadth-First Search. DFS and topological sorting. | Demaine's notes and Mitzenmacher's notes |
5 | 2/23 | Dijkstra's Algorithm for Single-Source Shortest Paths. | Mitzenmacher's lecture notes. |
6 | 3/2 | Minimum Spanning Trees---Prim's and Kruskal's Algorithms. Floyd-Warshall (All-Pairs Shortest Paths). | Mitzenmacher's lecture notes on Prim's and Kruskal's algorithms. For the union-find data structure, see these notes. |
Reductions and NP-Completeness | |||
7 | 3/9 | Reductions. | Mitzenmacher's lecture notes For the reduction from maximum bipartite matching to max-flow, see these slides. |
8 | 3/16 | NP, NP-completeness, some NP-complete problems | Mitzenmacher's lecture notes |
Advanced Topics | |||
9 | 3/23 | Approximation algorithms for intractable problems. Fingerprinting and the power of randomness. | Mitzenmacher's lecture notes . For fingerprinting see these notes. |
10 | 4/6 | Intro to streaming algorithms. Finding frequent items and point queries in insert-only streams. Tools from probability theory. | Lectures 1 and 2 of Amit's notes |
11 | 4/13 | Sampling Algorithms for Insert-Only Streams: Approximate Medians, Reservoir sampling, Itemset Frequency Estimation. Estimating Distinct Elements in Insert-Only Streams. | Andrew's slides |
12 | 4/20 | Answering Point Queries for turnstile streams: Count-Min and Count Sketch. Linear Sketches, Johnson-Lindenstrauss, Random Projections for Dimensionality Reduction. | Lecture 4 of Amit's notes |
13 | 4/27 | Probably Approximately Correct (PAC) learning. Occam's razor. | TBD |
14 | 5/4 | TBD | TBD |