CS345A, Winter 2009: Data Mining.



Course Information

NEW NEW ROOM: 200-002. This is the big auditorium in the basement of the History Corner. It seats 163, so there should be plenty of room for us to spread out.

Instructors: Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman (ullman @ gmail dt com).

TA: Anish Johnson (ajohna @ stanford dt edu).

Staff Mailing List: cs345a-win0809-staff@mailman.stanford.edu

Meeting: MW 4:15 - 5:30PM; Room: History Corner basement 200-002.

Office Hours:
Anand Rajaraman: MW 5:30-6:30PM (after class, in the same room)
Jeff Ullman: 2-4PM on the days I teach, in 433 Gates
TA Anish Johnson: Tuesdays 9:15-10:45AM in B26A Gates; Thursdays 1-3PM in B24B Gates

Prerequisites: CS145 or equivalent.

Materials: There is no text. However, if you have the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom), you will find Section 20.2 and Chapters 22 and 23 relevant. Slides from the lectures will be made available in PPT and PDF formats.

Students will use the Gradiance automated homework system, for which a fee is charged. Note: if you have had Gradiance (GOAL) privileges from CS145 or CS245 within the past year, you should have access to the CS345A homework without paying an additional fee. Notes and/or slides will be posted on-line.

You can see earlier versions of the notes and slides covering Data Mining. Not all these topics will be covered this year.

Requirements: There will be periodic homeworks (some on-line, using the Gradiance system), a final exam, and a project on web-mining. The homework will count just enough to encourage you to do it, about 20%. The project and final will account for the bulk of the credit, in roughly equal proportions.


Handouts

Date | Topic | PowerPoint Slides | PDF Document
1/7 | Introductory Remarks (JDU) | PPT | PDF
1/7 | Introductory Remarks (AR) | PPT | PDF
1/12 | Map-Reduce | PPT | PDF
1/14 | Frequent Itemsets 1 | PPT | PDF
1/14-1/21 | Frequent Itemsets 2 | PPT | PDF
1/16 | Peter Pawlowski's Talk on Aster Data | PPTX | PDF
1/16 | Nanda Kishore's Talk on ShareThis | PPT | PDF
1/26 | Recommendation Systems | PPT | PDF
1/28 | Shingling, Minhashing, Locality-Sensitive Hashing | PPT | PDF
2/2 | Applications and Variants of LSH | PPT | PDF
2/2-2/4 | Distance Measures, Generalizations of Minhashing and LSH | PPT | PDF
2/4 | High-Similarity Algorithms | PPT | PDF
2/9 | PageRank | PPT | PDF
2/11 | Link Spam, Hubs & Authorities | PPT | PDF
2/18 | Generalization of Map-Reduce | PPT | PDF
2/18-2/23 | Clustering | PPT | PDF
2/23 | Streaming Data | PPT | PDF
2/25 | Relation Extraction | PPT | PDF
3/2 | On-Line Algorithms, Advertising Optimization | PPT | PDF
3/4 | Algorithms on Streams | PPT | PDF

Assignments

There will be assignments of two kinds.

Gradiance Assignments

Some of the homework will be on the Gradiance system. You should go there to open your account, and enter the class token 83769DC9. If you have taken CS145 or CS245 within the past year, your account for that class should grant you free access for CS345. If not, you will have to purchase the access on-line. Note: If you have to purchase access, use either Garcia-Widom-Ullman, 2nd Edition or Ullman-Widom 3rd Edition (the books used for 145 and 245). Do not purchase access to the Tan-Steinbach-Kumar materials, even though the title is "Data Mining."

You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each of the questions involves a "long-answer" problem, which you should work. The Gradiance system gives you random right and wrong answers each time you open it, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also notice that you have to wait 10 minutes between openings, so brute-force random guessing will not work.
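To see roughly why the 10-minute wait defeats brute-force guessing, here is a back-of-the-envelope sketch. It assumes each opening is an independent, uniformly random guess on every question; the question count and number of answer choices used below are hypothetical, not actual Gradiance settings.

```python
def expected_guessing_wait_minutes(num_questions, choices_per_question, wait_minutes=10):
    """Expected total waiting time before a pure random guesser first scores 100%.

    Assumes every opening is an independent attempt in which each question
    is answered correctly with probability 1 / choices_per_question.
    """
    p_all_correct = (1 / choices_per_question) ** num_questions
    # Number of openings until the first perfect score is geometric with mean 1/p.
    expected_openings = 1 / p_all_correct
    # The wait is paid between openings, i.e. (openings - 1) times.
    return (expected_openings - 1) * wait_minutes


# Hypothetical homework: 5 grouped questions, 4 answer choices each.
minutes = expected_guessing_wait_minutes(5, 4)
print(f"{minutes:.0f} minutes, about {minutes / (60 * 24):.1f} days of waiting")
```

Under these assumed numbers the guesser expects to need about 1024 openings, which is why grouping several questions together makes working the problems the faster route.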

Solutions appear after the problem-set is due. However, you must submit at least once, so your most recent solution appears with the solutions embedded.

Challenge Problems

These are more complex problems for which written solutions are requested. They will be "lightly graded," meaning that we shall accept any reasonable attempt, and those doing exceptionally well will get "extra credit," but there will not be exact numerical grades assigned.

Assignment | Due Date
Gradiance HW #1 | Monday, January 26 (11:59PM)
Challenge Problems #1 Solution | Wednesday, January 28 (In class)
Gradiance HW #2 | Wednesday, January 28 (11:59PM)
Challenge Problems #2 Solution | Wednesday, February 4 (In class)
Gradiance HW #3 | Wednesday, February 4 (11:59PM)
Project Proposal | Monday, February 9 (11:59PM)
Gradiance HW #4 | Wednesday, February 11 (11:59PM)
Gradiance HW #5 | Wednesday, February 18 (11:59PM)
Gradiance HW #6 | Monday, March 9 (11:59PM)
Gradiance HW #7 | Wednesday, March 11 (11:59PM)
Final Project Report | Wednesday, March 11 (In class)

Project

CS345A Project specification:

Course Outline

Here is a tentative schedule of topics:

Date | Topic | Lecturer
1/7 | Introduction | JDU, AR
1/12 | Map-Reduce | AR
1/14 | Frequent Itemsets | JDU
1/16 | Special Lecture on Aster/Map-Reduce, ShareThis | 5:15PM in B12 Gates
1/21 | Frequent Itemsets | JDU
1/26 | Recommendation Systems | AR
1/28 | Similarity Search | JDU
2/2 | Similarity Search | JDU
2/4 | Similarity Search | JDU
2/9 | Link Analysis | AR
2/11 | Spam Detection | AR
2/18 | Generalizing Map-Reduce; Clustering | JDU, AJA
2/23 | Clustering, Streaming Data | JDU
2/25 | Extracting Structured Data from the Web | AR
3/2 | Advertising on the Web | AR
3/4 | Stream Mining | JDU
3/9 | Stream Sampling | DEK
3/11 | Project Reports | students
3/12 | Project Reports | students; Rm. 260-012
3/13 | Project Reports | students; Rm. 260-012
3/19 | Final Exam | 12:15-3:15PM, Rm. 200-002 (regular classroom)

References and Resources
