Algorithms for Large Data Sets: Theory and Practice

Course
Identifier: 
COM S 4350

Offered during Fall Semester each year.

  1. Credits and contact hours: 3 credits, 3 contact hours
  2. Instructor’s or course coordinator’s name: Pavankumar Aduri
  3. Text book, title, author, and year None required
  4. Other supplemental materialsMining of Massive Data Sets, Leskovec, Rajaramn and Ullman

Specific course information

  1. Brief description of the content of the course: Algorithmic challenges involved in solving computational problems on massive data sets. Probabilistic data structures, Curse of Dimensionality and dimensionality reduction, locality sensitive hashing, similarity measures, matrix decompositions. Optimization problems in massive data analysis. Computational problems that arise in the context of web search, social network analysis, online advertising etc. Practical aspects include implementation and performance evaluation of the algorithms on real world data sets. Graduate credit requires a written report on current research.
  2. Prerequisites or co-requisites: COM S 311 or equivalent; for graduate credit: graduate standing or permission of instructor
  3. Required, elective, or selected elective? Selected Elective

Specific goals for the course

  1. Specific outcomes of instruction:
  • Ability to apply mathematical ideas and algorithmic principles in modeling computational problems that arise in the context of massive data sets. (1)
  • Ability to use the theoretical tools in a practical environment and design programs to solve computational problems that arise in the context of massive data sets. (6)

Brief list of topics to be covered

  • Basics of Probability Theory
    • Random Variables
    • Expectation, variance
    • Markov, Chebyshev and Chernoff Bounds
  • Probabilistic Data Structures
    • Theory behind hashing
    • Memory efficient data structures: Bloom Filters, CountMin Sketch
  • Similar Items
    • Measuring similarity among items
    • Document similarity
    • Dimensionality reduction
    • Locality sensitive hashing
    • Latent semantic analysis
  • Components of Search Engine
    • Crawling and ranking
      • Web crawlers
      • Google’s page rank algorithm
      • Hubs and authorities in web
  • Social Network Analysis
    • Substructures and communities
    • Influence propagation
  • Matrix Decompositions
  • Online Adverting and Recommendation Systems
    • Advertising on Web, placing ads along with search results, Ad words problem
    • Predicting items an individual user might like