Aditya Parameswaran

I am an Associate Professor at the University of California, Berkeley, with a joint appointment at the I School and EECS. I co-direct the EPIC Data Lab. I co-founded Ponder, which was acquired by Snowflake in October 2023. I am part of the Data Systems & Foundations and Human-Computer Interaction groups, and I am affiliated with the Berkeley Institute of Data Science.

My research interests are broadly in building tools for simplifying data science at scale, i.e., empowering individuals and teams to leverage and make sense of their large datasets more easily, efficiently, and effectively.

We are always looking for postdocs, PhD, MS, and UG students or research/development staff to join our efforts! If you are a postdoc or staff applicant, feel free to email me directly with your CV and qualifications. If you are an aspiring PhD student, please apply to the EECS or I School PhD programs. If you are an MS or UG student, feel free to fill out this form: it is rare that we will work with UG/MS students outside UC Berkeley except in cases of unusually good fit.

Formal Biographical Sketch

Aditya Parameswaran is an Associate Professor in the School of Information (I School) and Electrical Engineering and Computer Sciences (EECS) at UC Berkeley. Aditya co-directs the EPIC Data Lab, a lab targeted at low/no-code data tooling with a special emphasis on social justice applications. Until its acquisition by Snowflake in October 2023, Aditya served as the President of Ponder, a company he co-founded with his students based on popular data science tools developed at Berkeley. Aditya is affiliated with the Berkeley Institute of Design, and the Berkeley Institute of Data Science — and is part of the Data Systems & Foundations and Human-Computer Interaction groups at Berkeley. Aditya develops human-centered tools for scalable data science — making it easy for end-users and teams to leverage and make sense of their large and complex datasets — by synthesizing techniques from data systems and human-computer interaction. His visualization and data exploration tools have been downloaded and used by millions of users in a variety of domains.

Until June 2019, Aditya was an Assistant Professor in Computer Science at the University of Illinois. He spent a year as a postdoc at MIT CSAIL following his PhD at Stanford. Aditya has received various awards for his work, including being named a Kavli Fellow by the National Academy of Sciences (2023), the IIT Bombay Young Alumnus Achiever award (2022), the SIGMOD 2020 Best Paper Award (2020), the Alfred P. Sloan Research Fellowship (2020), the VLDB Early Career Research Contributions Award (2019), the Army Research Office Young Investigator Program Award (2018), the NSF CAREER Award (2017), the TCDE Rising Star Award (2017), the Dean's Excellence in Research Award (2018) and the C. W. Gear Junior Faculty Award (2017) from the University of Illinois, multiple "best" doctoral dissertation awards (from SIGMOD, SIGKDD, and Stanford in 2014), "Excellent" teacher awards from Illinois (2015, 2017), a Google faculty award (2015) and focused research award (2017), the Key Scientific Challenges award from Yahoo!, six best-of-conference citations (VLDB 2010, KDD 2012, ICDE 2014, ICDE 2016, AISTATS 2017, VLDB 2017), a best demo award (ICDE 2019) and a best demo honorable mention (SIGMOD 2017).

In the last five years, he has served as Faculty Equity Advisor for the I School, the Workshops Co-Chair at SIGMOD, the US Sponsorship Chair at VLDB, and on the steering committee of the HILDA (Human-in-the-loop Data Analytics) Workshop @ SIGMOD and the DSIA (Data Systems for Interactive Analysis) Workshop @ VIS. He has also served on program committees of various database, data mining, web, systems, visualization, and crowdsourcing conferences, as well as an associate editor of SIGMOD Record (2014-2019). His research group is supported with funding from the NSF (NRT, Future of Work, CAREER, Medium, AITF, BigData), the NIH (2X), the Army Research Office, the Sloan Foundation, the Toyota Research Institute, Microsoft, Sigma Computing, Adobe, the Siebel Energy Institute, Facebook, and Google.

News

November 8, 2023: Excited to announce a collaboration with LangChain, a leading LLM framework, to support automatic discovery of evaluation functions and assertions by analysis of prompt version histories. Blog post here.
October 24, 2023: Delighted to announce that our company Ponder was acquired by Snowflake, the leading cloud data warehouse vendor, to accelerate their data science offerings. It has been a wild ride, from inception in 2021, to a $7M seed fundraise by leading VC firms, to scaling the company up to 15 employees, to building a never-before-seen product running data science libraries in data warehouses, to acquisition. Posts from Snowflake, VentureBeat, and us.
August 30, 2023: We presented three full papers (one by Dixin, one by Shreya, and one jointly by Shreya and Stephen) and one demo paper (by Dixin) at VLDB.
August 29, 2023: Here's a blog post describing our VLDB paper on Transactional Panorama, led by Dixin Tang.
August 25, 2023: We are part of a new NSF-funded $3M AI training program called CRELS (Computational Research for Equity in the Legal System), with colleagues in social sciences and statistics.
August 15, 2023: Welcome to Yiming Lin, our newest postdoc, who comes to us from UC Irvine!
August 8, 2023: We released a new preprint on how one can leverage ideas from declarative crowdsourcing to improve accuracy and reduce cost when processing data with Large Language Models. It was nice to revisit work from a decade ago!
June 28, 2023: The CA state budget includes $6.87M in funding to support the Police Records Access project, which we are part of, along with our partners at the Stanford and Berkeley Schools of Journalism.
June 27, 2023: Nice to see a shoutout to our work on police roster data cleaning and extraction from a local journalism article on San Francisco law enforcement settlements. Congrats Aditya M and Diana Q!
June 18, 2023: I enjoyed giving a keynote at the DEEM workshop at SIGMOD, with my talk titled "Enhance, don't Replace: a Recipe for Success in Data Science Tooling".
May 31, 2023: Dixin Tang, a postdoc working with me, will be an assistant professor at University of Texas, Austin. Go Dixin!
May 31, 2023: Dixin and Fanchao's visualization tool for spreadsheet computation networks was accepted at VLDB'23.
May 24, 2023: I enjoyed appearing on the data stack show, talking about how decoupling APIs from execution engines is important.
May 18, 2023: Our work with journalists and public defenders as part of the EPIC lab was highlighted in the LA Times news article about the new college of computing, data science, and society.
May 11, 2023: I enjoyed serving on a panel on AI-meets-databases at the NorCal DB day.
May 1, 2023: Our paper on GATE, an efficient, automated, and precise system for ML-centric data validation, is up on arxiv.
March 27, 2023: Our work with journalists as part of the EPIC data lab is having real impact: we helped journalists identify that responses to a FOIA request were inadequate, as detailed in this news story.
March 17, 2023: Modin now supports NumPy in addition to Pandas; we can push down linear algebra operations to distributed computing engines.
March 6, 2023: I wrote a rejoinder to the "Big Data is Dead" blogpost: read it here.
March 4, 2023: Delighted to be inducted as a Kavli Fellow by the National Academy of Sciences at the US Frontiers of Science Symposium in Irvine.
February 9, 2023: Enjoyed speaking at the Data Dividend, targeted at "harnessing the power of people, processes and technology to unlock value from data", an event hosted by the Economist and by IBM.
February 10, 2023: Dixin's paper on compaction of spreadsheet graphs was accepted at ICDE!
January 13, 2023: Dixin's paper on transactional panorama — bridging transactions and interactive dashboards — was accepted at VLDB!
January 1, 2023: What a year for open-source efforts! Modin reached 5M downloads, Lux reached 500K downloads, and Nbsafety reached 150K downloads.
December 1, 2022: Modin now ships as part of the AWS SDK for Pandas as well as AWS Glue — showcasing how useful Modin is for data wrangling and ETL with the pandas API at scale.
November 29, 2022: Congrats to my former student, Doris Lee, on being named a Forbes' 30 under 30 in Enterprise Technology!

September 16, 2022: Shreya and Rolando's paper on lessons from interviewing ML Engineers on MLOps practices and challenges has been posted on arxiv.
September 13, 2022: We formally unveiled our new lab, the EPIC Data Lab. Here is an article from CDSS about the lab. We are thankful to our current sponsors, Microsoft, Google, and Sigma Computing, along with the National Science Foundation.
September 1, 2022: Our paper on NBSlicer (led by Shreya and Stephen) and a vision paper on MLOps (led by Shreya) were both accepted at VLDB'23!
August 18, 2022: Here's our fourth blog post comparing Pandas to SQL.
July 26, 2022: Here's our third blog post comparing Pandas to SQL.
July 14, 2022: Here's our second blog post comparing Pandas to SQL.
June 30, 2022: Our CACM paper on ShapeSearch, the invited extended version of our SIGMOD best paper, is out here.
June 28, 2022: Here's our first blog post comparing Pandas to SQL.
June 23, 2022: Congrats to Drs. Doris Lee, Devin Petersohn, and Doris Xin on their graduation!
June 15, 2022: New paper from Joe and me on our new data engineering class at Cal!
June 1, 2022: New paper with CMU folks on extending Lux to remember analysis history in a notebook, and use it to bias recommendations towards recent operations or past interest - to appear at Eurovis.
May 20, 2022: We released a position paper on the vision behind the Sky lab.
May 19, 2022: Startups all-around! My former student, Doris Xin, now serves as CEO of a startup, Linea.
April 30, 2022: Enjoyed giving a keynote at the SIAM Data Mining Conference.
March 9, 2022: Excited to announce the launch of Ponder, a company I co-founded with my students. Ponder raised $7M from top-tier VCs to develop scalable, enterprise-ready data science tools, leveraging our success with Modin and Lux. I am serving as the President of Ponder, while staying on as faculty at Berkeley.
February 15, 2022: Modin hit 1.5M downloads this week, with 200K downloads in the last month - grateful to see the groundswell of adoption.
February 2, 2022: Delighted to receive the Young Alumni Achiever's award from my alma mater, IIT Bombay.
January 6, 2022: I presented a talk on our mission to develop enterprise-ready pandas at CIDR, see recording here.
November 18, 2021: Doris presented Lux at the Data Umbrella; see recording here.
November 11, 2021: Shreya Shankar presented her work on MLTrace at the Toronto ML Society as well as at Facebook; recording here.
October 26, 2021: Stephen Macke was interviewed about NBSafety in the Software Engineering Daily Podcast.
October 28, 2021: Doris Lee presented Lux at PyData, while Modin was featured on the Anaconda blog.
October 1, 2021: Our papers on Lux and Modin were accepted to VLDB 2022!
September 15, 2021: Tenure!
September 7, 2021: We received a $2M NSF grant to kickstart the EPIC Data Lab, short for Effective Programming, Interaction, and Computation with Data. Articles on EPIC here and here. We're working on the messy data challenges in criminal justice, along with public defenders via the NACDL (National Association of Criminal Defense Lawyers) and journalists via the Big Local News/KQED. We're part of a consortium called CLEAN: Community Law Enforcement Accountability Network.
September 1, 2021: Delighted to welcome Nithin Chalapathi and Shreya Shankar as new EECS PhD students!
August 13, 2021: Doris Lee and Devin Petersohn wrap up their dissertations on visualization recommendation and dataframe systems respectively. Congratulations Dr. Lee and Dr. Petersohn! Exciting times ahead!!
August 9, 2021: Devin Petersohn gives his dissertation talk on dataframe systems. Devin's work has laid the groundwork for how to reason about and optimize dataframe computation. Congrats Devin!
August 3, 2021: Stephen Macke wraps up his dissertation! Stephen's work on minimizing error while ensuring interactivity in a range of settings has been a joy to be a part of.
July 26, 2021: Tana Wattanawaroon defends his dissertation on supporting efficient computation in spreadsheets, titled "Generalizing Spreadsheet Computation for Evolving Spreadsheets at Scale". Congratulations Tana!
July 12, 2021: Our new PhD student, Shreya Shankar, is off to the races with MLTrace, a lightweight approach for instrumenting ML code to allow for end-to-end debugging, reproducibility, and introspection. Super exciting stuff; watch this space!
July 1, 2021: Our scalable dataframe system, Modin, has over 6000 github stars and over 1M downloads at this point. We're excited that there is so much organic traction and interest!
May 14, 2021: Our technical report on Lux is out! Lux has users across a range of industries at this point, including insurance, retail, and education, and thousands of github stars. Our paper introduces our always-on visualization framework, our lightweight intent language, and how we made Lux interactive when operating on large dataframes.
Apr 19, 2021: New Lux release; v0.3.0 supports a relational database backend, geovis, Jupyter lab, matplotlib instead of altair, and more! Dashboard export (to streamlit, data pane, HTML, etc.) is forthcoming. More here.
Apr 16, 2021: Doris Xin presents her dissertation talk on "Usable and Efficient Systems for Machine Learning"! Congratulations Dr. Xin! Doris is off to be a CEO at a new startup!
Apr 13, 2021: Doris Lee's paper in collaboration with folks at Tableau Research (Vidya Setlur, Melanie Tory) on developing a taxonomy for visualization recommendation was accepted at VIS 2021 via TVCG!
March 26, 2021: Doris Xin's paper in collaboration with Google folks on understanding production machine learning pipelines in TFX (along with my longtime collaborator Alkis Polyzotis and Hui Miao) was accepted as an industry paper at SIGMOD'21!
March 15, 2021: Our paper on leveraging think-time for opportunistic dataframe query evaluation was published in the IEEE Data Engineering Bulletin and showcased at ML and Data Projects to know.
February 15, 2021: Thrilled to see continued Lux adoption and interest. We had another industry blog post, yet another one, and hit over 700 stars on github.
January 15, 2021: Stephen Macke won the CIDR gong show for presenting his take on the next generation of notebooks! Devin Petersohn also presented his work on scalable dataframes.
January 10, 2021: Dual VLDB'21 accepts! Our paper on a general-purpose spreadsheet exploration tool, enabling zoom in/out, called NOAH, led by Sajjadur Rahman, was one. Our paper on NBSafety, a Jupyter kernel for safe notebook interactions, led by Stephen Macke, was another.
January 6, 2021: A virtual welcome to our new postdoc, Dixin Tang, who is joining us from U Chicago having worked with Aaron Elmore and Mike Franklin.
January 1, 2021: I took on the role of the Faculty Equity Advisor at the School of Information. See more here.
December 13, 2020: Our paper studying industry users of AutoML and what next-gen AutoML tools should look like was accepted at CHI'21, led by Doris Xin, Eva Wu, and Doris Lee.
December 9, 2020: Doris Lee passed her qualifying exam!
November 20, 2020: Much Lux love coming from industry. This LinkedIn post has nearly 8000 shares. This Medium article was written by another industry person to introduce Lux to the general public. Lux was also listed as a "Project to Know" here.
November 10, 2020: Our quest for safer notebooks continues with NBSafety, led by Stephen Macke! NBSafety automatically tracks lineage of variables in the notebook, providing suggestions to avoid executions of stale cells that could lead to incorrect or non-reproducible behavior. Lots of heavyweight program analysis techniques leading to a delightful, unobtrusive user experience. NBSafety was first presented at PLATEAU 2020, a PL-HCI workshop.
November 4, 2020: Congratulations to Doris Xin for passing her qualifying exam!
November 3, 2020: Congratulations to Stephen Macke for defending his thesis!
October 23, 2020: Enjoyed writing this perspective piece on the challenges and opportunities from working with domain experts on building interactive visualization tools, appearing at Patterns.
October 10, 2020: We've poured all of our experience building visualization recommendation tools over the years into Lux, led by Doris Lee. Lux provides in-situ visualization recommendations within notebooks as an add-on to dataframes instead of the standard tabular dataframe view. Doris demonstrated Lux at JupyterCon 2020.
October 7, 2020: Devin Petersohn wrote about how Modin helps data scientists be more productive here.
October 9, 2020: Congratulations to Tana Wattanawaroon for passing his preliminary exam!
October 3, 2020: Yay! Stephen Macke's paper on tighter confidence intervals for approximate query processing was accepted to ICDE'21!
September 14, 2020: Excited to have our paper on visualization recommendation for genomics out in the Patterns journal from Cell Press, led by Silu Huang (now at MSR) and Charles Blatti. Paper here. Our work on this project started as part of the NIH BD2K center a while back.
September 1, 2020: I presented a two-part keynote at the VLDB PhD workshop. One of the parts was on my experiences dealing with rejection; check out the recording here.
July 15, 2020: Our vision paper on scalable dataframe systems has been accepted at VLDB'20.
June 16, 2020: Our paper on ShapeSearch won the SIGMOD 2020 Best paper award, one of two awards out of over 450 submissions. It's amazing to see SIGMOD appreciate non-traditional usability-oriented work. Congratulations Tarique! Here's an (overly generous) article on the award.
June 10, 2020: We have open-sourced our spreadsheet benchmark, sheetperf.
June 9, 2020: I appeared on the Software Engineering Daily podcast to discuss Human-in-the-loop Data Analytics.
May 26, 2020: My former student Silu Huang, now a senior researcher at Microsoft Research, was awarded the Jim Gray dissertation award honorable mention, given to the second ranked dissertation in data management. Congratulations, Silu!
May 19, 2020: Slides and other materials from my Spring class on data engineering are available here. I squished the entirety of a traditional database class into the first half of the semester, focusing on user-facing aspects. I then covered non-traditional topics: json/doc stores, IR, spreadsheets, data frames, OLAP/vis, col. stores & compression, parallel proc., streaming/sketching, security & privacy, graph proc., contrasting with the relational approach. The key emphasis was on principles underlying data systems that data scientists may consider using, and how to pick between them for a given situation.
May 16, 2020: Congratulations to my Ph.D. student graduates! Tarique Siddiqui is off to be a Senior Researcher at Microsoft Research in the Databases group. Sajjadur Rahman is off to be a Researcher at Megagon Labs. Liqi Xu is off to be a Research Scientist at Facebook.
May 16, 2020: Congratulations to my M.S. student graduates! Angela Lee is off to Google, while Jaewoo Kim is off to AWS.
May 4, 2020: We released another version of our covidvis tool. We also released a manually gathered intervention dataset to go with the tool. Grateful to receive positive feedback for our efforts.
May 1, 2020: Two papers accepted to HILDA this year: Pingjing et al. lead a paper on results from the first lab-based user evaluation across a range of analytical tasks on spreadsheets, identifying roadblocks and opportunities. Doris et al. lead a paper on understanding ML development patterns from surveying a large collection of real-world ML traces.
April 15, 2020: Doris Lee participated in the CHI'20 doctoral consortium!
April 10, 2020: We released the first version of our covidvis tool to help make sense of the impact of interventions on the progress of COVID-19. Collaborators include epidemiologists, statisticians, and public health folks.
April 1, 2020: SIGMOD paper acceptances: our paper on benchmarking spreadsheets, led by Sajjadur Rahman, and our paper on ShapeSearch, a multi-modality interface for querying for visual patterns led by Tarique Siddiqui.
February 12, 2020: Excited to be named one of the Alfred P. Sloan research fellows in Computer Science this year! Articles here, here, and here.
February 11, 2020: Preprint on Genvisage is live. Genvisage is a tool for explaining the results of genomics experiments; done in collaboration with Saurabh Sinha and Charles Blatti, led by Silu Huang.
January 28, 2020: Congrats Doris Lee for winning a Facebook fellowship! She's one of 36 fellows from over 1800 applicants. Article here.
January 26, 2020: Here is a photo tribute to my advisor, Hector Garcia-Molina, who passed away earlier this year, assembled by his Ph.D. students.
January 24, 2020: The SIGMOD 2020 workshops list is live. Great list of upcoming workshops!
January 7, 2020: The first of many papers on the Modin project is live! ArXiV link here. Modin is aiming to make dataframes more scalable by applying database techniques, and is currently being used in a number of companies, with dozens of collaborators and lots of interest on github. Led by Devin Petersohn.
December 6, 2019: The Morning Paper, a popular blog that covers academic papers, covered our paper on spreadsheet benchmarking. Thanks Adrian!
November 13, 2019: A nice article on the new Data Systems & Foundations group at UC Berkeley, from the Division of Data Science and Information. Excited for what lies ahead!
September 27, 2019: A generous article on the VLDB Early Career Award, from the School of Information. Here's the paper I had the privilege of writing as part of the award.
September 9, 2019: Double Doris Xin whammy! Helix was mentioned as a "project to know" in a blog post from Amplify Partners. Thanks Sarah! And Doris was selected as a Heidelberg Laureate. Congratulations!
August 26, 2019: Grateful to receive the VLDB Early Career Research Contributions Award for "developing tools for large-scale data exploration, targeting non-programmers." The award is given for research impact through a specific technical contribution of high significance since completing the Ph.D. (for up to 8 years post Ph.D.).
August 22, 2019: Please send us your most compelling SIGMOD workshop ideas for SIGMOD2020! Deadline September 27. Here's the call for proposals.
July 1, 2019: *BIG NEWS!* I moved to UC Berkeley, with a faculty appointment at the School of Information and the EECS Department; you can find articles about my appointment in EECS and the I School. Berkeley has been making some exciting moves in the broad data and information space, including a new Division of Data Science and Information, a very popular Masters in Information and Data Science program, and a new and increasingly popular data science major. Looking forward to being a part of this journey, and to moving back to the Bay Area! Illinois has been a wonderful home for the past 5-odd years with terrific students, a collegial research environment, and Midwestern charm (plus baby goats!); I'm going to miss my colleagues, the university, and the town terribly.
June 25, 2019: Our new paper describing our design study with Zenvisage was accepted at VAST'19 at VIS. Congrats Doris Lee!
June 20, 2019: Our vision paper on the data management and HCI challenges underlying AutoML was published in the IEEE Data Engg Bulletin.
June 20, 2019: Silu Huang defended her thesis on versioning-meets-databases. Congrats Silu!
May 1, 2019: Our proposal received a runner-up mention for the Facebook Probability and Programming Research Award.
April 11, 2019: Congratulations to the DataSpread team for a best demo award at ICDE 2019! The best demo awards went to two out of 24 demonstrations.
March 20, 2019: Silu Huang accepted her offer to be a researcher at the DMX group at Microsoft Research! I interned in the DMX group back in 2010 and it's a fantastic place to work.
March 15, 2019: DataSpread's asynch computation framework was accepted at SIGMOD'19. Congrats to Mangesh, Tana et al.!
March 1, 2019: Yihan Gao defended his thesis on data compression and data extraction, titled "Extracting and Utilizing Hidden Structures in Large Datasets". Yihan will return to his undergraduate alma mater to be an assistant professor at Tsinghua University. Congratulations Yihan!
February 10, 2019: Our demo paper on DataSpread's navigation, formula, and relational capabilities was accepted at ICDE'19. Congrats Mangesh, Tana, Sajjadur, and gang!
February 1, 2019: Mangesh Bendre is leaving the group to start as a research scientist at Visa Research. Congratulations Mangesh! Mangesh's thesis on DataSpread is up at this link.
December 15, 2018: Congratulations to Doris and gang for our IUI 2019 paper (link here) identifying a new fallacy in data exploration, the drill-down fallacy, and developing techniques to work around it.
December 1, 2018: Yay! My student Mangesh Bendre (coadvised with Kevin Chang) defended his thesis on DataSpread. Mangesh has spearheaded the development of DataSpread, and was instrumental in many of the key innovations so far: the hybrid data model, positional indexing, and asynchronous formula computation.
November 15, 2018: Congrats to Doris and the rest of the Helix team for the VLDB 2019 paper on the design of Helix, our human-in-the-loop machine learning system.
August 13, 2018: Congratulations to Silu, Liqi et al. (w/ Aaron Elmore) for a "best of conference" nomination for our VLDB 2017 paper on the design and implementation of Orpheus, our versioned database system!
August 13, 2018: Doris's IEEE D.E. Bulletin paper articulating our vision for a visual discovery assistant, called VIDA, will be out soon. Thanks to Alexandra for inviting us.
August 10, 2018: Shreya's paper on incorporating constraints for more accurate crowd-powered sorting was accepted as a short paper at CIKM 2018.
July 11, 2018: Thrilled to receive the Army Research Office Young Investigator Program Award for our work on decoupling perspectives in crowdsourcing. Thanks to the CS Department for the generous article!
June 30, 2018: Our SIGMOD blog post on why visual data exploration introduces a number of new data management challenges is up. Thank you to Georgia Koutrika for inviting me!
June 15, 2018: Our project page for Helix is up!
June 15, 2018: Short papers accepted at IDEA on iteration in machine learning workflows, and at HCOMP on quality evaluation methods for crowdsourced segmentation.
May 27, 2018: I am serving on the steering committee of the DSIA workshop @ VIS 2018. Consider submitting your latest and greatest work here!
May 2, 2018: Demos on Helix, our human-in-the-loop machine learning tool, and ShapeSearch, our flexible shape-based trend-line querying tool, were accepted at VLDB 2018.
April 25, 2018: Our paper on Needletail, an efficient sampling engine for browsing, was accepted at the HILDA Workshop at SIGMOD 2018. We've used Needletail in a number of papers on scalable approximate visualization generation, so we're glad to have this finally out there!
April 15, 2018: Our paper on accelerating human-in-the-loop ML, a vision paper for the Helix project, was accepted at the DEEM Workshop at SIGMOD 2018. Lots more to come on Helix in the near future. Congrats Doris!
April 1, 2018: Thrilled to receive the 2018 Dean's Excellence in Research Award from the University of Illinois, given to assistant professors with an outstanding research profile + impact. Delighted to be able to celebrate with the group (photo on the right)!
March 1, 2018: Happy to be recognized with a spot on the "List of teachers rated as Excellent" for the second year in a row!
February 15, 2018: Our demo paper (w/ folks at UChicago) on generating succinct diffs between data versions was accepted at SIGMOD 2018.
February 10, 2018: Mangesh's paper on data models and indexes for scalable spreadsheets has been accepted to ICDE 2018! This paper lays out the groundwork for our DataSpread project, many years in the making.
December 12, 2017: New paper on quickly identifying a succinct difference (or "diff") between two relational datasets here. We characterize the complexity of this problem, based on varying the classes of operators and types of attributes.
December 10, 2017: More Kelly news! Kelly received the Snap Research Scholarship and the CRA Undergraduate research award honorable mention. Woohoo!
December 1, 2017: Interested in trying out our latest version of Zenvisage? Here's the link: zenvisage.cs.illinois.edu. More at our Medium blog post.
November 18, 2017: Paper studying scalability issues in Microsoft Excel by analyzing a large collection of Reddit posts, accepted at CHI 2018. Congrats Kelly Mack (an amazing achievement for an undergrad)! In other news, Kelly was also nominated for the CRA undergraduate research award.
November 10, 2017: Paper on our automatic data lake extraction tool accepted at SIGMOD 2018. Our tool automatically identifies the components corresponding to formatting and filters them out to extract a structured representation, with high accuracies on log files from github. Congrats Yihan and Silu!
October 15, 2017: My O'Reilly Blog post on "Enabling Data Science for the Majority" is live! In here, I articulate that there are 5 BIG challenges in democratizing data science, and describe some of our work as well as some of the other work in this space. Read this if you want to find out what's new and cool in data science research.
October 10, 2017: New preprint on characterizing the spectrum of scalability issues in Microsoft Excel via Reddit posts here as part of our DataSpread project. Led by Kelly, our intrepid undergrad!
October 1, 2017: Our multi-year effort in participatory design with Zenvisage, along with scientists from material science, genetics, and astrophysics, is chronicled here by the Zenvisage gang. Many interesting insights on how visual exploration systems like Zenvisage can fit into scientific data exploration workflows + many real instances of valuable scientific findings gained from the process!
September 11, 2017: Thanks to new funding from the NSF Algorithms in the Field (AitF) program, we can advance scalable visualization by applying sublinear time techniques, along with the super smart theory duo of Ronitt Rubinfeld (MIT) and Ilias Diakonikolas (USC). NSF page here.
September 1, 2017: VLDB Blog Posts! Here they are:
- "Towards Automating Insight", here.
- "Drawing Conclusions Early with Incvisage", here.
- "Painless Data Versioning for Collaborative Data Science", here.
- "Crowdsourcing in Practice: Our Findings", here.
August 15, 2017: Grateful to receive the C.W. Gear Junior Faculty Award from the University of Illinois! Thanks to the Department of CS for being such a supportive environment for junior faculty!
August 1, 2017: New/Updated preprints:
- on Needletail, our "any-k" browsing and sampling engine, here.
- on FastMatch, an algorithm for rapidly matching histograms to a target, applying a variety of systems and algorithmic ideas, here; a key component of Zenvisage.
- on DataSpread, studying representation and indexing schemes for spreadsheet data, here.
- on Catamaran (formerly known as Datamaran), our unsupervised extraction tool for large-scale extraction from data lakes, here.
May 18, 2017: The OrpheusDB demo received a best demo honorable mention! Congrats to Liqi + Silu! Missed it at SIGMOD? You can still catch it here: video.
May 15, 2017: Paper on IncVisage, our incrementally improving visualization algorithm and interface, has been accepted to VLDB'17! Paper here. Joint work with theorists at MIT and Waterloo, and HCI/Viz folks at Illinois. Perhaps the first paper that has theory, DB, and HCI co-authors? (Would love to be corrected if not.)
April 15, 2017: Thrilled and honored to receive:
- The NSF CAREER Award: Abstract here. Excited to pursue the vision of optimizing "open-ended" crowdsourcing! Vision paper from IEEE Data Engg. Bulletin here.
- The TCDE (Technical Committee on Data Engineering) Early Career Award, awarded for an individual's whole body of work in the first 5 years after the PhD. The award citation: The award is for developing new interactive tools and techniques that expand the reach of data analytics, enabling powerful data-driven discoveries by experts and non-experts alike.
April 15, 2017: Orpheus Updates: demo accepted at SIGMOD 2017; paper accepted at VLDB 2017 (no revisions!); open-source release here.
April 3, 2017: The New York Times cited the book on crowdsourced data management that Adam Marcus and I wrote. Article here.
April 3, 2017: The HILDA 2017 workshop (co-located with SIGMOD) program is up.
March 1, 2017: Manas's TKDE paper on smart drill-down (from the "best of ICDE 2016") was accepted.
February 20, 2017: Vision paper on next-gen visualization recommendation systems with Manasi Vartak, Sam et al. is out at SIGMOD Record. Link here.
January 30, 2017: My student Silu Huang won the MSR PhD Fellowship: the first Illinois student since 2011! A great honor! Silu has recently been working on Orpheus.
January 30, 2017: Yihan's paper on calibrating classifiers has been accepted as an ORAL presentation at AISTATS'17!
January 15, 2017: Our paper analyzing a very large log of all tasks from a popular crowdsourcing marketplace has been accepted at VLDB'17. Learn all about how a marketplace operates, what the distribution of tasks looks like, and how the workers behave here.
January 10, 2017: Three of our key analytics tools, DataSpread, Zenvisage, and OrpheusDB, are moving out of private betas with a few interested parties to the public, available for easy download and deployment. More details and download links here: http://tiny.cc/three-tools.
January 1, 2017: New preprint release on Catamaran (formerly known as Datamaran), our new fully-unsupervised data extraction tool from machine generated data: no examples or supervision needed! Preprint here.
December 1, 2016: Two new paper updates:
- Our paper on SlimFast, a data fusion algorithm, spearheaded by Theo Rekatsinas, has been accepted at SIGMOD'17!
- Our vision paper on Open-Ended Crowdsourcing, spearheaded by the amazing Tova Milo, was accepted to appear in the IEEE Data Engineering Bulletin.
December 1, 2016: I've given a bunch of talks on our three tools for human-in-the-loop data analytics: a distinguished colloquium at Northwestern, a keynote at the Enterprise Intelligence workshop at KDD'16, and BigData events at Illinois and Chicago. Grab the slides here.
December 1, 2016: My exceptional PhD student, Silu Huang, was a finalist in the prestigious Microsoft Research PhD fellowship competition, with an in-person interview at MSR HQ -- so proud of her! Fingers crossed for the eventual outcome.
November 15, 2016: Many new preprints! Grab 'em while they're hot:
- From the Zenvisage project: a paper on visualizations that incrementally improve over time, and a paper on our rapid sampling engine for visualizations.
- From the Orpheus project: a paper describing data models and partitioning schemes for relational dataset versioning.
- From the Populace project: a paper on consensus-based clustering of unstructured data.
- From the DataSpread project: a paper evaluating representation schemes and indexing structures for billion cell spreadsheets.
November 15, 2016: We delivered our tutorial on crowdsourced data management at HCOMP'16: slides part 1 part 2.
November 1, 2016: Thanks to Adobe for supporting our research efforts!
November 1, 2016: New releases for Zenvisage: a paper on the Zenvisage query language, ZQL, and our smart-fuse query optimizer, accepted at VLDB'17 here, plus a demonstration paper accepted at CIDR'17 here.
October 15, 2016: I am one of the chairs of the Human-in-the-loop Data Analytics (HILDA) Workshop at SIGMOD'17, along with the peerless Joe Hellerstein, from Berkeley, and Carsten Binnig from Brown. Website here. Follow us on twitter.
October 1, 2016: My outstanding MS student, Vipul Venkataraman, won the Siebel Scholarship: a cool cash prize of $20K. Well-deserved!
September 15, 2016: Participated in a fun panel on "Will AI eat us all?" with the eminent team of Sunita Sarawagi, Sihem Amer-Yahia, H. Jagadish, and Ihab Ilyas at VLDB'16. Short answer: no.
September 1, 2016: Thanks to NSF for funding our work on DataSpread with an NSF BigData grant. Some Illinois press here.
September 1, 2016: Participated in an invited workshop on the "Theory and Models for Crowds and Networks" with an eminent team of researchers in Oaxaca, Mexico. I presented a tutorial on the data management community's take on crowdsourcing. Slides here.
August 1, 2016: New slick websites for projects:
- DataSpread, our spreadsheet-database hybrid: here.
- Orpheus, our versioned database system: here.
- Zenvisage, our visualization recommendation system: here.
- Populace, our project on optimizing crowdsourcing: here.
July 15, 2016: Our new paper on producing intelligent summaries of facets of papers, with Xiang Ren, Tarique Siddiqui, and Jiawei Han, has been accepted at CIKM 2016!
June 20, 2016: Our paper on data exploration at ICDE 2016 was invited to the TKDE "best of conference" issue, an honor reserved for the top few papers at the conference. Great job Manas!
June 15, 2016: After two years of extensive collaborations with folks at the two institutes, I am now an "official" affiliate of the Institute for Genomic Biology, and the Beckman Institute for Advanced Science and Technology.
June 1, 2016: Our paper on Squish: a tool for compression of relational datasets was accepted at KDD 2016! Our code is open-source and available on Github.
May 1, 2016: New release on our visual data exploration platform zenvisage. Paper here, and website dedicated to Zenvisage here. Contact us if you'd like to test run zenvisage on your datasets!
April 15, 2016: We just received a seed grant from the Siebel Energy Institute to develop Zenvisage in collaboration with battery scientists at Carnegie Mellon! Excited to see what happens next.
April 10, 2016: We received a whopping 3X the number of submissions for the undergraduate research contest. Who knows what these young researchers will accomplish next?
April 1, 2016: Our paper on Decibel, the storage engine underlying DataHub, was accepted at SIGMOD 2016!
March 1, 2016: Thrilled to be among the "List of Teachers Ranked as Excellent by their Students" at Illinois! Happy to see that students enjoy my classes.
January 6, 2016: Adam and I are proud to finally release a book on crowdsourced data management, a labor of love under development for two years. The book not only covers the state of the art, but also contains a survey of both industry users of crowdsourcing and managers of crowdsourcing marketplaces. We hope that this book will be the definitive reference for how crowdsourcing is used in practice. Do send us comments!
January 1, 2016: Our vision paper on the unsolved challenges in large-scale data crowdsourcing was accepted at TKDE.
December 15, 2015: Our paper on interactive exploration using a more expressive drill-down operator was accepted at ICDE 2016 in Finland.
November 25, 2015: Some Illinois press on our NSF-funded DataHub grant. Thrilled and honored to be working with the amazing Sam Madden and Amol Deshpande at solving the problems underlying collaborative data analytics.
November 15, 2015: Our paper on optimally managing worker and answer quality in crowdsourcing was accepted at SIGMOD 2016.
October 1, 2015: We just heard word that NIH has funded our BD2K commons supplement. Looking forward to working with folks at UChicago to improve data publication workflows!
September 15, 2015: Student awesomeness: my student Silu Huang won the 3M foundation fellowship, while Tarique Siddiqui won the Siebel Foundation fellowship.
September 1, 2015: Thanks to the NSF, we now have funding to support research and development on DataHub via a Medium IIS grant with MIT and UMD! Link to the project page here.
August 1, 2015: The full SeeDB paper has been accepted at VLDB 2016 in India!
July 1, 2015: Our JellyBean paper on using humans to count objects in images will appear at HCOMP 2015!
June 9, 2015: Release of a new preprint on calibrating the output of confidence estimates from classification algorithms, using classical learning theory tools. This is work driven by my awesome student Yihan Gao.
June 6, 2015: Our DataHub query language proposal was accepted at TaPP, a focused provenance workshop.
June 1, 2015: Final tally for VLDB 2015 -- three papers and three demos on a variety of topics:
- papers: crowds, visualizations, and versioning;
- demos: data exploration, Excel-meets-databases, and collaborative data analytics.
May 27, 2015: Our paper on versioning principles was accepted at VLDB'15 without any revisions!
May 15, 2015: Undergraduate research news: Andrew Kuznetsov, a freshman working in our group, won the ISUR undergraduate research prize, and Andrew, along with two other freshmen, Andrew Thieck and Radhir Kothuri, won the third prize in the Illinois Engineering Open House competition for their crowdsourcing tool.
May 12, 2015: Our paper on debiasing was accepted at KDD 2015!
April 9, 2015: Our first release of a new project, titled Data-Spread, with my esteemed colleague Kevin Chang and student Mangesh Bendre. Data-Spread is a tool that unifies databases and spreadsheets. You have to see it to believe it!
- Here is a YouTube video showing Data-Spread in action.
- Here is our demo paper on Data-Spread.
March 10, 2015: Four more new preprints in the last month! These were:
- our paper on SeeDB for query driven automatic visualization generation;
- our jellybean paper on counting objects in images; turns out we can do way better than humans or computer vision algorithms!
- our paper on debiasing of batches; crowdsourcing practitioners often use batching to save costs, but this can lead to non-independence: we deal with this issue.
- our versioning theory paper; to build a solid foundation for our DataHub project, we explored how to trade off storage and retrieval costs.
February 9, 2015: Our paper on exploiting correlations to avoid expensive predicate evaluations was accepted at SIGMOD 2015!
February 12, 2015: Many thanks to Google for their support via a Google Faculty Research Award! Excited to be building the next generation visualization toolkit.
December 10, 2014: Three new preprints in the last month! These were:
- smart drill-down, our tool for zooming into portions of a dataset quickly;
- our paper on globally optimal crowdsourcing quality management; and
- our paper on gathering data using the crowd, exploiting a hierarchy and MABs.
November 10, 2014: Three new paper acceptances in the last month!
- Our Datahub paper was accepted at CIDR;
- our rapid approximate visualization generation paper was accepted at VLDB;
- and our paper on generalized confidence intervals for crowdsourced workers was accepted at ICDE!
October 10, 2014: Thrilled to be a part of the new NIH BD2K (Big Data 2 Knowledge) center for revolutionizing genomic data analysis. Thank you, NIH, for the support!
September 2, 2014: We can finally talk about our exciting new project, titled Datahub (i.e., GitHub for Data) on collaborative data science and version management. The ambitious goal is to eliminate the pain-points of data book-keeping while doing collaborative data science.
September 1, 2014: Our paper on pricing for crowdsourcing tasks has been accepted for presentation at VLDB 2015! The paper studies a simple, but important problem: if you have a batch of tasks and a deadline, how should you vary price to meet the deadline?
August 25, 2014: Pleasantly surprised to be selected as the KDD dissertation award runner-up, having already been given the SIGMOD dissertation award! Feel truly lucky to have two communities - SIGMOD and KDD - supporting my work!
August 24, 2014: Had a blast being a keynote speaker at KDD IDEA 2014 - a big thank you to the organizers for inviting me! If this year was any indication, IDEA is going to flourish as a workshop for many years!
August 20, 2014: Our paper on optimally learning maximum-likelihood worker accuracies has been accepted as a work-in-progress paper for HCOMP 2014! The paper tackles the problem of worker quality estimation in a way EM-based algorithms cannot - by providing optimality guarantees.
August 15, 2014: Started at Illinois; exciting times ahead!

Synergistic Activities

I serve on the steering committees of HILDA (Human-in-the-loop Data Analytics) at SIGMOD and DSIA (Data Systems for Interactive Analysis) at VIS. Lots of excitement around this nascent area at the intersection of databases, data mining, and visualization/HCI - join us! I also served as the Faculty Equity Advisor at the School of Information for two terms in 2023 and 2021.

In the recent past, I was a co-chair of Workshops for SIGMOD 2020 and 2021. I served as the US Sponsor Chair for VLDB 2021; I also served as an Area/Associate Chair for HCOMP 2020, VLDB 2020, and SIGMOD 2020, and as a Program Committee member for VLDB Demo 2019 and HILDA 2019 (phew!). I've served on the program committees of VLDB, KDD, SIGMOD, WSDM, WWW, SOCC, HCOMP, ICDE, and EDBT, many of them multiple times. I am serving on the program committee for VLDB 2023-2024.

Selected Projects

Lux: An always-on visualization recommendation system

Lux is a tool for effortlessly visualizing insights from very large data sets in dataframe workflows. Lux builds on half a decade of work on visualization recommendation systems.

Project page here.

Modin: A Scalable Dataframe System

Modin applies database and distributed systems ideas to help run dataframe workloads faster, with over 2M open-source downloads.

Project page here.

NBTools: Better Computational Notebooks

NBsafety and NBslicer make it easy for data scientists to write correct, reproducible code in computational notebooks.

Project page here.

Helix: An Accelerated Human-in-the-loop Machine Learning System

Helix accelerates the iterative development of machine learning pipelines with a human developer "in the loop" via intelligent assistance and reuse.

Project page here.

DataSpread: A Spreadsheet-Database Hybrid

DataSpread is a tool that marries the best of databases and spreadsheets.

Project page: here

Orpheus: Relational Dataset Version Management at Scale

DataHub (or "GitHub for Data") is a system that enables collaborative data science by keeping track of large numbers of versions and their dependencies compactly, and allowing users to progressively clean, integrate and visualize their datasets. OrpheusDB is a component of DataHub focused on using a relational database for versioning.

Project page: here

Populace: A Suite of Crowd-Powered Algorithms

Our work has developed a number of algorithms for gathering, processing, and understanding data obtained from humans (or crowds), while minimizing cost, latency, and error. Since 2014, our focus has been on optimizing open-ended crowdsourcing: an understudied and challenging class.

Project page: here