50 best public data sets for machine learning

Drawing on GitHub, Forbes, the CMU website and other sources, the overseas blog mlmemoirs has compiled a list of the 50 best public data sets for machine learning. Here it is.

Author: mlmemoirs. Compiled by Guo Yipu.

A few things to know first:

1. What to look for in a data set

According to CMU, a useful data set should meet the following criteria:

The data set should not be messy; otherwise you will spend a lot of time cleaning it.

The data set should not have too many rows or columns; otherwise it will be hard to work with.

The cleaner the data, the better, because cleaning a large data set can be very time-consuming.

There should be an interesting question in mind that the data can answer.
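The size and cleanliness criteria above can be checked quickly before committing to a data set. The following is a minimal sketch, assuming the data is available as CSV text; the function name `profile_csv`, the list of missing-value tokens, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io


def profile_csv(text, missing_tokens=("", "NA", "NaN", "null")):
    """Return row count, column count, and the share of missing cells per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    missing = {name: 0 for name in header}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for idx, name in enumerate(header):
            # A short row counts as missing for the columns it lacks.
            value = row[idx].strip() if idx < len(row) else ""
            if value in missing_tokens:
                missing[name] += 1
    missing_share = {
        name: (count / n_rows if n_rows else 0.0)
        for name, count in missing.items()
    }
    return {"rows": n_rows, "columns": len(header), "missing_share": missing_share}


# Hypothetical sample rows, made up for illustration.
sample = """brand,style,stars
Nissin,Cup,3.5
Maruchan,,4.0
Samyang,Pack,NA
"""

report = profile_csv(sample)
print(report["rows"], report["columns"])
print(report["missing_share"])
```

A high missing share in a column, or a row and column count far beyond what your tools handle comfortably, is a hint that the data set will need substantial cleaning first.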

2. Where to find data sets

  • Kaggle: Familiar to anyone who enjoys competitions. Kaggle hosts all kinds of interesting data sets, such as ramen ratings, basketball data, and even Seattle pet licenses.
  • UCI Machine Learning Repository: One of the oldest sources of data sets, and a good first stop when looking for interesting ones. The data sets are user-contributed, so their cleanliness varies, but most of them are clean and can be downloaded directly from the repository without registration.
  • VisualData: Computer vision data sets grouped by category, with search.

Here are the data sets. Because some were added after the original list was compiled, the total now exceeds 50.

Machine learning data sets


Sentiment analysis

Natural language processing

  • HotpotQA data set: A question-answering data set with natural, multi-hop questions and strong supervision for supporting facts, intended to enable more explainable question-answering systems.
  • Enron data set: Email data from Enron's senior management.
  • Amazon reviews: Approximately 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings, and review text.
  • Google Books Ngrams: A collection of words and phrases (n-grams) drawn from Google Books.
  • Blogger Corpus: 681,288 blog posts collected from blogger.com. Each blog contains at least 200 occurrences of common English words.
  • Wikipedia link data: The full text of Wikipedia, containing nearly 1.9 billion words from more than 4 million articles. It can be searched by word, phrase or part of a paragraph.
  • Gutenberg e-book list: An annotated list of e-books from Project Gutenberg.
  • Hansards Canadian Parliament text: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.
  • Jeopardy: An archive of over 200,000 questions from the quiz show Jeopardy.
  • SMS Spam Collection in English: A data set of 5,574 English SMS messages, labeled as spam or legitimate.
  • Yelp reviews: An open data set released by Yelp, the US counterpart of Dianping, containing more than 5 million reviews.
  • UCI's Spambase: A large spam email data set, useful for spam filtering.
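Several of the corpora above, Google Books Ngrams in particular, are organized around n-grams: runs of n consecutive words together with how often they occur. The sketch below shows the idea on a made-up sentence. It assumes simple lowercase, whitespace-based tokenization, which is a simplification and not how any of the listed data sets were actually built; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(text, n):
    """Count n-grams in text using naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Made-up example sentence, for illustration only.
text = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(text, 2)
print(bigrams.most_common(1))
```

The same counting approach scales down to unigrams (n = 1) for simple word-frequency features, for example when building a baseline spam filter on the SMS or Spambase data.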


Autonomous driving

  • Berkeley DeepDrive BDD100k: At the time of writing, the largest autonomous driving data set. It contains more than 100,000 videos covering more than 1,100 hours of driving at different times of day and in different weather conditions. The annotated images are from the New York and San Francisco areas.
  • Baidu Apolloscapes: A large data set from Baidu that defines 26 different objects, such as cars, bicycles, pedestrians, buildings and street lights.
  • Comma.ai: More than 7 hours of highway driving. Details include the car's speed, acceleration, steering angle and GPS coordinates.
  • Oxford's RobotCar: This data set comes from Oxford's robot car, which drove the same route in Oxford, England more than 100 times over a year, capturing different combinations of weather, traffic and pedestrians, as well as long-term changes such as construction and roadworks.
  • Cityscapes data set: A large data set that records urban street scenes in 50 different cities.
  • CSSAD data set: Useful for the perception and navigation of autonomous vehicles. However, the data set is heavily skewed towards roads in developed countries.
  • KUL Belgium Traffic Sign data set: More than 10,000 annotations of thousands of physically distinct traffic signs in Flanders, Belgium.
  • MIT AgeLab: A sample of the more than 1,000 hours of multi-sensor driving data collected at AgeLab.
  • LISA: Data sets from the Laboratory for Intelligent and Safe Automobiles at UC San Diego, covering traffic signs, vehicle detection, traffic lights and trajectory patterns.
  • Bosch Small Traffic Lights data set: A data set of small traffic lights for deep learning.
  • LaRA Traffic Light Recognition: A data set of traffic lights in Paris.
  • WPI data sets: Data sets for traffic light, pedestrian and lane detection.


Clinical data

  • MIMIC-III: A public data set from the MIT Lab for Computational Physiology, containing de-identified health data for about 40,000 intensive care patients, including demographics, vital signs, laboratory tests, medications and more.

General data sets

In addition to data sets dedicated to machine learning, there are more general data sets that may be interesting.

Public government data sets

  • Data.gov: Data from multiple US government agencies can be downloaded here, ranging from government budgets to school test scores. Note that much of this data requires further research before it can be used.
  • Food Environment Atlas: Data on how local food choices affect diet in the US.
  • School system finances: A survey of the finances of school systems in the US.
  • Chronic disease data: Data on chronic disease indicators across regions of the US.
  • National Center for Education Statistics: Data on educational institutions and education demographics, covering the US and some other parts of the world.
  • UK Data Service: The UK's largest collection of social, economic and population data.
  • Data USA: A comprehensive visualization of US public data.
  • National Bureau of Statistics of China.

Finance and economy

  • Quandl: A good source of economic and financial data, which helps to build models for predicting economic indicators or stock prices.
  • World Bank Open Data: Global demographic data, as well as a large number of data sets of economic and development indicators.
  • International Monetary Fund data: Data published by the IMF on international finances, debt rates, foreign exchange reserves, commodity prices and investments.
  • Financial Times Market Data: The latest information from financial markets around the world, including stock price indices, commodities and foreign exchange.
  • Google Trends: Data on Internet search behavior and popular news reports around the world.
  • American Economic Association: US macroeconomic data.


Source: mlmemoirs, "50 best public data sets for machine learning"

Note: Some of the URLs may require a VPN to open.
