The awesome section presents collections of high-quality datasets organized by topic.

Dec 2017: we’re just starting out and this is still “under development”. Please jump in and help us improve.

Machine Learning / Statistical

Machine learning is used here as a general term for computational data analysis: using data to make inferences and predictions. Interpreted broadly, it includes computational statistics, data analytics, data mining and a good portion of data science.

Machine learning algorithms are often categorized as supervised or unsupervised (“data mining”); semi-supervised and reinforcement learning are commonly listed alongside them.

  • Supervised machine learning algorithms learn from labeled examples and apply what they have learned to new data in order to predict outcomes. Starting from a known training dataset, the learning algorithm produces an inferred function that predicts output values. After sufficient training, the system can provide targets for any new input. The algorithm can also compare its output with the correct, intended output and use the errors to modify the model accordingly.

  • In contrast, unsupervised machine learning algorithms are used when the training data is neither classified nor labeled. Unsupervised learning studies how systems can infer a function that describes hidden structure in unlabeled data. The system is not given a right output to reproduce; instead it explores the data and draws inferences about that hidden structure.

  • Semi-supervised machine learning algorithms fall between supervised and unsupervised learning: they use both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Semi-supervised learning is usually chosen when labeling data requires skilled and relevant resources, whereas acquiring unlabeled data generally does not require additional resources.

  • Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are its most relevant characteristics. The method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback, known as the reinforcement signal, is required for the agent to learn which action is best.

source
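
The difference between the supervised and unsupervised settings described above can be made concrete with a small sketch. The data here is synthetic, and the function names (`predict_nearest`, `two_means`) are hypothetical helpers written for this illustration rather than part of any library. A nearest-neighbour classifier stands in for supervised learning and a two-cluster k-means for unsupervised learning; real projects would more likely use an established library.

```python
# Toy 1-D data: two loose groups of values (synthetic, for illustration only).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]


def predict_nearest(train_x, train_y, x):
    """Supervised: return the label of the closest labeled training point (1-NN)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def two_means(xs, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (k-means with k=2)."""
    centers = [min(xs), max(xs)]  # simple deterministic initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            # Assign each value to the nearer of the two centers.
            idx = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[idx].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


if __name__ == "__main__":
    # Supervised: labels are used to predict the class of a new value.
    print(predict_nearest(points, labels, 4.5))
    # Unsupervised: labels are ignored; structure is inferred from the values alone.
    print(two_means(points))
```

The supervised function needs `labels` to make a prediction, while the unsupervised one never sees them and only recovers the grouping that is already present in the values.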

Datasets

There are a variety of machine-learning datasets on the DataHub under the @machine-learning account: https://datahub.io/machine-learning

  • Seismic Bumps: https://datahub.io/machine-learning/seismic-bumps. This is a classification problem: forecasting high-energy (higher than 10^4 J) seismic bumps in a coal mine. The data come from two longwalls located in a Polish coal mine.

Existing collections

Wealth, Income and Inequality

Question: how have the incomes of different social groups changed over time?

More specifically: how has real income changed by quintile (plus the top 5% and top 1%) over the last 40+ years in the US?
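
Once nominal income figures per group and a price index are available, the "real change" asked about here is a short calculation. The sketch below assumes a CPI-style index is used as the deflator. All numbers are placeholders rather than real statistics, and the function names are hypothetical.

```python
def real_income(nominal, cpi_year, cpi_base):
    """Convert a nominal amount into base-year dollars using a price index."""
    return nominal * cpi_base / cpi_year


def real_change(nominal_start, nominal_end, cpi_start, cpi_end):
    """Fractional change in real income between two years (0.10 means +10%)."""
    # Express the start-year income in end-year dollars, then compare.
    start_in_end_dollars = real_income(nominal_start, cpi_start, cpi_end)
    return nominal_end / start_in_end_dollars - 1.0


# PLACEHOLDER values, not real statistics: group -> (start-year, end-year) income.
placeholder_incomes = {
    "bottom quintile": (10_000, 22_000),
    "top 5%": (100_000, 300_000),
}
PLACEHOLDER_CPI_START, PLACEHOLDER_CPI_END = 100.0, 200.0

if __name__ == "__main__":
    for group, (start, end) in placeholder_incomes.items():
        change = real_change(start, end, PLACEHOLDER_CPI_START, PLACEHOLDER_CPI_END)
        print(f"{group}: {change:+.1%}")
```

Swapping the placeholder dictionary for figures from one of the sources described below would give the actual per-group changes; the choice of deflator affects the result and should be stated alongside it.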

US

The most widely used sources of data and statistics on household income and its distribution are the annual survey of households conducted as part of the Census Bureau’s Current Population Survey (CPS) and the Internal Revenue Service’s (IRS) Statistics of Income (SOI) data compiled from a large sample of individual income tax returns. The Census Bureau publishes annual reports on income, poverty, and health insurance coverage in the United States based on the CPS data,[2] and the IRS publishes an annual report on individual income tax returns based on the SOI.[3] While the Federal Reserve also collects income data in its triennial Survey of Consumer Finances (SCF),[4] the SCF is more valuable as the best source of survey data on wealth.

Each agency produces its own tables and statistics and makes a public-use file of the underlying data available to other researchers. In addition, the Congressional Budget Office (CBO) has developed a model that combines CPS and SOI data to estimate household income both before and after taxes, as well as average taxes paid by income group back to 1979.[5] Economists Thomas Piketty and Emmanuel Saez have used SOI data to construct estimates of the concentration of income at the top of the distribution back to 1913.[6] That work has been expanded recently to examine trends in wealth concentration.[7] CBO and Piketty-Saez regularly release reports incorporating the latest available data. source

Emmanuel Saez and Gabriel Zucman, “Wealth Inequality in the United States since 1913: Evidence from Capitalized Income Tax Data,” Quarterly Journal of Economics, Vol. 131, No. 2, May 2016, http://eml.berkeley.edu/~saez/SaezZucman2016QJE.pdf.

Climate Change

See Climate Change Page

Property Prices

Property prices including house prices

Interesting paper with long-run time series across the world: http://voxeu.org/article/home-prices-1870. However, it has no data link as far as I can tell.

Inflation

Football

See Football Page

Linked Open Data

See Linked Open Data Page

World Bank

See World Bank Page

War and Peace

Data on inter-state conflicts, international relations and other correlates of war including war-related deaths and injuries.

See War and Peace Page

If you have suggestions, questions or feedback, join our chat channel or open an issue on our issue tracker.