So what is this "Big Data" thing all about, and should I care?

The world is awash in data. In fact, the amount of data created daily has increased so much that 90% of the data in existence today was generated and stored in the last 2 years. This is enabling new and deeper insights and spawning a concentration, generally referred to as data science, focused on getting the most from that data. My goal with this article is to provide a high-level overview of what big data is and why it is getting so much press.

But first, why should we even care about data? Data helps us make better predictions, and I would argue that making a prediction and then observing what happens is at the core of learning and developing new insights. Typically your prediction is informed by your past experiences and your beliefs about the future, and when you see how far the result was from your prediction, you receive feedback on the accuracy of those beliefs. This feedback can then be used to update your understanding of the relationship between the information you have and the outcomes you want to predict.

There are two ways "more" data can improve your ability to make predictions: the first is by having more observations, and the second is by observing more individual things.

Having more observations is a lot like having more experience in a given field. More observations in your dataset can give you more confidence in the relationship between a descriptive variable and the outcome you care about. How varied the records are can be important as well; for instance, a bank whose datapoints all come from "good times" might be caught off guard in a recession because the outcomes would be fundamentally different from anything its dataset suggests. This is often referred to as long data because each incremental observation is stored as a row in the dataset.
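To make the "more confidence" idea concrete, here is a minimal sketch in Python using the standard error of the mean, one common way to measure how uncertain an estimate is. The population below is simulated and the names (standard_error, population, small, large) are my own for illustration; nothing here comes from a real bank or dataset.

```python
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5


random.seed(0)  # fixed seed so repeated runs give the same numbers

# Simulated values standing in for some outcome we care about
# (an assumption for illustration, not real data).
population = [random.gauss(0.05, 0.02) for _ in range(100_000)]

small = population[:25]     # a short dataset: few rows
large = population[:2500]   # a longer dataset: 100x as many rows

print("standard error with 25 rows:  ", standard_error(small))
print("standard error with 2500 rows:", standard_error(large))
```

With 100 times as many rows, the standard error should come out roughly 10 times smaller, since it shrinks with the square root of the number of observations. Note that this only helps if the extra rows are as varied as the situations you want to predict; more rows from the same "good times" would not have warned the bank about a recession.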

Observing more individual things allows you to refine your prediction based on new factors. For instance, if you were trying to predict how many home runs a baseball player was going to hit in the upcoming season, you might look at how many they hit in previous seasons and have a pretty good prediction. However, if you also looked at their age, you could see whether their production was more likely to be increasing or decreasing. You could also incorporate whether they were playing through an injury in a previous season, which would help you understand that season's performance more fully. This is often referred to as wide data because each new variable is stored as a column in the dataset.
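The difference between long and wide can be sketched with a tiny table in Python. The records below are made up purely for illustration (they are not real player statistics), and the helper name shape is my own: adding an observation adds a row, while adding a variable adds a column.

```python
# Made-up records for illustration only; not real player statistics.
seasons = [
    {"player": "A", "season": 1, "home_runs": 20},
    {"player": "A", "season": 2, "home_runs": 24},
]


def shape(rows):
    """Return (number of rows, number of columns) for a list of dict records."""
    columns = {key for row in rows for key in row}
    return len(rows), len(columns)


# Longer: one more observation adds a row.
longer = seasons + [{"player": "A", "season": 3, "home_runs": 22}]

# Wider: new variables (age, injury status) add columns to each existing row.
extra = {
    1: {"age": 27, "played_injured": False},
    2: {"age": 28, "played_injured": True},
}
wider = [{**row, **extra[row["season"]]} for row in seasons]

print(shape(seasons))  # (2, 3)
print(shape(longer))   # (3, 3)
print(shape(wider))    # (2, 5)
```

The row count grows in the long case and the column count grows in the wide case, which is all the two terms really mean.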

In future articles I will address how to take better advantage of data generally and how to think about where a person or organization should focus their testing and learning.