So what is this "Big Data" thing all about, and should I care?

The world is awash in data. In fact, the amount of data created daily has increased so much that 90% of the data in existence today was generated and stored in the last 2 years. This is enabling new and deeper insights and spawning a discipline, generally referred to as data science, focused on getting the most from that data. My goal with this article is to provide a high-level overview of what big data is and why it is getting so much press.

But first, why should we even care about data? Data helps us make better predictions, and I would argue that making a prediction and then observing what happens is at the core of learning and developing new insights. Typically your prediction would be informed by your past experiences and your beliefs about the future, and when you see how different the result was from your prediction you receive feedback on the accuracy of those beliefs. This feedback can then be used to update your understanding of the relationship between the information you have and the outcomes you want to predict.

There are two ways "more" data can improve your ability to make predictions: the first is by having more observations, and the second is by observing more individual things.

Having more observations is similar to having more experience in a given field. More observations in your dataset can give you more confidence in the relationship between a descriptive variable and the outcome you care about. How varied the records are can be important as well; for instance, a bank whose data points all occurred during "good times" might be caught off guard in a recession because the outcomes would be fundamentally different from anything its dataset would suggest. This is often referred to as long data, as each incremental observation is stored as a row in the dataset.

Observing more individual things allows you to refine your prediction based on new factors. For instance, if you were trying to predict how many home runs a baseball player was going to hit in the upcoming season, you might look at how many they hit in previous seasons and have a pretty good prediction. However, if you also looked at their age, you could see whether their production was more likely to be increasing or decreasing. You could also incorporate whether they were playing through an injury in a previous season, which would help you understand that season's performance more fully. This is often referred to as wide data, as the new variables are often stored as columns in the dataset.
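To make the long versus wide distinction concrete, here is a minimal sketch in Python. The player records, field names, and helper functions are all made up for illustration; the only point is that a new observation adds a row, while a new variable adds a column.

```python
# Hypothetical records: every value here is invented for illustration.
seasons = [
    {"season": 2013, "home_runs": 22},
    {"season": 2014, "home_runs": 27},
]


def add_observation(rows, new_row):
    """Make the dataset longer: one more row, same columns."""
    return rows + [new_row]


def add_variable(rows, name, values):
    """Make the dataset wider: one more column on every existing row."""
    return [{**row, name: value} for row, value in zip(rows, values)]


longer = add_observation(seasons, {"season": 2015, "home_runs": 25})
wider = add_variable(seasons, "age", [27, 28])

print(len(longer), "rows,", len(longer[0]), "columns")
print(len(wider), "rows,", len(wider[0]), "columns")
```

The longer dataset gives more experience with the same question, while the wider one gives a new factor to refine the prediction with.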

In future articles I will address how to take better advantage of data generally and how to think about where a person or organization should focus their testing and learning.

Short-Term Focus or The Discount Rate of the Stock Market

The recent US presidential election offers an interesting case study in how an event can move the stock prices of firearm manufacturers in a counterintuitive direction. Below is a graph plotting the stock prices of both Ruger and Smith & Wesson in the weeks leading up to, and immediately after, the recent election. Given that the incoming president would likely have a major impact on gun control in the US, not only through policy but also through appointing at least one justice to the Supreme Court, it would make sense that the impact of the election would be large. What I find particularly interesting is that a Republican victory in both Congress and the presidency seems less likely to result in gun control, and in turn good for gun manufacturers, yet it resulted in lower stock prices.

I suspect it is non-controversial, but it is worth stating that the market value of an investment is what people are willing to pay today for cash flows throughout its future. People typically value cash flows nearer to today more highly than cash flows of the same size further in the future due to inflation, uncertainty about the future, and the opportunity for reinvestment. The rate at which someone values future cash flows less than near-term cash flows is called the discount rate and is usually expressed as a percentage (e.g., if I would consider $110 one year from now the same as $100 today, my discount rate would be 10%).
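As a rough illustration of the arithmetic, here is a minimal Python sketch of the $100 versus $110 example. The function names are my own, and it assumes simple annual compounding, which is only one of several ways to discount.

```python
def implied_discount_rate(value_today, value_in_one_year):
    """Rate at which a payment one year out is worth the same as value_today."""
    return value_in_one_year / value_today - 1


def present_value(cash_flow, rate, years):
    """Value today of a cash flow received `years` from now (annual compounding)."""
    return cash_flow / (1 + rate) ** years


rate = implied_discount_rate(100, 110)
print(f"discount rate: {rate:.0%}")

# The same $110 is worth less and less today the further out it arrives.
for years in (1, 5, 10, 20):
    print(years, "years out:", round(present_value(110, rate, years), 2))
```

The higher the discount rate, the faster distant cash flows shrink toward nothing in today's terms.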

An article published by the New York Times on spikes in background checks for firearm purchases after major events that seem likely to increase calls for more restrictive gun control helps shed some light on this. The data in the article show that gun sales spiked after 9/11, the first and second Obama elections (the second of which closely coincided with the Sandy Hook school shooting), and during a rash of mass shootings in December 2015, including the attack in San Bernardino. If the market was pricing in a spike in sales which would not materialize due to Republicans taking both Congress and the presidency, that would explain why the price of the stock would decrease, but only if the long-term prospects were not impacted or did not matter in their valuations.

An implication of this is that the market is valuing near-term cash flows so highly relative to long-term flows that extreme events, like tighter restrictions on gun ownership, don't materially impact the price of an asset. This has the benefit of holding managers accountable for delivering results and providing grounded guidance on business performance, but runs the risk of discouraging investments with longer-term payoffs in favor of more certain, near-term gains for publicly traded companies. If that risk materializes broadly, it could lead to more volatility for workers as companies rise and fall more quickly, and make public markets less attractive for innovative companies.

Oil Prices, Developing Economies, and the Health of the Global Economy

The positive correlation between the price of oil and the performance of stocks has been in the press quite a bit recently, primarily because it seems counterintuitive to a lot of people. This seems to me to be driven by a shift in expectations around what will drive global economic growth in the medium term, namely the continued growth of developing countries.

First, we should test the hypothesis that the relationship between the price of oil and the performance of the stock market has changed. To do so, I compared the monthly change in the spot price of oil from Jan-1986 through Jan-2016 to the monthly change in level of the S&P 500 index, and repeated the comparison for Jan-1996 through Jan-2016, and finally Jan-2006 through Jan-2016. The results are summarized in the table below: 

Analysis Period | Likelihood that the relationship between changes in the price of oil and the S&P 500 is not random | Percentage of the change in oil price reflected in the change in the S&P 500
1986 - 2016 | 39.36% (Effectively Random) | N/A as relationship appears random
1996 - 2016 | 92.34% | 5.97%
2006 - 2016 | 99.70% | 17.22%
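For readers who want to try a comparison like this themselves, here is a minimal Python sketch. It is not the exact method behind the table: I am assuming the second column is roughly one minus a p-value and the third is an R-squared, and I have swapped in a permutation test for the significance check so the example needs only the standard library. The data are synthetic, not real oil or S&P 500 figures.

```python
import random


def pct_changes(levels):
    """Month-over-month percentage change of a series of price levels."""
    return [(b - a) / a for a, b in zip(levels, levels[1:])]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Share of random shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


# With real data you would start from price levels, e.g. three made-up months:
print(pct_changes([100.0, 110.0, 99.0]))

# Synthetic monthly changes, NOT real oil or S&P 500 data.
rng = random.Random(42)
oil = [rng.gauss(0, 0.08) for _ in range(120)]
stocks = [0.2 * o + rng.gauss(0, 0.04) for o in oil]

r = pearson_r(oil, stocks)
p = permutation_p_value(oil, stocks)
print(f"R^2 = {r * r:.2%}, 1 - p = {1 - p:.2%}")
```

Running the same functions over different date windows is one way to see whether the relationship has strengthened over time.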

The main insight for me is that for at least the past 10 years the price of oil and the performance of the stock market have almost certainly been correlated, and positively so. With that in mind, the next question is why this would be the case. The World Bank has the % of GDP made up by oil rents (think of this as the net profit of oil production) for the US at 0.9% for the period of 2011-2013 (the latest period for which data is available) and 0.4% - 0.7% for 1986-1990, so while oil production has become a bigger part of the US economy, that does not seem to explain most of the phenomenon.

All of this seems to point to two things: first, that changes in demand for oil are a better indicator of changes in broad global demand, and second, that global demand has a direct, material impact on the health of the American economy that did not exist a couple of decades ago. Assuming that is the case, I believe this changes the way we think about both our economic well-being and our place in the world more generally in the following ways:

  1. The return on the development of underdeveloped countries is shared between the developed world and developing countries more so than it was in the past
  2. As a result it makes sense that we should invest both private and public funds to foster that development, because if this is true for us it is also true for other wealthy countries, and if we do not they will
  3. If other countries and enterprises make that investment they will not only reap the economic rewards measured by this analysis, but will also likely win influence with countries that will make valuable allies over the long run

Health Care Costs, Externalities, and the Failures of Markets

The classic example of markets failing to price things is where a product creates pollution. The idea is that the benefit of the use of the product is concentrated with the purchaser and supplier, but the negative impacts of the product or its use are spread across a broader population. For example, when purchasing gasoline for your car it is very straightforward for you to understand the benefit of being able to use your vehicle and compare that to the price being charged by the supplier. However, unless you are more altruistic than most, you probably don't factor in the extra carbon and carcinogens in the air which will negatively impact the environment.

The recent news about investors purchasing drug companies and significantly increasing prices got me thinking about the opposite circumstance, where there are significant benefits for the broad population that don't seem to be taken into account by the market. From a purely economic standpoint the tactic is rational. There are barriers to entry for many drugs even after the patent expires, especially in the near term, and significant demand for the product which increases with the severity of the disease. Given only that, you would expect the producer to have significant pricing power, as demand for the product would likely be inelastic and substitutes would not be readily available.

However, the price set by the transaction between the producer and the consumer ignores positive effects for the rest of the public, including that the disease is less likely to spread, a healthier population is likely to be more productive, and economic mobility would likely increase as health issues tend to be concentrated in lower socioeconomic brackets. Measuring these benefits is difficult, as highlighted in a recent joint article between the Harvard Business Review and The New England Journal of Medicine; however, if we are interested in properly balancing resources between production and consumption of health care, it is something we have to try to do.

Even armed with perfect data on the net size of the positive externality, it is unclear how best to act on that information. Subsidizing health care combined with regulation of health care costs could solve the issue, given perfect information and decision making based solely on that information. However, in a world of government lobbying and imperfect information the subsidy is likely to be larger than the societal benefit due to the concentration of suppliers as compared to the overall population.

A Case Study of Apple & Customer Satisfaction

As I work on a data set to help answer some questions from my original post on customer satisfaction, I wanted to do a quick case study on Apple and its journey to differentiate itself from other hardware manufacturers in the early 2000s.

The first thing I wanted to check was whether the change in customer satisfaction for Apple corresponded with an increase in its stock's performance. As you can see in the chart below, the market began to value Apple much more highly as it differentiated itself from its peers. It is impossible to say whether the differential satisfaction caused the increase in valuation, was just one part of the cause, or was merely correlated, but if you accept the hypothesis that differential customer satisfaction drives differential performance, the data are hard to ignore.

This relationship seems to hold true as well when you model the stock price indexed to 1995 against a couple of customer satisfaction metrics:

Independent Variable | P-Value | Significant?
Apple Customer Satisfaction Relative to Industry | .0053 | Yes
Apple Customer Satisfaction | .00008 | Yes

Assuming you accept the argument that Apple's differential performance was at least partially caused by the increase in customer satisfaction the next step is to try to understand what caused the increase in satisfaction and whether that insight can be applied to other businesses. Based on what I know of Apple my hypothesis was that the increased customer satisfaction was likely driven by a combination of product design, branding, and possibly the ability to get help and try the products at Apple stores. Below are the high-level proxy metrics I used for the modeling exercise. I also lagged the metrics and used transformations of them:

Potential Driver | Proxy Metric(s) | Notes
Product Design | R&D Expense | Directional over time, as the impact of this probably has a long tail. Also, success rates of projects are important
Branding | Advertising Expense | Does not measure the effectiveness of advertising in driving a brand that customers like, but is at least a measure of investment
Apple Store Availability | # of Apple Stores | Missing data for 2002-2004; filled in with a straight-line interpolation from 2 stores in 2001 to 116 in 2005. This is probably a proxy for a renewed focus on the interaction between Apple and its customers
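The straight-line fill for the missing store counts can be sketched in a few lines of Python. The function name is my own, and the only inputs are the two endpoints given above (2 stores in 2001, 116 in 2005); the in-between values are estimates, not actual store counts.

```python
def interpolate_linear(start_year, start_value, end_year, end_value):
    """Fill every year between two known points along a straight line."""
    step = (end_value - start_value) / (end_year - start_year)
    return {
        year: start_value + step * (year - start_year)
        for year in range(start_year, end_year + 1)
    }


stores = interpolate_linear(2001, 2, 2005, 116)
for year, count in stores.items():
    print(year, count)
# 2001 2.0
# 2002 30.5
# 2003 59.0
# 2004 87.5
# 2005 116.0
```

A straight line assumes stores opened at a constant pace of 28.5 per year, which is a simplification; the real rollout was probably lumpier.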

When modeling these metrics up to a two-way interaction with a target of Apple's overall customer satisfaction you get the following results:

Independent Variable | P-Value | Significant?
Interaction between Advertising Expense & the # of Apple Stores | .00510 | Yes(ish)
Advertising Expense | .3406 | No, but included in the model because it is a part of a significant interaction
Accrued Marketing Expense | .0801 | Yes(ish)
R&D Expense | .0814 | Yes(ish)
# of Apple Stores | .000051 | Yes

As you can see in the table above, the data suggest there was not a single "silver bullet" in Apple's case that drove higher customer satisfaction; rather, a combination of factors was correlated with the increase. Of course, this only shows that these metrics are correlated, but this is probably the best we can hope for until an organization is both willing to drive a negative experience and to share that it did so, along with the data.

A Framework for Thinking about Agriculture

Why build a framework for thinking about strategies and practices in agriculture?

As the population of the earth continues to grow it becomes more important to think broadly about the impact of decisions impacting agriculture, and the risks associated with those decisions. In order to structure that thinking I believe it is valuable to develop an overarching framework to compare decisions and analyses, which is what I hope to do with this article.

At its simplest, one could argue that agriculture should work to feed people and provide income to producers while minimizing negative impacts on the environment. However, if you take that argument to its extreme I think you quickly get to a place where everyone would be growing and consuming similarly efficient and nutritive things. While that would likely optimize for the framework above, I believe that food is a cornerstone of many diverse cultural identities which would be negatively impacted by standardization. The fact that market decisions can both positively and negatively impact outcomes makes weighing recommendations more difficult, but also makes for an interesting problem to solve.

That said, my proposed framework for thinking about strategies and practices in agriculture is how those decisions impact:

Feeding Humanity + Cultural Identity + Utility Gained by Consumers + Profit for Producers – Risk ± Externalities.

Some aspects of the framework are relatively easy to measure, like how well the system is feeding humanity, while others are impossible to measure precisely and difficult to know the long-term impacts of, like risk or externalities. One of the easiest aspects to measure is the profits made by producers, and I would argue that factor has been the major driver of decision-making in modern agriculture. I believe this has led to both positive and negative outcomes. For instance, automation of production has increased the efficiency of food production lowering costs for consumers, and making it possible for more people to meet their nutritional needs, but it also tends to rely on monoculture (single crops covering large areas) which likely increases risk in the system as humanity becomes more reliant on a less diverse set of crops and animals.

It is also likely that the coefficient, or importance, of each aspect in the framework changes depending on the levels of the other aspects. For instance, if externalities like air or water pollution are relatively low, but there are a large number of people going hungry, there is likely room to optimize the system by accepting higher levels of pollution to ensure food security for more people. That said, as the amount of negative externalities increases, the perceived value of marginal cultural identity or producer profits would likely be impacted.
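One way to read the framework is as a weighted sum, which can be sketched in Python as below. This is only an illustration under my own assumptions: the field names, the arbitrary scoring scale, and the fixed weights are all hypothetical, and fixed weights are a simplification given that the importance of each term likely shifts with the levels of the others.

```python
from dataclasses import dataclass

TERMS = (
    "feeding_humanity",
    "cultural_identity",
    "consumer_utility",
    "producer_profit",
    "risk",
    "externalities",
)


@dataclass
class Assessment:
    """Scores for one strategy on an arbitrary, illustrative scale."""
    feeding_humanity: float
    cultural_identity: float
    consumer_utility: float
    producer_profit: float
    risk: float           # always subtracted
    externalities: float  # signed: positive helps, negative hurts


def framework_score(a, weights=None):
    """Weighted sum of the framework's terms; each weight defaults to 1.0."""
    w = {name: 1.0 for name in TERMS}
    w.update(weights or {})
    return (
        w["feeding_humanity"] * a.feeding_humanity
        + w["cultural_identity"] * a.cultural_identity
        + w["consumer_utility"] * a.consumer_utility
        + w["producer_profit"] * a.producer_profit
        - w["risk"] * a.risk
        + w["externalities"] * a.externalities
    )


# Hypothetical strategy: strong on feeding people, some risk, some pollution.
option = Assessment(5, 2, 3, 4, 2, -1)
print(framework_score(option))                 # equal weights
print(framework_score(option, {"risk": 2.0}))  # a more risk-averse weighting
```

Changing the weights is a simple way to see how sensitive a recommendation is to how much you care about each term.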

There are also systemic factors that can push agriculture away from a design that would optimize for the proposed framework. These include, but I am sure are not limited to, a concentration of both suppliers and producers, the short-term focus of investors, and markets' inability to account for externalities.

A short discussion of risk in agriculture

One example of this is the use of chemical pesticides and fertilizers. While the use of these chemicals has increased the efficiency of food production it has likely come at the cost of the long-term health of farm workers as well as consumers. A study by the National Cancer Institute found that while farmers are generally healthier than the broader population they had elevated incidences of lymphoma, leukemia, and a multitude of cancers. Also, as Mark Bittman called out, we tend to allow the use of chemicals in agriculture either because we are not certain that they cause cancer or because we are not willing to disrupt production to reduce risk.

Some more examples where I believe we have focused on near-term returns at the cost of risk to the system are monoculture crop growth and the standardization of crop types. The New Yorker ran a great article a couple of years ago on bananas and a fungus that threatens them, which now seems prescient as we are facing a shortage of this crop. There are thousands of varieties of bananas, but more than 99% of exported bananas are of a single variety, the Cavendish. The Cavendish is great for export because it ripens slowly and doesn’t bruise easily, making it ideal for long(ish) voyages to your grocer. However, a fungus called Tropical Race Four has been decimating Cavendish banana crops across the world, and since the system is reliant on a single variety there just is not enough variety for banana stocks to rebound.

A more frightening example of the risks of reducing the diversity of the genetics in people’s diets is the Great Famine in Ireland in the mid-1800s. There were a number of factors that led to a dependence on the potato crop by the Irish people, but when a blight decimated the potato crop the impact on individuals and the country was extreme. Scholars believe that 500,000 – 1,500,000 died as a result of the famine, and another 1,000,000 emigrated, mainly to North America. Historians also view this event as a watershed moment that led to Irish Republicanism, and may have contributed to later turmoil and the eventual independence of Ireland.

Consolidation of the industry

I believe that one of the main reasons we have ended up where we are is that the industry has consolidated to a point where the focus on near-term profits dominates most offerings. This is likely due to investors skewing their focus to near-term results, as well as markets typically pricing in the effects of externalities only when they become apparent, rather than when they are “accrued”.

The consolidation in the industry can be seen in the graphs below highlighting the four-firm and eight-firm concentration (i.e. how much of global sales are concentrated at the top 4 or 8 firms), as well as the growth in sales for the industry as a whole vs. the growth in sales at the top four firms by size.
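For anyone unfamiliar with concentration ratios, here is a minimal Python sketch of how a four-firm or eight-firm ratio is computed. The firm names and sales figures are hypothetical and are not taken from the USDA data.

```python
def concentration_ratio(sales_by_firm, top_n):
    """Share of total industry sales held by the top_n largest firms."""
    total = sum(sales_by_firm.values())
    largest = sorted(sales_by_firm.values(), reverse=True)[:top_n]
    return sum(largest) / total


# Hypothetical sales figures (arbitrary units), invented for illustration.
sales = {
    "Firm A": 30, "Firm B": 20, "Firm C": 15, "Firm D": 10, "Firm E": 8,
    "Firm F": 6, "Firm G": 4, "Firm H": 3, "Firm I": 2, "Firm J": 2,
}

print(f"four-firm concentration:  {concentration_ratio(sales, 4):.0%}")
print(f"eight-firm concentration: {concentration_ratio(sales, 8):.0%}")
```

A ratio that climbs over time means the largest firms are capturing a growing share of the industry's sales.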

Source: USDA, Economic Research Service

As you can see the largest firms in agriculture have consolidated control over much of the industry which likely drives some of the issues highlighted above. This is likely to remain the case as sales growth is concentrated at the largest firms, albeit mainly due to acquisitions.

My goal is to continue to expand on this framework and develop an understanding of where there is leverage against the different components of this framework. By gathering and developing information on these topics it is my hope that a constructive discussion can be had in the industry, and that agriculture can continue to develop to better meet the world’s needs. In order to do so I would like to get as much input and feedback on this way of thinking as I can to help inform my thinking, and where I should dig deeper.

Customer Satisfaction's Impact on Revenue for Airlines

This is the first in an ongoing series of articles on customer satisfaction and its impact on consumers and businesses. I find the question of how much to invest in customer satisfaction and experience intriguing because on its surface it seems simple (customer satisfaction is good, right?), but it is very difficult to build a model describing how much you should be worried about different levels of satisfaction, and what you should be doing about it.

To start with I looked at the airline industry, because it has a reputation for poor customer satisfaction, and the firms in the industry have highly variable profitability. Furthermore, I consistently read stories about airlines like Spirit or Ryanair which have consistently poor customer satisfaction scores, but are profitable because low ticket prices keep planes full and they then charge for any incremental service provided, such as checking a bag or choosing your seat.

My first goal is to look at customer satisfaction scores and sales metrics for airlines to determine if the two things can be tied to one another for airlines specifically. Secondly, I am hoping that this analysis will provide some insight into customer satisfaction more generally.

I started by looking at the average customer satisfaction for all of the airlines in the American Customer Satisfaction Index (ACSI). I was surprised by how volatile the metric was, but also encouraged because I had a wide set of data.

From here I looked at customer satisfaction against revenue for each year / airline combination, and was excited when the relationship was statistically significant (p-value = .0012). Unfortunately, the coefficient of the regression is negative, and I believe the effect I am detecting is that customer satisfaction differs across firms of differing sizes, as you can see in the chart below:

As you can see the relationship between customer satisfaction and revenue is negative, but when you layer on which airline corresponds to which data point it appears that the smaller airlines are actually the ones providing better service.

This poses a few interesting questions, some of them in tension with, if not contradictory to, each other. First, do small firms do a better job at providing an experience that leads to higher satisfaction? Second, does it make sense for firms to sacrifice customer satisfaction for other growth factors? Finally, looking at the chart above with the airlines associated with each data point, it appears that revenue may have a positive relationship with satisfaction for a given firm, but that when you look at the industry as a whole the relationship flips. There are not enough data points for each airline to get really comfortable with any sort of read the analysis would provide, but I decided to give them a look in case it spurred any thoughts.
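The flip between the per-firm and industry-wide relationship is easy to reproduce with made-up numbers. The Python sketch below uses synthetic (satisfaction, revenue) pairs for three hypothetical carriers, not the ACSI or revenue data from this analysis, and a hand-rolled least-squares slope so it needs no libraries.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic (satisfaction score, revenue) pairs, invented for illustration.
data = {
    "Large carrier": [(60, 40.0), (62, 41.0), (64, 42.0)],
    "Mid carrier": [(68, 25.0), (70, 26.0), (72, 27.0)],
    "Small carrier": [(78, 10.0), (80, 11.0), (82, 12.0)],
}

# Within each carrier, higher satisfaction goes with higher revenue.
within = {
    name: slope([s for s, _ in pts], [r for _, r in pts])
    for name, pts in data.items()
}
for name, value in within.items():
    print(f"{name}: slope = {value:+.2f}")

# Pooling all carriers together reverses the sign of the relationship.
pooled_points = [pt for pts in data.values() for pt in pts]
pooled = slope([s for s, _ in pooled_points], [r for _, r in pooled_points])
print(f"All carriers pooled: slope = {pooled:+.2f}")
```

In this toy data the pooled slope is negative only because the larger carriers sit at lower satisfaction levels, which is the kind of size effect that would need to be controlled for before drawing conclusions.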

Of the three airlines I looked at (American, Delta, and Southwest), two did not show a statistically significant relationship between customer satisfaction and revenue, while Southwest did show a significant positive relationship (p-value = .0203).

Airline | Coefficient on Customer Satisfaction | P-Value
American Airlines | -0.1556 | 0.9997
Delta | -312.7857 | 0.7239
Southwest | 909.3220 | .0203

If the only thing impacting revenue were customer satisfaction, then we could conclude that for Southwest each full point change in the customer satisfaction score would lead to a change in revenue of about $900 million. It is more likely that customer satisfaction is only a part of the story, but given the potential impact of understanding the relationship between customer experience and business outcomes, it is likely worth the effort.

The next steps/analyses are:

  1. Does firm size impact or correlate with overall satisfaction?

  2. What are the main drivers of customer satisfaction?

  3. Are there industries where the relationship between customer experience and business outcomes is clearer?