**Introduction**

There is little transparency in what hospitals charge patients or in the rationale behind those charges. However, the recent creation of large data sets (CMS, Health Indicators, Healthdata.gov, Better Pill) may allow us to better understand the drivers of healthcare costs. Throughout the rest of this post, I will refer to these data as performance data.

**Guiding Question**

Do hospitals that score higher on these performance data charge more for their services?

To make some headway in this, let's rephrase the question more mathematically. We have, broadly speaking, two classes of data:

- the prices for each procedure: let's call them $\Pi$
- the performance data: let's call them $P$

Our guiding question becomes: **can the variance in $P$ explain the variance in $\Pi$?** That is, are differences in price due to differences in performance?
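Before going further, it may help to see what "variance in $P$ explaining variance in $\Pi$" looks like concretely. Here is a minimal sketch using entirely hypothetical synthetic data (none of these numbers come from the real data sets): we generate performance variables $P$, a price vector $\Pi$ partly driven by them, and compute $R^2$, the fraction of price variance that a linear fit to performance explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 performance variables for 200 hospitals (P),
# and a price vector (Pi) that is partly driven by performance plus noise.
P = rng.normal(size=(200, 5))
true_weights = np.array([3.0, -1.0, 0.0, 2.0, 0.5])
Pi = P @ true_weights + rng.normal(scale=2.0, size=200)

# Ordinary least squares of Pi on P (with an intercept column).
X = np.column_stack([np.ones(len(P)), P])
coef, *_ = np.linalg.lstsq(X, Pi, rcond=None)
residuals = Pi - X @ coef

# R^2: the share of price variance explained by performance.
r_squared = 1 - residuals.var() / Pi.var()
print(f"R^2 = {r_squared:.2f}")
```

Because we built $\Pi$ mostly from $P$ here, $R^2$ comes out high; with real hospital data the whole question is how high it actually is.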

**Background**

We want to analyze the variance. But leaping straight to *ANOVA*, the *AN*alysis *O*f *VA*riance, would be a misapplication of it. *ANOVA* produces the most reliable models when the inputs (in our case, the performance variables) are independent, and we don't know whether ours are. We could apply techniques such as Principal Components Analysis to create a version of our data with independent variables. But before we reach for increasingly esoteric approaches, let's characterize the problem more explicitly.
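To make the PCA remark concrete, here is a small sketch on hypothetical data (the correlated columns are invented for illustration): we build three performance variables where two are nearly redundant, then rotate onto the principal axes. The resulting components are mutually uncorrelated, which is exactly the independence-like property the raw variables lacked.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical performance variables: the first two columns are nearly
# redundant copies of the same underlying signal, the third is unrelated.
base = rng.normal(size=(300, 1))
P = np.column_stack([
    base + 0.1 * rng.normal(size=(300, 1)),
    base + 0.1 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])

# PCA via the eigenvectors of the covariance matrix.
Pc = P - P.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Pc, rowvar=False))
components = Pc @ eigvecs  # data expressed on the principal axes

# The components' covariance is (numerically) diagonal: no correlation left.
comp_cov = np.cov(components, rowvar=False)
off_diag = comp_cov - np.diag(np.diag(comp_cov))
print("max off-diagonal covariance:", np.abs(off_diag).max())
```

Note that PCA only removes *linear* correlation; it does not guarantee full statistical independence, which is part of why I call this route increasingly esoteric.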

**Features vs. Variables**

In data-analysis lingo, *features* refers to the intangible qualities of the thing being studied; *variables* are what we measure to quantify those features. For example, we cannot directly quantify the reliability of FedEx, but we can count how many times a FedEx delivery truck is late or delivers the wrong package. The abstraction from variables to features is useful because it highlights relationships among variables that we might otherwise miss. This abstraction might seem like so much verbiage when we are very familiar with the system. But when we face an unstructured mass of data, our intuition is useless. We must discover how the observable variables relate to features and, if possible, name those features.

**Figuring out how features and variables relate**

Most approaches to modeling Big Data don't try to relate features to variables (at least not explicitly), because discovering such a relation can be hit-or-miss. We can only automatically recognize certain relationships between features and variables. To be precise, we can only recognize features that arise from linear combinations of variables. In another post, I'll discuss why, with a little preprocessing, this restriction becomes almost meaningless.
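Here is a sketch of what recovering a feature as a linear combination of variables can look like, using invented FedEx-style data (the latent "reliability" and the three observed variables are hypothetical): we never observe reliability directly, but the first principal component of the observed variables, itself just a weighted sum of them, tracks it closely.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical latent feature ("reliability") that is never observed,
# and three observed variables it drives, each with measurement noise.
reliability = rng.normal(size=400)
late_deliveries = -2.0 * reliability + rng.normal(scale=0.5, size=400)
wrong_packages = -1.5 * reliability + rng.normal(scale=0.5, size=400)
on_time_rate = 2.5 * reliability + rng.normal(scale=0.5, size=400)
V = np.column_stack([late_deliveries, wrong_packages, on_time_rate])

# First principal component: a linear combination of the observed
# variables, found via SVD of the centered data.
Vc = V - V.mean(axis=0)
_, _, vt = np.linalg.svd(Vc, full_matrices=False)
feature_estimate = Vc @ vt[0]

# The estimated feature should correlate strongly with the true latent one
# (sign is arbitrary, so compare absolute correlation).
corr = abs(np.corrcoef(feature_estimate, reliability)[0, 1])
print(f"|correlation| with latent feature: {corr:.3f}")
```

When a feature is *not* a linear combination of the variables, this trick fails, which is the restriction the paragraph above describes.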