Independent and Dependent Variables in Data Science
Terminology
In Statistics, Machine Learning, and other areas of Data Science, the variables in a model can generally be divided into two categories.
Dependent Variables
A Dependent Variable is the variable that is expected to be the result or output of the model. It is also called the Output Variable or Response Variable.
Independent Variables
An Independent Variable is an input variable that affects the dependent variable. It is also referred to as an Input Variable, Explanatory Variable, or Predictor Variable.
Explanation
Example
As a very simple example, consider a model that predicts the height of a son, $y$, based solely on the heights of the mother, $x_{1}$, and the father, $x_{2}$. The average height of Korean women is 159.6 cm and the average height of Korean men is 172.5 cm, so the ratio is approximately $\displaystyle {{ 172.5 } \over { 159.6 }} \approx 1.081$. $$ y = y \left( x_{1} , x_{2} \right) = {{ 1.08 } \over { 2 }} x_{1} + {{ 1.00 } \over { 2 }} x_{2} $$ The model simply rescales the mother's height to the male scale by multiplying it by this ratio, then averages the result with the father's height. To predict the height of a daughter, one would instead rescale the father's height to the female scale.
- $y$ is the variable that the model is meant to predict, that is, the Response Variable. Since it is explained by the heights of the parents, $x_{1} , x_{2}$, those are called Explanatory Variables.
- Without the assumption that $y$ depends on $x_{1}, x_{2}$, this model would be meaningless, hence $y$ is called a Dependent Variable. Independent Variable is simply the opposite term. It does not imply that $x_{1}, x_{2}$ are independent of each other, and in general they may not be.
- Even without fully understanding the detailed meaning, one can be certain that inputting $x_{1} = 169$ and $x_{2} = 174$ into the equation outputs the height of a son as $y = 0.54 \cdot 169 + 0.50 \cdot 174 = 178.26$.
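The substitution in the last point can be checked with a few lines of Python. This is only a minimal sketch of the toy formula above. The function name `predict_son_height` and its parameter names are hypothetical, chosen here for illustration.

```python
def predict_son_height(mother_cm: float, father_cm: float) -> float:
    """Toy model from the text: y = (1.08 / 2) * x1 + (1.00 / 2) * x2.

    mother_cm (x1) and father_cm (x2) are the independent variables;
    the returned value (y) is the dependent variable.
    The function and parameter names are hypothetical.
    """
    return (1.08 / 2) * mother_cm + (1.00 / 2) * father_cm


# Inputs used in the text: x1 = 169, x2 = 174.
y = predict_son_height(169, 174)
print(round(y, 2))  # 178.26
```

Changing either argument changes the output, while the output never feeds back into the arguments. That one-directional relationship is what the terms dependent and independent describe here.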
Notation
Unless stated otherwise, independent variables are usually denoted by $X$ and dependent variables by $Y$, with $Y$ placed on the left-hand side. $$ \begin{align*} Y \gets & X_{1} , \cdots , X_{p} \\ Y =& X \beta + \varepsilon \\ y =& f \left( x_{1} , \cdots , x_{p} \right) \end{align*} $$ There are exceptions. In Time Series Analysis, for example, lagged values of the dependent variable themselves serve as independent variables, so the notation does not distinguish between the two and all data may be represented as $Y$.
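The time series exception may be easier to see in code. The sketch below assumes a lag of one step and uses made-up numbers. The helper `make_lagged_pairs` is hypothetical and does not come from any library.

```python
def make_lagged_pairs(series, lag=1):
    """Pair each value y_t (dependent) with y_{t-lag} (independent).

    Hypothetical helper for illustration: both members of every pair
    come from the same series, which is why a single symbol Y can
    stand for all of the data.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    inputs = series[:-lag]   # y_{t-lag}, playing the role of X
    targets = series[lag:]   # y_t, playing the role of Y
    return list(zip(inputs, targets))


# Made-up illustrative data, not taken from the text.
observations = [10.0, 12.0, 11.5, 13.0]
for x, y in make_lagged_pairs(observations, lag=1):
    print(f"independent y_(t-1) = {x}, dependent y_t = {y}")
```

Every value except the first and the last appears once as an input and once as a target, so labelling part of the data $X$ and the rest $Y$ would be arbitrary.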