# Categorical variables can be numbers

## Categorical variables in regression models

*Johannes Lüken / Dr. Heiko Schimmelpfennig*

Regression models are not limited to metric independent variables. Categorical variables such as gender or occupation can also be included if their categories are represented as numbers. Dummy coding is a common approach.

**Dummy coding of independent dichotomous variables**

The aim is to investigate what influence, in addition to the price, the placement of an advertisement has on monthly sales. The linear regression function is thus

Sales volume = b_{0} + b_{1}× price + b_{2}× advertising

While price is a metric variable, advertising has only two categories: an advertisement was placed (at the beginning of a month) or it was not. To include this variable in the regression model, numbers must be assigned to both categories. With dummy coding, one category serves as the reference category and is given the value 0; the other category is given the value 1. In this example, it makes sense to choose "no advertisement" as the reference category. The regression coefficient b_{2} then indicates exactly the amount by which sales change when an advertisement is placed, compared to the reference category "no advertisement", at a constant price.
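The coding described above can be sketched in Python with NumPy. The data and the coefficients 200, -8 and 25 are invented for illustration and are not taken from the article; sales are generated without noise so that the fit recovers exactly these assumed values.

```python
import numpy as np

# Hypothetical monthly data (invented for illustration only).
price = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 10.0])
ad_placed = np.array(["no", "yes", "no", "yes", "yes", "no"])

# Dummy coding: reference category "no" -> 0, "yes" -> 1.
advertising = (ad_placed == "yes").astype(float)

# Noise-free sales built from assumed coefficients b0=200, b1=-8, b2=25.
sales = 200.0 - 8.0 * price + 25.0 * advertising

# Design matrix: intercept, price, advertising dummy.
X = np.column_stack([np.ones_like(price), price, advertising])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = b

# b2 is the shift in sales due to an advertisement at a constant price.
print("b0, b1, b2:", np.round(b, 6))
```

Because the reference category is coded 0, the intercept b0 belongs to months without an advertisement, and b2 is the vertical distance between the two parallel regression lines.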

**Dummy coding of independent variables with more than two categories**

Now suppose a distinction is also made between TV and print advertisements. Three categories can then be distinguished: TV advertisement, print advertisement and no advertisement. The coding therefore requires two variables, W_{1} and W_{2} (W for the German "Werbung", advertising; see Figure 1).

Figure 1: Dummy coding

The combination W_{1} = 1, W_{2} = 0 thus represents TV advertising, W_{1} = 0, W_{2} = 1 print advertising, and W_{1} = 0, W_{2} = 0 no advertising. These combinations uniquely identify all three categories. A third variable W_{3} would not only be redundant but would also lead to exact multicollinearity, so that the regression model could not be estimated. "No advertising" is again the reference category, since both coding variables are 0 for it. In the corresponding regression function

Sales volume = b_{0} + b_{1}× price + b_{2}× W_{1} + b_{3}× W_{2}

b_{2} and b_{3} quantify the effects of TV and print advertising, respectively, on sales volume compared to the reference category. The difference between b_{2} and b_{3} indicates how much the effect of an advertisement differs between the two media.
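As a hedged illustration of the three-category case, the following Python sketch builds W_{1} and W_{2} from a category label and shows what happens when a redundant third dummy is added. The data, the labels "tv", "print" and "none", and the coefficients are assumptions made for this example only.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
medium = np.array(["none", "tv", "print", "tv", "none", "print", "tv", "none"])
price = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0])

# Dummy coding with "none" as the reference category.
W1 = (medium == "tv").astype(float)
W2 = (medium == "print").astype(float)
W3 = (medium == "none").astype(float)  # redundant third dummy

# Noise-free sales from assumed coefficients b0=200, b1=-8, b2=30, b3=20.
sales = 200.0 - 8.0 * price + 30.0 * W1 + 20.0 * W2

X = np.column_stack([np.ones_like(price), price, W1, W2])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
media_difference = b[2] - b[3]  # TV effect minus print effect

# Adding W3 gives five columns but no additional rank,
# because W1 + W2 + W3 equals the intercept column.
X_trap = np.column_stack([X, W3])
rank_ok = np.linalg.matrix_rank(X)
rank_trap = np.linalg.matrix_rank(X_trap)
print(rank_ok, rank_trap, X_trap.shape[1])
```

The rank of the extended matrix stays below its number of columns, which is the exact multicollinearity mentioned in the text: the coefficients are no longer uniquely determined.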

**Interaction effects with categorical variables**

This interpretation of the regression coefficients assumes that the categories are mutually exclusive, i.e. that no multiple answers occur for the categorical variable. TV and print advertisements must therefore not have been placed in the same month. To determine the effect of joint advertising in both media, a separate additional category "TV & Print" must be included (see Figure 2).

Alternatively, TV advertising (yes / no) and print advertising (yes / no) can be treated as two separate dichotomous variables. If the effect of simultaneous advertising in both media is assumed not to be purely additive, there is an interaction effect between them. This can be represented in the model by including the product of the two variables:

Sales volume = b_{0} + b_{1}× price + b_{2}× TV + b_{3}× Print + b_{4}× TV × Print

The effect of joint advertising is then equal to the sum of the individual effects and the interaction effect (b_{2} + b_{3} + b_{4}). This corresponds to the regression coefficient of the coding variable W_{3} from Figure 2, where TV & Print advertising is coded as a separate category.

Figure 2: Dummy coding for multiple answers
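The equivalence between the interaction model and the coding with a separate "TV & Print" category can be checked numerically. The sketch below uses invented data and assumed coefficients (including an interaction of -15); the names `X_int` and `X_cat` are hypothetical labels for the two design matrices.

```python
import numpy as np

# Hypothetical data: all four combinations of TV and print occur.
tv = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
print_ = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
price = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0, 11.0])
ones = np.ones_like(price)

# Noise-free sales with an assumed non-additive joint effect (b4 = -15).
sales = 200.0 - 8.0 * price + 30.0 * tv + 20.0 * print_ - 15.0 * tv * print_

# Model A: two dichotomous variables plus their product.
X_int = np.column_stack([ones, price, tv, print_, tv * print_])
b_int, *_ = np.linalg.lstsq(X_int, sales, rcond=None)
joint_effect = b_int[2] + b_int[3] + b_int[4]

# Model B: one categorical variable with four mutually exclusive categories.
W1 = tv * (1 - print_)   # TV only
W2 = print_ * (1 - tv)   # print only
W3 = tv * print_         # TV & Print
X_cat = np.column_stack([ones, price, W1, W2, W3])
b_cat, *_ = np.linalg.lstsq(X_cat, sales, rcond=None)

print(round(joint_effect, 6), round(b_cat[4], 6))
```

Both models describe the same four group means, so they fit equally well; they differ only in whether the joint effect is read off directly (W_{3}) or assembled from three coefficients.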

An interaction effect between advertising and price, i.e. between a categorical and a metric variable, can also be included in the model. In the introductory example with the dummy-coded dichotomous variable advertising, its product with price is added to the regression function:

Sales volume = b_{0} + b_{1}× price + b_{2}× advertising + b_{3}× price × advertising

Assuming that the relationship between price and sales volume is negative (b_{1} < 0), a negative coefficient b_{3} means that the effect of price on sales is stronger when an advertisement is placed than when it is not. A positive coefficient, on the other hand, indicates lower price sensitivity.
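A short sketch makes the two price slopes explicit: b_{1} applies without an advertisement and b_{1} + b_{3} with one. Data and coefficients are invented for illustration, with an assumed negative interaction b_{3} = -3.

```python
import numpy as np

# Hypothetical data (invented for illustration only).
price = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0, 11.0])
advertising = np.array([0, 1, 0, 1, 1, 0, 0, 1], dtype=float)

# Noise-free sales from assumed coefficients b0=200, b1=-8, b2=60, b3=-3.
sales = 200.0 - 8.0 * price + 60.0 * advertising - 3.0 * price * advertising

# Design matrix: intercept, price, dummy, and the product price x dummy.
X = np.column_stack([np.ones_like(price), price, advertising, price * advertising])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)

slope_without_ad = b[1]        # price effect in the reference category
slope_with_ad = b[1] + b[3]    # price effect when an advertisement is placed
print(round(slope_without_ad, 6), round(slope_with_ad, 6))
```

With these assumed values the slope with an advertisement is more negative than without one, which is the case of higher price sensitivity described above.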

In addition to dummy coding, effect coding and contrast coding are common procedures. The type of coding influences the regression coefficients and their interpretation. The coefficient of determination, however, does not depend on the coding. The same therefore holds for tests of whether the coefficient of determination improves significantly when further variables or interaction effects are included.
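The independence of the coefficient of determination from the coding can be illustrated by fitting the same data once with dummy coding and once with effect coding. In the effect coding assumed here, the reference category "none" is coded -1 on both coding variables. Data, noise values and the helper `fit` are hypothetical.

```python
import numpy as np

# Hypothetical data with fixed, invented noise values.
medium = np.array(["none", "tv", "print", "tv", "none", "print", "tv", "none", "print"])
price = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0, 12.0])
noise = np.array([1.5, -2.0, 0.5, 2.5, -1.0, -0.5, 1.0, -1.5, 0.0])

is_tv = (medium == "tv").astype(float)
is_print = (medium == "print").astype(float)
is_none = (medium == "none").astype(float)
sales = 200.0 - 8.0 * price + 30.0 * is_tv + 20.0 * is_print + noise

def fit(X, y):
    """Least-squares fit; returns coefficients and R squared."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return b, r2

ones = np.ones_like(price)

# Dummy coding: reference category is 0 on both variables.
b_dummy, r2_dummy = fit(np.column_stack([ones, price, is_tv, is_print]), sales)

# Effect coding: reference category is -1 on both variables.
E1 = is_tv - is_none
E2 = is_print - is_none
b_effect, r2_effect = fit(np.column_stack([ones, price, E1, E2]), sales)

print(np.round(b_dummy, 3), np.round(b_effect, 3))
print(r2_dummy, r2_effect)
```

The coefficients of the coding variables and the intercept differ between the two fits, while the price coefficient and the coefficient of determination agree, because both design matrices span the same column space.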

*Contribution from planning & analysis 13/5 in the "Statistics compact" section*

**Author information**

**Johannes Lüken**, a graduate psychologist, is head of the multivariate analysis department at IfaD, Institute for Applied Data Analysis, Hamburg. His work focuses on developing new methods and implementing them in analysis tools, as well as on application, training and consulting related to these methods.

**Prof. Dr. Heiko Schimmelpfennig** is project manager for multivariate analyses at IfaD, Institute for Applied Data Analysis, and Professor of Business Administration at BiTS, Business and Information Technology School, Hamburg. At IfaD, he is primarily responsible for consulting, application and training related to these methods; in teaching, he represents the field of quantitative methods in economics.

