[R] Regression, Fixed Effects, and the Pre-Covid Comics Industry: Dead or Alive?

Tools used: R, Excel
Topic: Statistical Modelling with Hierarchical Models using comic book unit sales data
Search TLDR to skip process and find the results and a short summary.

Introduction

It is nearly impossible to find someone who was born before the 2000s who doesn’t at least consider comics to be a historic part of the entertainment industry. As time has advanced, however, we find that most people would be hard-pressed to mention comics as significant without structuring their points around the Marvel and DC cinematic and TV universes.

So as someone who was curious to understand the industry from a statistical perspective, I decided to download some data from Comichron to see what trends we could identify. Is the industry going down? How dominant are Marvel and DC? Which franchises add the most to sales?

As is the case with any research in 2021, COVID plays a part. In this case, I’ll be looking at the data before COVID to see where things were headed before the new normal. The data is aggregated by Comichron and sourced from the distributor Diamond Comics, which supplies almost all comic book shops. I’ve pulled all data from 2007–2019: ’07 because it’s the year before the Iron Man movie release and ’19 because it’s the last year before COVID. I’d also like to highlight that Comichron’s figures are sales to comic book stores, so they don’t mean those issues were sold to actual customers.

This project was originally handed in as the final project for my Modelling and Representation of Data class in the Duke University MIDS program, taught by the great Olanrewaju M. Akande, whom I’d like to thank for the class. It was updated in 2021 using a lot of what I learned from Nick Eubank, who also teaches in the program. They have not reviewed or approved this article, and the opinions here are not theirs or the program’s.

Let’s see what industry trends we can find with this data.

Fixed Effects and Initial Expectations

As with most data we can access in the entertainment industry, most of the modelling work we can do relies on what we call Fixed Effects. FE are a way of separating your data into groups; in this case we have comics that are from Marvel, from DC, or from other publishers. We can make indicator variables that tell us whether a comic belongs to each group and then estimate the average effect of belonging to that group. These effects are fixed in the sense that they don’t change unless we separate the groups further.

Why is entertainment data so reliant on this? A lot of success in the entertainment industry is tied to the quality of the experience, which can’t really be quantified: there is no variable for story quality. However, Fixed Effects can help us estimate how much variability isn’t coming from the quality of the product, such as the impact of being from a big publisher or how much Batman carries a franchise.

Fixed Effects and splitting data into groups can bring our models closer to the truth and, despite what the name implies, they let us fit linear models with different intercepts and slopes (though different slopes require interactions, which we will discuss later).

The above image shows how splitting your data with Fixed Effects can improve models (here reducing the distance from points to line). Of course, the split has to make real-world sense or your model is fictional.
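To make the idea concrete, here is a minimal sketch on simulated data (not the Comichron data; the group names, shifts, and noise level are all made up for illustration). It compares one pooled line against a model with a fixed effect per group:

```r
# Toy illustration with simulated data; every number here is an assumption.
set.seed(42)
n <- 300
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x <- runif(n, 0, 10)

# Each group gets its own baseline level; the slope on x is shared.
group_shift <- c(A = 0, B = 3, C = 6)
y <- 1 + 0.5 * x + group_shift[as.character(group)] + rnorm(n, sd = 0.5)

pooled <- lm(y ~ x)          # one line for all points
fixed  <- lm(y ~ x + group)  # one intercept per group, common slope

# The group coefficients estimate each group's shift relative to baseline "A".
coef(fixed)

# The residual spread shrinks once the groups are separated.
c(pooled = sigma(pooled), fixed = sigma(fixed))
```

The coefficients `groupB` and `groupC` are the fixed effects: the average vertical shift of each group relative to the baseline group `A`.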

This is important because most of the analysis will depend on knowing how to interpret Fixed Effects, but I will do my best to explain model results even if you skip this section.

Now to the setup.

Exploratory Data Analysis

If you’re a data scientist you know how important this is but I’ll spare most of those who aren’t the details.

From now, you can follow along with my github repo at
Comics/Comics.Rmd at master · joaomansur/Comics (github.com)
The write-up is more academic and includes a bunch of EDA if you’re into that. It also shows how I set up a lot of the variables if you’re into specifics.

Firstly, the Comichron data includes the following variables, and how I considered them for the model:

  1. Unit: The rank of the comic book in unit sales for that month (1 highest, 300 lowest). Since not all months had more than 300 entries, I restricted the analysis to the top 300.
  2. Fused: A rank metric that counts Card Stock and Regular Issues as a single title. Not used.
  3. Dollar: The same as Unit but ranked by dollar value; I assumed this is just Price times Estimated Units. Not used.
  4. Comic-book Title: Self-explanatory. I converted these later to Franchise based on whether they belonged to Spider-Man, Batman, etc.
  5. Issue: The number of the issue. Titles can be split into storylines or can be long-running with multiple storylines. Special issues can be Issue 0 or can have special numbers. I cleaned most of them by removing things like asterisks and other non-numerical characters.
  6. Price: Often $X.99, with the most common being $3.99. I lumped together all non-standard prices, meaning those not ending in .99 or above $7.99.
  7. On sale: The date the comic went on sale.
  8. Publisher: The company responsible for publishing, usually from its own studio, though publishers sometimes represent creations that are not in-house.
  9. Est. units: Estimated units sold in the month. This is our target variable, the thing we are trying to predict. Unfortunately, these are units sold to stores, not final sales to customers, but they are the best we have.
A summary in R for the data with the two points below highlighted. Other variables I’ve coded beyond Comichron’s are there as well.

I’ll skip most of the EDA, but I want to highlight two things you can see in the summary printout above:

  1. DC and Marvel each have between two and three times as many comics published as the third largest, Image Comics, and many times more than smaller publishers (all of those below the top 6 don’t add up to half of Marvel’s units).
  2. Sales go from the thousands to near millions, but the median is around 13,500 and the mean is around 22,000. This means most comics don’t sell hundreds of thousands of units and instead bunch up in the tens of thousands.

Point 2 is very important. If we plot this in a histogram we can see most comics bunch up around the lower ends of unit sales.

On the left a histogram of Unit Sales and on the right the same values distributed after taking their logarithm. Using hist() in R.

A heavily skewed dataset like that can lead to several problems in modelling. We can solve this by taking the logarithm of every value; you can see the result on the right, a much more normal-looking distribution, which we will use in the analysis.

This change impacts the coefficients of the model. With the log transformation, you must interpret a coefficient by exponentiating it to get a multiplier (1 + the % impact). I’ll be doing this for us later but keep it in mind.
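If you want to see the effect of the transformation without the real data, here is a small sketch using simulated right-skewed values. The lognormal parameters are assumptions I picked to look roughly like unit sales; they are not estimated from Comichron.

```r
set.seed(1)
# Simulated right-skewed "sales" (lognormal), NOT the real data.
units <- rlnorm(5000, meanlog = 9.5, sdlog = 0.8)
log_units <- log(units)

# On the raw scale the mean sits well above the median (right skew)...
c(mean = mean(units), median = median(units))
# ...while on the log scale the two nearly coincide.
c(mean = mean(log_units), median = median(log_units))

# Side-by-side histograms, in the spirit of the figure described above.
op <- par(mfrow = c(1, 2))
hist(units, breaks = 50, main = "Unit sales", xlab = "Units")
hist(log_units, breaks = 50, main = "log(Unit sales)", xlab = "log(Units)")
par(op)
```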

Feature Engineering for the Linear Model

If we see the variables available we see three ranks, comic titles, issue numbers, price, and published date. I made the dataset by labelling the month and year of each sale as well.

We wouldn’t want to use Ranks for the model, but I filtered comics to the top 300 using the Unit Rank so there wouldn’t be a bias for months with more data. This means that an initial model we can use would probably look like this in words:

We would use Publishers, Issue numbers, Price, and Month and Year to predict the logarithm of estimated units sold.
Or, with XSold standing for the log of estimated units: XSold = Publisher + Issue + Price + Month + Year + Intercept

This model would essentially tell you: which publishers sell more, what an additional issue means in sales, what an increase in a dollar of price means, how each month impacts sales, and the change in sales every year.

This already sounds attractive but there are several domain differences we need to address.

Publishers

Let’s talk about publishers. There are too many of them, and we saw that Marvel and DC are the major players. Including every small publisher might increase the explanatory power of the model, but since a lot of them barely have enough data to support a reasonable estimate, we can probably assume their coefficients won’t be useful. Without enough data, coefficients tend to overfit to what we do have and thus are not very generalizable.

My solution was to give each of the top 7 publishers its own level in a factor, group the 9 larger indie studios together, and leave the small publishers as one remaining group. These odd numbers were chosen by total sales and my own personal interest.

This means that we won’t have too many publishers to look at and we can establish smaller indie studios as the baseline comparison.
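A minimal sketch of that grouping is below. The top 7 names match the publishers that appear in the model printout later on; the toy data frame, the "Small Indie" label, and the single member of the large-indie list are my own placeholders, since the full list of nine lives in the repo rather than in this article.

```r
# Toy data; publisher names other than the top 7 are placeholders.
comics_demo <- data.frame(
  Publisher = c("Marvel", "DC", "Image", "Dynamite", "Tiny Press", "IDW"),
  stringsAsFactors = FALSE
)

top_publishers <- c("Marvel", "DC", "Image", "IDW", "Dark Horse", "Boom", "Valiant")
large_indies   <- c("Dynamite")  # assumed member, for illustration only

comics_demo$BigPub <- ifelse(
  comics_demo$Publisher %in% top_publishers, comics_demo$Publisher,
  ifelse(comics_demo$Publisher %in% large_indies, "Large Indie", "Small Indie")
)

# Put the small-indie group first so lm() treats it as the baseline level.
comics_demo$BigPub <- relevel(factor(comics_demo$BigPub), ref = "Small Indie")
levels(comics_demo$BigPub)
```

Calling `relevel()` makes the baseline explicit instead of leaving it to alphabetical order.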

The rest of the variables are better off as factors than as plain numbers. The reason is that most of these numbers really just mark different groups, and a change of 1 in the value is often less predictive than the group it falls in.

Issue Number

Issues have a similar problem as publishers but there is also a dynamic we can capture with our model.

  1. First Issue sales might be different; they are often the highest-selling issue of a run.
  2. Some comics run for 2–12 issues as a mini-series, either by plan or by cancellation. Any comic within 12 issues can comfortably be called a mini-series if it doesn’t continue.
  3. Many comics are long-running and can reach a hundred issues or more. We can split issues at the 100 mark; anything beyond that is probably one of the old-school series still going.
  4. Some issues are numbered 0 or something that isn’t a number; we can use that as a baseline.

The result is coding Issues as a factor, splitting comics into something like first issues, mini-series, long-running, and classic. I won’t explicitly use these names, but the Issues will be split into 0 or non-numerical, 1, 2–12, 13–99, and 100+. This lets us check the First Issue impact and group comics by longevity rather than try to express the value of one more issue, which is logically not linear.
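As a rough sketch of that split (the helper name and the level labels are hypothetical, and the repo may code this differently), the grouping can be written as:

```r
# Hypothetical helper: map cleaned issue labels to the five groups described above.
issue_group <- function(issue) {
  n <- suppressWarnings(as.numeric(issue))  # non-numeric labels become NA
  out <- rep("0 or Non-numeric", length(n)) # baseline group by default
  out[!is.na(n) & n == 1] <- "First Issue"
  out[!is.na(n) & n >= 2 & n <= 12] <- "2-12"
  out[!is.na(n) & n >= 13 & n <= 99] <- "13-99"
  out[!is.na(n) & n >= 100] <- "100+"
  # Listing the baseline first makes it the reference level in lm().
  factor(out, levels = c("0 or Non-numeric", "First Issue", "2-12", "13-99", "100+"))
}

issue_group(c("0", "1", "7", "12", "13", "99", "100", "Annual"))
```

Setting the levels explicitly avoids relying on alphabetical order to choose the baseline.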

Price, Month, Year

We group Price similarly: since prices are mostly $X.99, we can split them by the price point they use and lump any non-standard pricing into a group of its own.

We can leave Year alone; after all, the change from one year to the next is what interests us. I’ll use factors for Months because then we can isolate the impact of each month on sales if it interests us. Since some months, like the holidays, often show special behavior, this will probably help.

So now we can finally run the model we mentioned above but we are missing an important question: What about the heroes?

Franchises

We have to create a variable for the franchises of each comic book, but we only have the title. The solution I used is to search for the hero’s or rather the franchise name in comic book titles using the function below:

comics$Franchise[grepl("Spider-Man", comics$Comic.book.Title, fixed=TRUE)]<-"Spider-Man"

It searches comic book titles for Spider-Man and then assigns the value Spider-Man in the column Franchise. Some franchises like Batman also required searches for “Dark Knight” or for other characters like “Joker”; the same is true for Star Wars, which needed a search for most characters’ names.
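For franchises that need several search terms, the same `grepl()` call can be wrapped in a small helper. This is only a sketch: `tag_franchise` is a hypothetical name, and the titles below are illustrative rows rather than the real data.

```r
# Illustrative titles only.
comics_demo <- data.frame(
  Comic.book.Title = c("Amazing Spider-Man", "Batman", "Joker Year of the Villain",
                       "Dark Knight III Master Race", "Saga"),
  stringsAsFactors = FALSE
)
comics_demo$Franchise <- "Non-Franchise"

# Hypothetical helper: tag a franchise when ANY of its search terms is in the title.
tag_franchise <- function(df, franchise, terms) {
  hits <- lapply(terms, function(term) grepl(term, df$Comic.book.Title, fixed = TRUE))
  hit <- Reduce(`|`, hits)  # element-wise OR across all search terms
  df$Franchise[hit] <- franchise
  df
}

comics_demo <- tag_franchise(comics_demo, "Spider-Man", c("Spider-Man"))
comics_demo <- tag_franchise(comics_demo, "Batman", c("Batman", "Dark Knight", "Joker"))
comics_demo
```

Because each call overwrites earlier labels, a title matching two franchises ends up with whichever one was tagged last, which is why the order of the calls matters.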

Here are the franchises I chose to isolate; the order will matter later:

-Marvel: Avengers, Black Panther, Black Widow, Captain America, Captain Marvel, Dark Tower (Stephen King), Deadpool, Doctor Strange, Fantastic Four, Guardians of the Galaxy, Gwenpool, Hulk, Iron Man, Marvel Events, Ms. Marvel, Spider-Man, Star Wars, Thor, Wolverine, X-Men
-DC: Action Comics, Aquaman, Batman, DC Events, Detective Comics, Flash, Green Arrow, Green Lantern, Harley Quinn, Justice League, Superman, Watchmen, Wonder Woman
-Others: Walking Dead and Non-Franchise

These are chosen mostly from the latest issues I was seeing in 2019 or the characters that had movies out. Walking Dead was an Image comic consistently getting good numbers so I saw it worthy of a highlight.

Now this variable has a lot of subjectivity. I grouped some heroes and villains together (Batgirl and Joker in Batman) but left some out (Harley Quinn has her own). I fully understand this impacts analysis and can be a short-coming of the model but I tried to be as efficient as I could while highlighting the most recent franchises being given the spotlight in 2019. Feel free to check the code around line 70 here.

In any case, we now have a variable for each comic franchise and have engineered the other features as best we could.

Interactions

Interactions are two variables that we link together. With two Fixed Effects, it means that when both variables are active then we activate the interaction as well. This isn’t useful for Spider-Man and Marvel, because they are always together, but it can be useful if you want to find out whether Marvel or DC sell more when they do First Issue comics. This is because we can predict the average effect of Marvel on units sold or the effect of being a First Issue on units sold, but we need the interaction Marvel:FirstIssue to see the additional effect Marvel gets from First Issues over the average effects.

In other words, if First Issues sell 20% more on average and the Marvel:FirstIssue coefficient adds 10%, then the total additional sales for a Marvel First Issue come to roughly 30%.
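Since the model works on the log scale, the coefficients add and the percentages compound, so "roughly 30%" is a slight understatement. A quick sketch of the arithmetic, using the hypothetical 20% and 10% from the example above:

```r
# Hypothetical effects from the example in the text, as percentages.
first_issue_pct  <- 20
marvel_first_pct <- 10

# Convert each percentage to a log-scale coefficient...
b_first <- log(1 + first_issue_pct / 100)
b_inter <- log(1 + marvel_first_pct / 100)

# ...add them, as the linear model does, then convert back to a percentage.
total_pct <- (exp(b_first + b_inter) - 1) * 100
total_pct  # about 32: the multipliers compound (1.2 * 1.1 = 1.32)
```

For small effects the difference between adding percentages and compounding them is minor, but it grows quickly with larger coefficients.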

What interactions should we include for the model? In my academic paper I found a lot of interesting subjects in the EDA portion. I’ll later add the following interactions:

  1. Publisher:IssueFactor, Publisher:PriceFactor, Publisher:Year
  2. Franchise:IssueFactor and Franchise:Year
  3. Year:Month and Year:IssueFactor

What this means is that we will be able to see how Publishers’ and Franchises’ unit sales change depending on the Issue group and the year. We will also see whether price affects sales differently across publishers, and how the impact of months and issues changes every year.

However, for this article I will only include the interaction between Publishers and years so we can see how their sales are changing compared to the mean over the years.

If I haven’t lost you so far, we can start making the model.

The Linear Model

Putting together what we have engineered so far, we are modelling:

The log of Estimated Units Sold as predicted by a baseline value (Intercept), the Publisher, what Issue group they’re in, the $X.99 Price tag, their Franchise, the Month they were sold in, and the change in sales per Year in general and for Publishers.

#Which is portrayed in my code as
lm(XSold~BigPub+IssueF+PriceF+Franchise+MonthF+Year+Year:BigPub,data=comics)

We start with a comic book from a small studio, in the baseline price and issue groups, without a big-name franchise, sold in January.

Our model then gives it an additional % for each different criterion it meets, and scales it by a further % for every year that passes.

This model resulted in an adjusted R-squared of .44, which means it explains around 44% of the variance in log unit sales (adjusted R-squared also penalizes you for every coefficient you add). This shouldn’t be disheartening: it leaves around 56% of the variance to things the model can’t see, such as marketing or the quality of the product. That makes sense, right?
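The penalty in adjusted R-squared can be checked by hand from three numbers in the model summary: the multiple R-squared (0.4407), the residual degrees of freedom (46711), and the number of predictors (72, the numerator degrees of freedom of the F-statistic). A small sketch using the standard textbook formula:

```r
r2       <- 0.4407  # Multiple R-squared reported by summary()
df_resid <- 46711   # residual degrees of freedom
p        <- 72      # number of predictors (numerator df of the F-statistic)

n <- df_resid + p + 1  # observations = residual df + predictors + intercept

# Adjusted R-squared shrinks R-squared by the ratio (n - 1) / (n - p - 1).
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

With nearly 47,000 rows and only 72 predictors the penalty is tiny, which is why the two R-squared values are almost identical.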

Here’s a printout of the coefficients; if you don’t feel like reading them, I will highlight the important points below. Just know that the final column holds the p-values, which are starred based on how low they are. We treat values below 0.05 as statistically significant, meaning an effect that large would be unlikely to show up by chance if the true effect were zero. For example, we would not say that Ms. Marvel and Gwenpool as franchises impact sales, as both have a p-value over .1. That’s lucky for Ms. Marvel, which is one of the few franchises with a negative coefficient.

lm(formula = XSold ~ BigPub + IssueF + PriceF + Franchise + MonthF + 
Year + BigPub:Year, data = comics)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.977405 7.944087 -6.417 1.40e-10 ***
BigPubBoom -40.459281 12.546624 -3.225 0.001262 **
BigPubDark Horse 119.203278 10.846248 10.990 < 2e-16 ***
BigPubDC 48.094185 8.276894 5.811 6.26e-09 ***
BigPubIDW 82.700903 9.979959 8.287 < 2e-16 ***
BigPubImage -31.316346 9.259681 -3.382 0.000720 ***
BigPubLarge Indie 60.765968 9.669544 6.284 3.32e-10 ***
BigPubMarvel 40.178950 8.164990 4.921 8.64e-07 ***
BigPubValiant 225.904852 28.456249 7.939 2.09e-15 ***
IssueF100+ 0.291802 0.023112 12.625 < 2e-16 ***
IssueF13-100 0.210838 0.020471 10.299 < 2e-16 ***
IssueF2-12 0.119097 0.019881 5.991 2.11e-09 ***
IssueFFirst Issue 0.222949 0.020742 10.749 < 2e-16 ***
PriceF2.99 0.784167 0.040182 19.515 < 2e-16 ***
PriceF3.99 0.666716 0.038010 17.540 < 2e-16 ***
PriceF4.99 0.737723 0.040544 18.196 < 2e-16 ***
PriceF5.99 0.665016 0.055908 11.895 < 2e-16 ***
PriceF7.99 0.477300 0.062937 7.584 3.42e-14 ***
PriceFNon-Std Pricing 0.653997 0.039604 16.514 < 2e-16 ***
FranchiseAction Comics 0.884951 0.047501 18.630 < 2e-16 ***
FranchiseAquaman 0.564446 0.053714 10.508 < 2e-16 ***
FranchiseAvengers 0.448930 0.021863 20.534 < 2e-16 ***
FranchiseBatman 0.709745 0.019343 36.693 < 2e-16 ***
FranchiseBlack Panther 0.113219 0.057070 1.984 0.047276 *
FranchiseBlack Widow 0.031484 0.081390 0.387 0.698880
FranchiseCaptain America 0.349593 0.037364 9.356 < 2e-16 ***
FranchiseCaptain Marvel 0.133138 0.066559 2.000 0.045476 *
FranchiseDark Tower -0.060250 0.065931 -0.914 0.360805
FranchiseDC Event 0.772140 0.034835 22.165 < 2e-16 ***
FranchiseDeadpool 0.390548 0.031115 12.552 < 2e-16 ***
FranchiseDetective Comics 1.014822 0.045825 22.146 < 2e-16 ***
FranchiseDoctor Strange 0.352380 0.070500 4.998 5.81e-07 ***
FranchiseFantastic Four 0.193080 0.043139 4.476 7.63e-06 ***
FranchiseFlash 0.806282 0.040028 20.143 < 2e-16 ***
FranchiseGreen Arrow 0.536263 0.050876 10.541 < 2e-16 ***
FranchiseGreen Lantern 0.819862 0.030824 26.598 < 2e-16 ***
FranchiseGuardians OTG 0.373273 0.052945 7.050 1.81e-12 ***
FranchiseGwenpool 0.094918 0.117537 0.808 0.419345
FranchiseHarley Quinn 0.843777 0.056976 14.809 < 2e-16 ***
FranchiseHulk 0.227900 0.030746 7.412 1.26e-13 ***
FranchiseIron Man 0.224420 0.037513 5.982 2.21e-09 ***
FranchiseJustice League 0.847034 0.027452 30.855 < 2e-16 ***
FranchiseMarvel Event 0.333688 0.029358 11.366 < 2e-16 ***
FranchiseMs. Marvel -0.122753 0.092930 -1.321 0.186536
FranchiseSpider-Man 0.354782 0.021634 16.400 < 2e-16 ***
FranchiseStar Wars 0.577378 0.025308 22.814 < 2e-16 ***
FranchiseSuperman 0.779556 0.030531 25.534 < 2e-16 ***
FranchiseThor 0.429721 0.043526 9.873 < 2e-16 ***
FranchiseWalking Dead 1.116797 0.045321 24.642 < 2e-16 ***
FranchiseWatchmen 1.164378 0.104117 11.183 < 2e-16 ***
FranchiseWolverine 0.456071 0.028668 15.909 < 2e-16 ***
FranchiseWonder Woman 0.689889 0.038232 18.045 < 2e-16 ***
FranchiseX-Men 0.512668 0.021867 23.445 < 2e-16 ***
MonthF10-Oct 0.189987 0.015276 12.437 < 2e-16 ***
MonthF11-Nov 0.111348 0.015241 7.306 2.80e-13 ***
MonthF12-Dec 0.107509 0.015246 7.052 1.79e-12 ***
MonthF2-Feb 0.002918 0.015229 0.192 0.848034
MonthF3-Mar 0.032524 0.015230 2.136 0.032722 *
MonthF4-Apr 0.053632 0.015244 3.518 0.000435 ***
MonthF5-May 0.091418 0.015244 5.997 2.02e-09 ***
MonthF6-Jun 0.088872 0.015245 5.830 5.59e-09 ***
MonthF7-Jul 0.154031 0.015249 10.101 < 2e-16 ***
MonthF8-Aug 0.107467 0.015243 7.050 1.81e-12 ***
MonthF9-Sep 0.133991 0.015249 8.787 < 2e-16 ***
Year 0.029228 0.003949 7.402 1.36e-13 ***
BigPubBoom:Year 0.020123 0.006234 3.228 0.001247 **
BigPubDark Horse:Year -0.059048 0.005390 -10.954 < 2e-16 ***
BigPubDC:Year -0.023485 0.004115 -5.707 1.16e-08 ***
BigPubIDW:Year -0.040958 0.004960 -8.258 < 2e-16 ***
BigPubImage:Year 0.015704 0.004602 3.412 0.000644 ***
BigPubLarge Indie:Year -0.030132 0.004806 -6.270 3.64e-10 ***
BigPubMarvel:Year -0.019400 0.004059 -4.779 1.77e-06 ***
BigPubValiant:Year -0.111962 0.014120 -7.929 2.25e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6722 on 46711 degrees of freedom
Multiple R-squared: 0.4407, Adjusted R-squared: 0.4398
F-statistic: 511.2 on 72 and 46711 DF, p-value: < 2.2e-16

Like I said before, these coefficients predict log-values, so to make sense of them we exponentiate them (raise Euler’s number to the coefficient), then subtract one and multiply by 100 to get a percent.
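As a small sketch of that conversion, here are a handful of coefficients copied from the printout above and turned into percentages. The vector names are just labels I chose.

```r
# Coefficients copied from the printout (log scale).
b <- c(Batman = 0.709745, `Spider-Man` = 0.354782, `Walking Dead` = 1.116797,
       `First Issue` = 0.222949, Year = 0.029228)

# exp() turns a log-scale coefficient into a multiplier;
# subtracting 1 and multiplying by 100 gives the percent change.
pct <- (exp(b) - 1) * 100
round(pct, 1)
```

With a fitted model object, the same line works on the whole vector returned by `coef()`.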

Let’s look at the highlights:

TLDR: So what did we find out?

We came up with a model that uses publishers, issue ranges, prices, hero franchises, months, years, and each publisher’s change over the years to predict unit sales, which allows us to understand some industry dynamics.

Spoilers: DC comics sell more units per comic and the industry is, well…

Not feeling too good but still there.

Let’s break it down.

Marvel vs DC

First off, the baseline comic which isn’t in any group has a tiny unit sales value. Tiny. A comic that belongs to Marvel sells this much more:

2.8*10¹⁹ %

Yeah. DC meanwhile:

7.7*10²² %

These astronomical percentages are how much larger the model puts DC and Marvel than the small indies who go through Diamond. In a direct comparison, the DC brand comes out 2738 times more valuable than Marvel’s when it comes to unit sales of comic books.

However, Marvel sold around 26% more units than DC comics in our data (471 million to DC’s 373). Marvel also makes the list with 10% more comic books than DC. So what does it mean that DC’s brand is worth 2738 times more sales than Marvel’s?

It means that DC comic book unit sales vary around a much higher baseline value than Marvel’s. Consider the graph shown earlier:

The Fixed Effect of DC comics places it higher like the orange line, as numbers vary around a higher value, while Marvel’s is lower. Though DC sells less in aggregate, their comics fluctuate around a higher number, while Marvel sells more units in aggregate but they vary around a lower number. Thus, the “brand value” for DC is higher per comic book.

That means in head-to-head comparisons, a normal DC comic should outsell an equivalent Marvel comic most of the time.

These values don’t mean much necessarily for the company health, it’s all just units to stores, but at least in order quantity per comic we can see the DC name mattering more.
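One caveat is worth checking when reading these publisher coefficients. The model also contains Publisher:Year terms, and assuming Year was entered as the raw calendar year (which the size of the intercept and the interaction terms suggests), each publisher main effect describes year 0 rather than any year in the data. The sketch below uses only coefficients copied from the printout: it reproduces the headline figures, then adds the interaction back in for 2007 and 2019.

```r
# Coefficients copied from the printout (log scale).
main_eff <- c(DC = 48.094185, Marvel = 40.178950)
year_int <- c(DC = -0.023485, Marvel = -0.019400)

# Headline figures in the text: main effects alone, i.e. evaluated at Year = 0.
(exp(main_eff) - 1) * 100
headline_ratio <- exp(main_eff[["DC"]] - main_eff[["Marvel"]])
headline_ratio

# Publisher effect relative to the baseline publisher in a given calendar year:
# main effect plus the Publisher:Year coefficient times that year.
pub_effect <- function(year) main_eff + year_int * year

in_sample <- sapply(c(`2007` = 2007, `2019` = 2019), pub_effect)
in_sample       # log scale; rows are publishers, columns are years
exp(in_sample)  # multipliers over the baseline publisher
```

Under that assumption, the in-sample multipliers are single digits rather than astronomical (roughly 2.6x for DC and 3.5x for Marvel in 2007), so the headline ratio is best read together with the interaction terms and the franchise coefficients.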

Does this make sense? One possibility could be that Marvel franchises matter more; after all, the true value of the Spider-Man brand is the Marvel multiplier times the Spider-Man multiplier.

However, we are about to see this is not the case.

Franchises

I made an infographic so this is easier for us. Here are the franchises in order of which impact sales more, blue for DC, red for Marvel, black for Image Comics:

Franchises and how many more units are sold in % if a comic is in their franchise. Black Widow, Ms. Marvel, Gwenpool and Stephen King’s Dark Tower weren’t found to add any statistically significant effect.

We can see above that, at least in our model, DC reigns supreme. Don’t be fooled by the Walking Dead’s franchise power compared to the others, because it is from Image Comics, which even seems to have a negative brand when it comes to unit sales.

Besides that, we can see DC’s franchises consistently outperforming Marvel’s, with only Green Arrow and Aquaman losing to Star Wars. So not even Marvel’s top players can outperform DC franchises, nor do they get anywhere near the brand value of a DC comic. In fact, Marvel had the only franchises that didn’t show any statistically significant impact: Black Widow, Ms. Marvel, Gwenpool, and Dark Tower. Even Captain Marvel and Black Panther were close to not reaching statistical significance.

It really does seem that when it comes to comics, DC just has more name value. I’m sure Marvel won’t be too concerned with all that Disney and movie money, but DC does have it better here.

Issues and Price

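Reading the printout against its baseline (the issue group missing from the coefficient list, which is the 0 or non-numerical group we set aside earlier), First Issues sell about 25% more, Issues 2–12 about 13% more, Issues 13–99 roughly 23% more, and 100+ Issues about 34% more. First issues clearly outperform the 2–12 issues that follow them, which is the drop-off I’d expect, while the long-running titles hold up well. If you’re wondering how the baseline was chosen, R uses the alphabetically first factor level unless told otherwise, and luckily that was the lowest-selling group.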

As far as Price is concerned, the 0.99 price seems to be unpopular as all others outperform it. It seems 2.99 and 4.99 are the best prices for unit sales.

Time

Firstly, our model sees year-over-year growth in units sold to stores of 2.97% since 2007, which means it consistently outperformed US GDP growth. So, that’s nice.

However, since 2007 both Marvel and DC have seen yearly decreases in orders compared to that baseline. Marvel’s interaction coefficient was -1.92% and DC’s was -2.32%. Because these effects combine on the log scale, this means that while other comics grow in unit sales by almost 3% a year, Marvel is growing by about 0.99% and DC by only about 0.58%. This means that Marvel and DC unit sales grew less than the US GDP after 2010.
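The per-publisher growth rates come from adding the Year coefficient to each publisher’s Publisher:Year coefficient before exponentiating. A short sketch with the values from the printout:

```r
# Coefficients copied from the printout (log scale).
b_year <- 0.029228
b_int  <- c(Marvel = -0.019400, DC = -0.023485)

# Yearly growth for the baseline publisher group.
baseline_growth <- (exp(b_year) - 1) * 100

# Yearly growth for each publisher: add the interaction on the log scale first.
publisher_growth <- (exp(b_year + b_int) - 1) * 100

round(c(Baseline = baseline_growth, publisher_growth), 2)
```

Subtracting the percentages directly (2.97 minus 1.92, for example) gives a close but slightly higher answer, because percentages compound rather than add.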

The number change is low and stagnant. It is clear that Marvel and DC are not growing substantially in their unit sales. Comic sales are not the US GDP and such percentage growth is not meaningful when you’re talking such low numbers. When movies explode in millions of sales and viewers, it’s hard to believe that the comic book industry is even close to being a prime entertainment industry.

What about Indie comics? Boom and Image are showing growth above the baseline, but Dark Horse, IDW, Valiant, and the “Large Indie” companies are in negative growth territory.

It’s really sad for me, but I have to conclude that the model shows the comic book industry is stagnant when it comes to yearly growth. Since the vast majority of the market is DC and Marvel, and they are growing less than a percent a year, it is hard to conclude otherwise. With the COVID pandemic surely making sales worse, it is most likely that the situation has not improved in 2020 and 2021.

Lover of stories, Student at Duke MIDS