The proliferation of sources for game reviews over time has brought many benefits to people in search of consensus opinions. Unfortunately, because of bias and other factors, more reviews can also bring more noise, especially for Metacritic.
http://www.nintendoworldreport.com/editorial/44420/the-positives-and-pitfalls-of-review-democratization
The release of The Legend of Zelda: Breath of the Wild certainly brought out a lot of opinions. Many of them were extremely positive, but there was obviously a reflexive backlash as well, both in “artificial” reviews (created with the express purpose of simply trashing the game) and in “reviews” whose goal may not have been to objectively review the game but instead to act as an expression of defiance for specific communities. While the outright bogus reviews tended to be in the user community space, disqualifying them from being counted in aggregation scores on Metacritic, what created a little more of a stir was that a small number of less-than-stellar reviews also got counted, bringing down the overall average for the game in the all-time rankings. All of this brings us to the point of this editorial, which is to explore where things were at earlier points in time, where they are now, and then some of the potentially difficult questions that may need to be asked concerning which reviews (both good and bad, BTW) merit being “counted” at the end of the day.
Before getting too far into things, let's get one thing out of the way very quickly. All people are entitled to their opinions, no matter how they may be formed or influenced, and they even deserve for those opinions to be shared and read (whether you decide to care may be a different matter). In reviewing games, in particular, what qualifies someone to review a game has obviously reached a pretty low bar. People no longer need a journalistic platform to publish their thoughts on; reviews are solicited almost everywhere and, again, that's fine. The trickier part of things is tied to aggregation, and whether you're interested in the raw mathematical average or want it set up in a way that at least attempts to be accurate. If you read the content of a review you're often able to quickly discern whether it is the ramblings of either a fanboy or a hater, and you can pretty well throw those out on both ends. If all you're dealing in is the final numbers, that gets a bit hairier. Keep in mind, as well, that part of the reason it matters how these scores are averaged is that people's careers and corporate decisions can literally be informed by them; it has repeatedly been noted that some publishers look at Metacritic scores to “grade” development efforts. So this conversation is for more than just abstract interest in accuracy.
Getting on to where we once were, back in the stone age there were pretty well only mainstream print publications to go by. You could check the scores in the latest EGM, GamePro, Next Generation, and a host of others... and that was roughly it. Sure, at some point word of mouth would get rolling, but the overall lack of variety in reviews made it tough, sometimes, to be confident. Worse, these publications would sometimes print reviews without attribution, or the reviews would be a product of collective opinion, possibly robbing readers of an honest feel for a particular reviewer's style or preferences. Even if you did know who wrote the review, since almost all publications would only print one review per game, even that connection could end up worthless if someone you didn't know/like/trust was the one who did the review that month. In short, going back to that era wouldn't be preferred.
Progress into the early internet age and things began to get a little more interesting, serving as a sort of preview for where we are now. Independent game networks and fan sites of various kinds began to crop up, some with more polish than others, but the benefit was an increase in volume and diversity of opinion. Especially in the early days, most independent sites weren't getting free games. The opinions being offered were from fellow gamers, just like anyone else, who had spent their own money and were going to share their thoughts whether good or bad, in some ways improving their authenticity. Of course, sometimes this would come at the cost of consistency or perspective, so fan sites, in particular, could skew heavily positive at times. But since most sites would post multiple reviews, you could at least have the advantage of diversity even within the staff of a website, and you could come to value the opinions of specific reviewers as well.
This cascades into the modern day, and the point where things get both complicated and a bit overwhelming. There's just a load of opinion out there, plain and simple. Some of it is still in the mold of the integrity exhibited in the classic print publication space, whether private or professional, and a ton of it isn't. The great thing is that this climate amounts to there being reviews to fit all tastes and temperaments. If you can't find someone with a review style and track record you generally agree with, you probably aren't looking hard enough. With the climate being what it is, perhaps in that case what you should be doing is writing your own reviews to establish your own following... it really is just about that crazy anymore. But that same diversity and craziness is where the question begins to crop up of which reviews, at the end of the day, deserve to be counted. This is where things can get a bit ugly.
We'll start with a reviewer who shall not be named (if you read my opinion on the review it will be clear why I'm not doing it) and a particular score given for Breath of the Wild that did get counted. Again, I won't question the overall principle that people are entitled to state their opinion, but I also disagree with that specific review score being counted. The complication with this type of reviewer is that, rather than coming from a school of measured thought, they're focused first and foremost on the persona they project to their fan base. The new-ish creation in this generation is the “personality” reviewer: people with a shtick and a following that is driven more by adherence to that gimmick than by necessarily being accurate. In this specific case I'd argue that it wasn't brave to score the game low. It was actually self-serving, a manufactured score that would appease the rabid community who loves contrarian and anti-establishment grandstanding while not scoring things so low as to lose any hope of legitimacy. Let's face it, it was a very “safe” score to give for all of the hubbub... bravery would have been making it higher or lower.
To be fair, though, if we begin being critical of the lower and possibly troller-ish end of the spectrum we also need to take a hard look at the people who may be skewing things up unnaturally, in the end doing just as much damage (if not more, since they're probably more prevalent). Are there really “legitimate” reviews out there for 1-2-Switch that are over 8? Or even as high as 7? Really? This game would be considered worthy of what would be a “passing grade” in people's traditional thinking? I'll probably write something else in the future about how value versus purchase price really needs to factor much further into modern review scoring, but at the point this game gets higher reviews, their legitimacy should be pretty severely questioned. This all ultimately means you're combating a problem both from the bottom and from the top.
That gets us into the last phase: trying to determine what could and should be done to aid in making the aggregated scores more “accurate”. Even among publications or sites that are considered to be “legitimate”, I think a strong case could be made that, rather than attempting to determine whether to add/remove a site or reviewer across the board, it would be easiest and best to do what many places do when determining averages: throw out both the top x and bottom x scores (whether x should be a set number or a percentage, and what that number should be, are up for debate). In theory, if all reviews were 100% legitimate the removals would mostly balance out, doing no real harm, but the approach would likely prevent severe outliers from skewing things up or down. If enough people think a game is great or stinks, removing that number of reviews obviously wouldn't change a thing: you're only removing individual reviews from the average, and if enough people agreed on a specific numeric score the majority of those scores would still stand.
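To make the “throw out the top x and bottom x” idea concrete, here is a minimal sketch of a trimmed average. The function name `trimmed_mean`, the parameter `trim_count` (the x above, as a set number rather than a percentage), and the scores themselves are all made up for illustration; this is not how Metacritic calculates anything.

```python
def trimmed_mean(scores, trim_count=1):
    """Average the scores after dropping the trim_count highest and lowest."""
    if trim_count < 0:
        raise ValueError("trim_count must be non-negative")
    if len(scores) <= 2 * trim_count:
        raise ValueError("not enough scores to trim that many from each end")
    ordered = sorted(scores)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


# Hypothetical review scores on a 10-point scale, not real data.
scores = [10, 10, 10, 10, 9.5, 9.5, 9, 9, 7, 6]

plain = sum(scores) / len(scores)              # every review counted
trimmed = trimmed_mean(scores, trim_count=2)   # top 2 and bottom 2 dropped

print(plain)    # 9.0
print(trimmed)  # 9.5

# When everyone agrees, trimming changes nothing.
print(trimmed_mean([8, 8, 8, 8, 8], trim_count=1))  # 8.0
```

In this made-up case the two low scores drag the plain average down by half a point, while the trimmed version lands where the bulk of the reviews sit. Note that two of the 10s get thrown out as well, which is the “both ends” part of the bargain.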
Barring some other type of standard adjustment of this kind, the only option I would see possible is, again, a need to continuously evaluate whether individual reviewers or outlets should be considered legitimate, pretty well an impossible task and one that would generate far more controversy than it is worth (especially given the tendency of “personality” reviewers to be a tad dramatic with a mentality of “forget the games, just focus on me”). Also, though on any given review an outlet or individual reviewer may skew up or down quite a bit (I would have likely been eliminated back in the day because I pretty well detested the original Tomb Raider games, likely scoring them 2 or more points below the average), I would guess that most of the time their scores would be a little closer to the norm. I understand that Metacritic attempts to help with an adjustment based on its “weighted” score of individual outlets, but honestly that method is even more prone to issues if one of its highly-weighted outlets turns in a skewed review somehow, making the problem worse. Besides, don't you think that if people found out how those weights were determined they would likely begin to nitpick even that? I know I probably would.
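The concern about weighting can be shown with a little arithmetic. The sketch below assumes five outlets, one of which counts three times as much as the others. Those weights are invented purely for illustration, since the real ones aren't something I know, and the helper `weighted_mean` is likewise hypothetical.

```python
def weighted_mean(scores, weights):
    """Weighted average: each score counts in proportion to its weight."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must be the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Invented weights: the first outlet counts three times as much as the rest.
weights = [3, 1, 1, 1, 1]

# Same set of hypothetical scores both times; only who gave the 6 changes.
heavy_outlier = [6, 9, 9, 9, 9]   # the heavily weighted outlet skews low
light_outlier = [9, 9, 9, 9, 6]   # a lightly weighted outlet skews low

plain = sum(heavy_outlier) / len(heavy_outlier)      # unweighted average
from_heavy = weighted_mean(heavy_outlier, weights)
from_light = weighted_mean(light_outlier, weights)

# The same low score hurts more coming from the heavily weighted outlet.
print(from_heavy < plain < from_light)  # True
```

Under these assumed weights a single skewed review does more damage than it would in a straight average when it comes from the favored outlet, and less when it comes from anyone else.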
At the end of the day, the fact that all games currently scored by Metacritic have had the same criteria applied to them (sort of) makes their aggregated scores "fair". However, looking over the top 5 ranked games of all time, the fact that the most recent of them is from almost a decade ago likely isn't a coincidence. The fact is that as more reviews are added to the mix, especially considering the diverse voices that can be out there, the less certain it is how a game will break out from the pack. With those added voices and numbers also comes the probability of baggage, both high and low, coming along for the ride, further complicating getting an accurate picture of things other than by hoping that the sheer force of numbers will help average things out. But when you see a spread of 3 – 4 points or more from your highest score to your lowest score, perhaps something is up there that, in the end, may not be worth counting. It may be silly to worry over it, but if the goal of an aggregated score is to be accurate, this sort of adjustment would at least seem to be more honestly aimed at meeting that goal.