Ever since the first websites started implementing ratings and review systems, the entire practice felt somewhat off to me. I wasn’t yet working in technology, and hoped that the implementations might improve over time. Over twenty years later, I’m dismayed by how little change we’ve seen in the review systems we use to inform decisions about what to buy, eat, read, or watch, or where to stay, fly, or spend our time.

These systems largely perform to the bare minimum of expectations and place the majority of interpretation work on the people hoping to use accumulated reviews to help make a decision. They are generally of the sort where you enter a numerical rating and a bit of text explaining why that rating was chosen. From there, an average is produced and featured as the subject’s rating, along with an ability to read some or all of the textual reviews people have written about the subject.
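To make the mechanic concrete, the entire aggregation logic of these systems amounts to little more than this (a minimal Python sketch; the names and data here are my own invention):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Review:
    rating: int  # typically 1 to 5 stars
    text: str    # free-form justification

def subject_score(reviews: list[Review]) -> float:
    """The featured score is just the arithmetic mean of the numeric ratings."""
    return mean(r.rating for r in reviews)

reviews = [
    Review(5, "Loved it"),
    Review(1, "this sucks"),
    Review(4, "Solid, with minor flaws"),
]
print(round(subject_score(reviews), 1))  # 3.3; every nuance in the text is discarded
```

Everything interesting about the reviews, the text itself, sits outside the computation entirely.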

§

There are problems with qualification of the reviewer and quality of the review. Why should we listen to this reviewer’s opinion? How do you account for low-effort reviews? These are difficult problems to solve at scale, and solving them is typically left to the person reading the reviews in hopes of making a decision.

On most websites that aggregate reviews, a reviewer is presumed qualified simply because they’re willing to offer their opinion, to provide fodder for the system. A consequence of having open feedback systems on the internet is what I’ll call the Boaty McBoatface problem: such systems are inherently vulnerable to manipulation.

The problems of spam reviews and ratings manipulation have been around for a while, but the phenomenon of review bombing shows that anything on the open internet is a potential target.

Then there is the problem of low-quality reviews, which tend to fall into three categories:

  1. lacking justification, such as “1 star - this sucks”

  2. a mismatched justification, which can range from “the shipping was really slow and it got here too late and my event was ruined” to “I ordered the eggplant parmesan at a pasta restaurant and it was terrible”

  3. mismatched expectations, which often look like a mismatched justification offered semi-knowingly and with excuses, and which come off as entitlement on the part of the reviewer.

Many systems have added an “is this review helpful?” yes/no meta-review rating system to help with assessing the quality of reviews, but often this is only used to sort the textual reviews, and no insight is provided into how the meta-review system affects the subject’s score. Since these meta-reviews seem to operate only at the meta-review level, they could help address my next two problems, but typically don’t.
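As an aside on how those helpful/unhelpful votes typically get used: they become a sort key, nothing more. One well-known way to build such a key (not necessarily what any given site does) is the lower bound of the Wilson score interval, which keeps a review with one helpful vote out of one from outranking a review with 95 out of 100:

```python
from math import sqrt

def wilson_lower_bound(helpful: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for the helpful proportion;
    a conservative sort key for review helpfulness."""
    if total == 0:
        return 0.0
    p = helpful / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)

print(wilson_lower_bound(1, 1))     # ~0.21
print(wilson_lower_bound(95, 100))  # ~0.89: far more trustworthy
```

Note that this only reorders the reviews; as I said, it never feeds back into the subject’s headline score.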

This gets into the question of what the acceptable boundaries for a review are. Giving an item purchased from a web shop a low rating because it arrived late seems unacceptable to me, but what about a motel with a high frequency of break-ins to cars in its parking lot? Rating a sushi place poorly because you didn’t like the tempura seems wrong to me, but what about rating a book based on the printing quality?

A propensity towards hyperbole and negativity bias falls into the domain of quality as well. It is much easier to write a ranting 1-star or gushing 5-star review than it is to take the time to write something more thoughtful.

These purportedly objective rating systems are starting to look like they rely on a lot of subjectivity.

§

There is also the problem of the review evaluator’s alignment with any particular reviewer on any particular subject. People have differing needs, tastes, and methods of evaluating things, and these are often highly contextual.

You and I may agree about coffee, but have wildly different opinions about pizza. Perhaps the things I look for in places to stay are similar to one reviewer, and therefore their lodgings reviews are more relevant to me than most, but we look for radically different things in places to eat, so their reviews would be less relevant to me than most.

This is natural and we should account for it. College students and parents of newborns have entirely different restaurant needs from a couple seeking a romantic night out, or some friends looking to reconnect. It is left up to the person evaluating the places and the reviews to determine how much alignment they have in their needs and tastes.

This could be a place where the “is this review helpful?” meta-review system I described earlier helps surface more relevant reviews on which to evaluate a particular subject or class of subjects. But (and I’ll revisit this theme later) since it’s unclear to me whether rating reviews will actually help me, personally, as an evaluator, I have no incentive to do it.

The interaction of rating a review as helpful or not is scoped to one individual review of one individual subject, so the act of rating a review only helps future evaluators. Without any benefit to our future evaluative selves, we might only bother giving feedback on particularly egregious or particularly good reviews.

Such systems also rarely offer an ability to read through other reviews by a particular author, to help determine how aligned their reviews are with our own needs and tastes. Of course, this can become a huge privacy concern depending on the given review subjects.

The alignment between reviewer and evaluator often becomes apparent with regional travel recommendations. I am an unapologetic coffee snob; if I’m looking for coffee east of the Mississippi River and the results include Dunkin’ Donuts, you have failed at providing me with subjectively relevant reviews.

§

The main problem with these types of ratings systems is they attempt to be authoritative. This restaurant is the best, that restaurant is average, don’t even bother with that one.

I observed previously that what makes for a low-quality review can be subjective, and that alignment between a reviewer and the person reading their review rests on inherently subjective and contextual opinions. So why do the systems that aggregate these reviews pretend to provide an objective statement that this thing is four stars?

I believe this conceit is a leftover from two things: the pre-internet era, when reviews were published by established media outlets or by individual reviewers with their own cultural authority, and the simple review mechanics that are easiest to implement.

What most review systems at major websites offer us is the illusion of authority, certified through averages. This movie is average. This restaurant is average. This hotel is average, this app is average, this waffle iron is average. Averages lie. We all know this, so instead we often look at the ratings distribution and then pore through reviews. Compared to what they could be, this style of review system is mediocre at best — two thumbs down.

That the ratings distribution often tells a more relevant story to our needs betrays the disconnect behind the pretend-authoritative rating score. Is this item a 3 because people either love it or hate it, or is it a 3 because it’s truly average with some outliers? With travel in particular, even the distribution can fail to paint a picture of relevancy, often due to differing regional expectations, norms, and tastes.
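A toy illustration (the numbers are invented) of why the distribution matters: two products with identical headline scores and completely different stories.

```python
from collections import Counter
from statistics import mean

polarizing = [1] * 40 + [2] * 5 + [3] * 10 + [4] * 5 + [5] * 40  # love it or hate it
middling   = [1] * 5 + [2] * 15 + [3] * 60 + [4] * 15 + [5] * 5  # truly average

for name, ratings in [("polarizing", polarizing), ("middling", middling)]:
    dist = Counter(ratings)
    shape = "  ".join(f"{star}★×{dist[star]}" for star in range(1, 6))
    print(f"{name}: mean = {mean(ratings):.1f}  ({shape})")

# Both lines print mean = 3.0; the headline score erases the shape entirely.
```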

Of course, knowing what’s relevant for a particular evaluation of reviews is difficult, even for humans. Instead of a review site such as Yelp or Travelocity, you might seek out a concierge or a trusted local friend, a subject matter expert (often with their own biases) who then fits their knowledge of available options to their understanding of your stated needs and desires.

Systems that attempt to offer authoritative ratings can’t, by their very nature, offer subjective recommendations. The rating and review that I would give now about my favorite local seafood restaurant is not the review that me-from-ten-years-ago would have given, and is not the same review that someone unadventurous and unwilling to try sautéed geoduck would give, nor should it be! Taste is personal, and it evolves over time, just like people do.

One might understandably think authoritative ratings become more useful the further you get from personal preference and the deeper you get into utilitarian needs, which raises the question: for what sort of utilitarian instrument does taste become less relevant and personalized need more relevant? Cars? Hot water kettles? Airlines? Backpacks? Rechargeable batteries?

The problem of acceptable boundaries for a review also applies here — is it ok to rate a book poorly because of the printing quality? Is it ok to rate a hotel poorly because your car was broken into? Is it ok to rate a sushi restaurant poorly because they didn’t serve your favorite soda? That’s all subjective — maybe these things are important to some people, but not you.

How well can you define the problem you need to solve when searching through reviews? At what point, when considering various options via their reviews, do the considerations for individual needs look different from the considerations for taste? How could one even begin to express all of this?


§

For all of my complaints above, I will also freely admit to rarely writing reviews for things. When I do, perhaps I’ve had a bad experience and want to vent — the 1-star hyperbolic rant. I have occasionally posted positive reviews for particularly good experiences, sometimes to help a fledgling business or product I like get some momentum, or to counterbalance negative reviews I disagree with.

Mostly, however, I don’t write reviews because I am by nature a reflective person; I would prefer to give a thoughtful assessment of something. For purchased items, that often means living with the item for some time; for restaurants, unless something was egregiously negative, it might take a few visits.

Crafting a good paragraph or two that articulates why you feel the way you do about something is work, which brings me to the other reason I typically don’t write reviews: there’s often not much in it for me. Maybe someone else will use my opinion to make a decision, maybe not. But will this act of review be used as a signal to help me with evaluations in the future? It sure doesn’t seem like it.

It would be really nice if, based on my reviews for coffee, I didn’t get recommendations for Dunkin’ Donuts when I travel to the east coast. When I rate one restaurant higher than a categorically similar one, the system should learn from that. Who else do I have alignment with on those ratings, and how else do we agree? That kind of signal would be really useful in helping make decisions. The benefit to me of rating something should be receiving, in turn, better recommendations.

Achieving something like this would be hard, but I believe it’s possible, and a less wasteful use of applied machine learning than bullshit generators. It’s hard in part because you have to mathematically figure out how to group people who like places like my local Olympia Coffee versus places with followings like Dutch Bros. Coffee, and hard in part because helping people define their needs and expectations is hard.
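To gesture at what that first half might look like, here is a deliberately naive sketch (Python; all names and numbers invented, and a real system would use proper collaborative filtering) of scoring reviewer alignment per category as cosine similarity over commonly rated subjects:

```python
import math

# ratings[person][category][subject] = stars; purely hypothetical data
ratings = {
    "me":    {"coffee": {"Olympia Coffee": 5, "Dutch Bros": 2}},
    "alice": {"coffee": {"Olympia Coffee": 5, "Dutch Bros": 1}},
    "bob":   {"coffee": {"Olympia Coffee": 2, "Dutch Bros": 5}},
}

def alignment(a: str, b: str, category: str) -> float:
    """Cosine similarity of two people's ratings over subjects both have rated."""
    ra = ratings[a].get(category, {})
    rb = ratings[b].get(category, {})
    shared = ra.keys() & rb.keys()
    if not shared:
        return 0.0
    dot = sum(ra[s] * rb[s] for s in shared)
    norm_a = math.sqrt(sum(ra[s] ** 2 for s in shared))
    norm_b = math.sqrt(sum(rb[s] ** 2 for s in shared))
    return dot / (norm_a * norm_b)

print(alignment("me", "alice", "coffee"))  # ~0.98: weight alice's coffee reviews up
print(alignment("me", "bob", "coffee"))    # ~0.69: discount bob's coffee reviews
```

In practice you would at least mean-center each person’s ratings (Pearson correlation) so that universal positivity doesn’t read as alignment, and the per-category keying is what lets my coffee alignment with someone differ from my pizza alignment with them.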

In the meantime, I’m hoping we can see a resurgence of curated recommendation lists. Ratings systems continue to disappoint, and I don’t believe the large players like Amazon or Yelp have any incentive to improve them.


Tending Notes

I am finally posting this after letting it languish in my notes for over two years, thanks to the Digital Garden effort. I have more to say, touching on Jobs to be Done theory and the increasing prevalence of companies asking customers to review their employees. I also want to gather some links for a “further reading” section like I have done in other essays here.