
21: An Amazon-style rate and review model

This is adapted from our recent paper in F1000 Research, entitled “A multi-disciplinary perspective on emergent and future innovations in peer review.” Due to its rather monstrous length, I’ll be posting chunks of the text here in sequence over the next few weeks to help disseminate it in more easily digestible bites. Enjoy!

This section outlines what a model of Amazon-style peer review could look like. Previous parts in this series:

  1. An Introduction
  2. An Early History
  3. The Modern Revolution
  4. Recent Studies
  5. Modern Role and Purpose
  6. Criticisms of the Conventional System
  7. Modern Trends and Traits
  8. Development of Open Peer Review
  9. Giving Credit to Referees
  10. Publishing Review Reports
  11. Anonymity Versus Identification
  12. Anonymity Versus Identification (II)
  13. Anonymity Versus Identification (III)
  14. Decoupling Peer Review from Publishing
  15. Preprints and Overlay Journals
  16. Two-stage peer review and Registered Reports
  17. Peer review by endorsement
  18. Limitations of decoupled Peer Review
  19. Potential future models of Peer Review
  20. A Reddit-based model

——————————————————————————

Amazon (amazon.com/) was one of the first websites to allow the posting of public customer book reviews. The process is informal and completely open to participation, so that anyone can write a review and vote, usually provided that they have purchased the product. Customer reviews of this sort are peer-generated product evaluations hosted on a third-party website, such as Amazon (Mudambi & Schuff, 2010). Here, usernames can be either real identities or pseudonyms. Reviews can also include images and have a header summary. In addition, a fully searchable question and answer section on individual product pages allows users to ask specific questions, which are answered by the page creator and voted on by the community; top-voted answers are then displayed first. Chevalier & Mayzlin (2006) investigated the Amazon review system and found that, while reviews on the site tended to be positive overall, negative reviews had a greater impact on sales. Reviews of this sort can therefore be thought of as adding or subtracting value from a product or piece of content, and can ultimately help guide third-party evaluation and purchasing decisions (i.e., a selectivity process).

Amazon’s star-rating system.

Star-rating systems are used frequently at a high level in academia, and are commonly used to define research excellence, albeit in a flawed and arguably detrimental way; e.g., the Research Excellence Framework in the UK (ref.ac.uk) (Mhurchú et al., 2017; Moore et al., 2017; Murphy & Sage, 2014). A study of Web 2.0 services and their use in alternative forms of scholarly communication by UK researchers found that nearly half (47%) of those surveyed expected that peer review would be complemented by citation and usage metrics and user ratings in the future (Procter et al., 2010a; Procter et al., 2010b). Amazon provides an example of a sophisticated collaborative filtering system based on five-star user ratings, usually combined with several lines of comments and timestamps. Each product is summarized with the proportion of total customer reviews that have rated it at each star level, together with an average star rating. A low rating (one star) indicates an extremely negative view, whereas a high rating (five stars) reflects a positive view of the product. An intermediate score (three stars) can either represent a balance between negative and positive points or merely reflect a nonchalant attitude towards the product. These ratings provide a basic layer of accountability and act as a signal of popularity and quality for items and sellers.
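As a rough illustration of that summary, the sketch below (a minimal, hypothetical Python example, not anything Amazon publishes) aggregates a list of 1–5 star ratings into per-star proportions and an overall average:

```python
from collections import Counter

def summarize_ratings(ratings):
    """Summarize 1-5 star ratings as per-star proportions plus an average.

    `ratings` is assumed to be an iterable of integers from 1 to 5,
    one per customer review.
    """
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        return {"average": None, "proportions": {star: 0.0 for star in range(1, 6)}}
    proportions = {star: counts.get(star, 0) / total for star in range(1, 6)}
    average = sum(star * n for star, n in counts.items()) / total
    return {"average": round(average, 1), "proportions": proportions}

# A positively skewed distribution of the kind discussed below
print(summarize_ratings([5, 5, 5, 4, 5, 3, 5, 5, 1, 5]))
# {'average': 4.3, 'proportions': {1: 0.1, 2: 0.0, 3: 0.1, 4: 0.1, 5: 0.7}}
```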

The utility of such a star-rating system for research is not immediately clear, nor is it obvious whether positive, moderate, or negative ratings would be more useful for readers or users. A bare rating by itself would be of little use to researchers without the context and justification behind it. It is also unclear how a combined rate-and-review system would work for non-traditional research outputs, as the extremity and depth of reviews have been shown to vary depending on the type of content (Mudambi & Schuff, 2010). Furthermore, the ubiquitous five-star rating tool used across the Web is flawed in practice and produces highly skewed results. For one, when people rank products or write reviews online, they are more likely to leave positive feedback. The vast majority of ratings on YouTube, for instance, are five stars, and this pattern is repeated across the Web, with an overall average estimated at about 4.3 stars regardless of the object being rated (Crotty, 2009). Ware (2011) confirmed this average for articles rated in PLOS, suggesting that academic rating systems operate in a similar manner to other social platforms. Rating systems also select for popularity rather than quality, which is the opposite of what scholarly evaluation seeks (Ware, 2011).

Another problem with commenting and rating systems is that they are open to gaming and manipulation. The Amazon system has been widely abused, and it has been demonstrated how easy it is for an individual or a small group of friends to influence popularity metrics even on heavily visited websites such as the Time 100 (Emilsson, 2015; Harmon & Metaxas, 2010). Amazon has historically prohibited compensation for reviews, prosecuting businesses that pay for fake reviews as well as the individuals who write them, with one exception: reviewers could post an honest review in exchange for a free or discounted product, as long as they disclosed that fact. A recent study of over seven million reviews indicated that the average rating for products with these incentivized reviews was higher than for non-incentivized ones (Review Meta, 2016). To contain this phenomenon, Amazon recently adapted its Community Guidelines to eliminate incentivized reviews. As mentioned above, ScienceOpen offers a five-star rating system for articles, combined with post-publication peer review, but here the incentive is simply that the review content can be re-used, credited, and cited. Other platforms, such as Publons, allow researchers to rate the articles they have reviewed on a scale of 1–10 for both quality and significance. How such rating systems translate to user and community perception in an academic environment remains an interesting question for further research.

Reviewing the reviewers.

At Amazon, users can vote on whether or not a review was helpful with a simple binary yes/no option. Potential abuse can also be reported here, creating a system of community-governed moderation. After a sufficient number of "yes" votes, a user is upgraded to a "spotlight reviewer" through what is essentially a popularity contest, and as a result their reviews are given more prominence. Top reviews are those which receive the most "helpful" upvotes, usually because they provide more detailed information about a product.

One potential way of improving rating and commenting systems is to weight such ratings according to the reputation of the rater (as done on Amazon, eBay, and Wikipedia). Reputation systems intend to achieve three things: foster good behavior, penalize bad behavior, and reduce the risk of harm to others as a result of bad behavior (Ubois, 2003). Key features are that reputation can rise and fall, and that reputation is based on behavior rather than social connections, thus prioritizing engagement over popularity. In addition, reputation systems do not have to use the true names of the participants but, to be effective and robust, they must be tied to an enduring identity infrastructure. Frishauf (2009) proposed a reputation system for peer review in which the review would be undertaken by people of known reputation, thereby setting a quality threshold that could be integrated into any social review platform and automated (e.g., via ORCID). One further problem with reputation systems is that having a single formula to derive reputation leaves the system open to gaming, as can reasonably be expected of almost any process that can be measured and quantified. Gashler (2008) proposed a decentralized and secure system in which each reviewer would digitally sign each paper, so that the signature would link the review to the paper (a minimal sketch of this idea is given below). Such a web of reviewers and papers could be data mined to reveal information on the influence and connectedness of individual researchers within the research community. Depending on how the data were mined, this could be used as a reputation system or web-of-trust system that would be resistant to gaming because it would specify no particular metric.
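As a rough sketch of that signing idea, and assuming nothing about Gashler's actual design: each reviewer holds a keypair, and the signature covers hashes of both the paper and the review, so that anyone with the reviewer's public key can verify that this reviewer wrote this review of this paper. The example below uses Python with the third-party cryptography package; the function names are purely illustrative.

```python
from hashlib import sha256

# Requires the third-party "cryptography" package (pip install cryptography)
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

def sign_review(reviewer_key, paper_text: bytes, review_text: bytes) -> bytes:
    """Bind a review to a specific paper by signing hashes of both."""
    payload = sha256(paper_text).digest() + sha256(review_text).digest()
    return reviewer_key.sign(payload)

def verify_review(reviewer_public_key, paper_text: bytes, review_text: bytes,
                  signature: bytes) -> bool:
    """Check that this reviewer signed this review for this paper."""
    payload = sha256(paper_text).digest() + sha256(review_text).digest()
    try:
        reviewer_public_key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False

reviewer_key = ed25519.Ed25519PrivateKey.generate()
sig = sign_review(reviewer_key, b"full text of the paper", b"text of the review")

print(verify_review(reviewer_key.public_key(), b"full text of the paper",
                    b"text of the review", sig))   # True
print(verify_review(reviewer_key.public_key(), b"a different paper",
                    b"text of the review", sig))   # False: signature does not match
```

Because each review is cryptographically tied to one paper and to a reviewer's enduring key, the resulting graph of signatures can be mined however a platform chooses, which is the sense in which no single gameable metric is baked in.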

Tennant JP, Dugan JM, Graziotin D et al. A multi-disciplinary perspective on emergent and future innovations in peer review [version 3; referees: 2 approved]. F1000Research 2017, 6:1151 (doi: 10.12688/f1000research.12037.3)
