The TrustMarkScore Algorithm In Depth
TrustMarkScore’s review collection and rating scan spider is an algorithm for automatically ranking user-generated business reviews by their helpfulness and accuracy. Given a collection of reviews, the TrustMarkScore algorithm identifies a lexicon of dominant terms that constitutes the core of a virtual optimal review. This lexicon defines a feature vector representation; reviews are converted to this representation and ranked according to their distance from the ‘virtual core’ review vector. The algorithm is fully unsupervised and thus avoids costly and error-prone manual training annotations. Practical applications demonstrate that TrustMarkScore clearly outperforms a baseline imitating review-site user-vote ranking systems.
The World Wide Web contains a wealth of opinions on just about anything. Online opinions come in various forms and sizes, from short and informal talkbacks, through opinionated blog postings to long and argumentative editorials. An important source of information are postings in internet forums dedicated to product reviews. In this era of user generated content, writing product reviews is a widespread activity. People’s buying decisions are significantly influenced by such product reviews.
However, in many cases the number of reviews is very large (popular products attract many thousands of reviews, not to mention many nuisance or malicious ones), so many reviews go unnoticed. As a result, there is increasing interest in review analysis and review filtering, with the goal of automatically finding the most helpful and factual reviews.
In order to help users find the best reviews, some websites, such as Yelp, employ a voting system in which users can vote on review helpfulness. However, user voting mechanisms suffer from various types of bias and even fraud, including the imbalanced vote bias (users tend to rate others’ opinions positively rather than negatively), the winner-circle bias (reviews with many votes get more attention and therefore accumulate votes disproportionately), and the early-bird bias (the first reviews to be published tend to collect more votes). Likewise, negative reviews tend to take prominence simply because they become a target for flames, trolls and spam.
The core TrustMarkScore engine offers a novel method for content analysis, which is especially suitable for product reviews. Our system automatically ranks reviews according to their estimated helpfulness.
First, our TrustMarkScore algorithm identifies a core of dominant terms that defines a virtual optimal review. This is done in two stages: scoring terms by their frequency, and then identifying the terms that are less frequent but contribute more information relevant to the specific product. These terms are added to the core of the virtual optimal review, and TrustMarkScore uses them to define a feature vector representation of the optimal review.
Reviews are then converted to this representation and ranked according to their distance from the ‘virtual core’ review vector. The quick score method is fully unsupervised, avoiding the labor-intensive and error-prone manual training annotations typically used in content ranking tasks. In practice, TrustMarkScore clearly outperforms a baseline imitating the user-vote model used by sites such as Yelp.
The following section discusses related work. Next we present the details of the algorithm. The evaluation setup and results are given in the fourth section. We will also discuss and analyze different aspects of the results.
Broadly speaking, user reviews can be thought of as essays having the reviewed product as their topic. One key goal of a review spider is to identify off-topic essays based on lexical similarity between essays in a collection supposedly on the same topic. One demonstrated method used clustering and regression models based on surface features such as average essay length, average word length and the number of distinct words, ignoring the content altogether. A later evolution is a commercial grading system based on several models that analyze discourse segments, stylistic features, grammar usage, lexical complexity and lexical similarity of essays on the same topic.
An optimal helpful review could also be thought of as the best summary of many other reviews, each contributing some insight to the optimal review. This is a type of multi-document summarization. In this sense, review ranking is similar to the evaluation of multi-document summarization systems. However, previous text analysis systems suffer from the annotation bottleneck, as human-produced summaries (and/or annotations) are required. The TrustMarkScore engine needs no annotation at all, since it uses the vast number of online reviews themselves in place of human-produced summaries.
TrustMarkScore is not a summarization system nor a summary evaluation system per se, although the quick scan spider performs a similar function. From a different angle, product reviews are opinions (sentiments) on a particular product stated from various perspectives. The different perspectives expressed in documents can be distinguished based on the statistical divergence of their content distributions.
Part of our goal, then, was to learn to identify opinionated documents by assigning a subjectivity score (learned from an annotated corpus) to each document. This technique could be used to compare different machine learning algorithms for sentiment classification of book or movie reviews, for example, making it possible to classify reviews according to the polarity of the sentiment expressed.
TrustMarkScore’s primary goal, however, is to grade the helpfulness of the review as a whole, not merely the nature and strength of the sentiment it expresses.
In practical application, TrustMarkScore differs from traditional review spiders in three main aspects. First, our method is fully unsupervised, requiring no manually annotated training set: in quick scan mode, TrustMarkScore requires no human-prepared data or intervention. Avoiding such preprocessing improves result quality, since supervised systems are usually trained on well-written corpora and thus tend to perform poorly on freely formed user-generated content such as business reviews.
A second difference is that while the works above address electronic products, we focus primarily on business reviews. Whereas electronic products have a relatively small number of features discussed in reviews (typically found in semi-structured specification sheets), business reviewers tend to express themselves more conversationally and to discuss many aspects that are of a subjective, emotional nature.
Not only are these “reactionary”, emotional statements much harder to extract, they are hard even to define. TrustMarkScore therefore uses a flexible set of key concepts, not only those specifically mentioned in the product specs and in pro/con lists.
Third, while other spider technologies measure success by correlation with “Yelp type” user-vote-based rankings, we believe this method to be biased.
The TrustMarkScore Algorithm
TrustMarkScore is based on a collaborative principle. Given multiple reviews of a business, TrustMarkScore identifies the most important concepts. The challenge lies in finding those concepts that are important but infrequent. The main idea employed by TrustMarkScore is to use the given collection of reviews along with an external balanced corpus in order to define a reference virtual core (VC) review.
The VC review is not the best possible review on this product, but is, in some sense, the best review that can be extracted or generated from the given collection (hence our usage of the term ‘virtual’: the collection might not contain a single review that corresponds to the core review and the virtual core review may change with the addition of a single new review to the collection). We do not generate the VC review explicitly; all reviews, including the VC one, are represented as feature vectors.
The feature set is the lexicon of dominant terms contained in the reviews, so that vector coordinates correspond to the overall set of dominant terms. Reviews are then ranked according to a similarity metric between their vectors and the VC vector.
Our approach is inspired by classic information retrieval, where a document is represented as a bag of words, and each word in each document is assigned a score that reflects the word’s importance in this document. The document is then represented by a vector whose coordinates correspond to the words it contains, each coordinate having the word’s score as its value.
Similarity of vectors denotes similarity of documents. The key novelty in our approach is in showing how to define, compute and use a virtual core review to address the review ranking problem. Our features (dominant terms) constitute a compact lexicon containing the key concepts relevant to the reviews of a specific product. The lexicon typically contains concepts of various semantic types: direct references to the business and its offering, references to similar businesses or to other businesses within the same vertical, and other important contextual aspects. We identify this lexicon in an unsupervised and efficient manner, using a measure derived from an external balanced reference corpus.
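The vector representation and ranking step described above can be sketched as follows. This is a deliberate simplification, not the proprietary TrustMarkScore implementation: it assumes whitespace tokenization, raw term counts as coordinate values, and cosine similarity as the metric, and the function names (`to_vector`, `rank_reviews`) are ours for illustration.

```python
import math
from collections import Counter

def to_vector(text, lexicon):
    """Represent a review as a vector over the dominant-term lexicon.

    Each coordinate is the count of one lexicon term in the review.
    """
    words = Counter(text.lower().split())
    return [words[term] for term in lexicon]

def cosine_similarity(u, v):
    """Cosine similarity between two term vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_reviews(reviews, lexicon, vc_vector):
    """Rank reviews by similarity of their vectors to the virtual core vector."""
    scored = [(cosine_similarity(to_vector(r, lexicon), vc_vector), r)
              for r in reviews]
    return [r for _, r in sorted(scored, key=lambda p: p[0], reverse=True)]
```

With a toy lexicon `["friendly", "implants", "parking"]` and a uniform virtual core vector, a review mentioning two lexicon terms ranks above one mentioning none.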
The Virtual Core Review
Lexical items (one word or more) are associated with the virtual core review according to their dominance. In order to identify the dominant terms, we use a balanced corpus B of general English (we used the British National Corpus (BNC)). This resource is not specific to our problem and does not require any manual effort. The key concepts are identified in the following manner.
First we compute the frequency of all terms in the reviews collection. Each term is scored by its frequency, hence frequent terms are considered more dominant than others (stopwords are obviously ignored). Then, the terms are re-ranked by their frequency in the reference corpus B.
This second stage allows us to identify the concepts that serve as key concepts with respect to the specific business. For example, terms like ‘business’ or ‘Dentist’ are usually very frequent in the reviews corpus; however, their contribution to the helpfulness of a review is limited, as they do not provide the potential reader with any new information or insight beyond the most trivial.
On the other hand, concepts like ‘Los Angeles Dentist’ or ‘Implants in Burbank’ are not as frequent but are potentially important, so the scoring algorithm should allow them to gain a high dominance score.
Once each term has a dominance score, we choose the most dominant lexical items to create a compact virtual core review.
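The two-stage dominance scoring can be sketched roughly as shown below. The `reference_freq` table is a hypothetical per-term frequency map from a balanced corpus such as the BNC, the stopword set is a token subset, and the ratio-with-smoothing formula is our illustrative stand-in for the actual scoring: it rewards terms frequent in the review collection but rare in general English.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}  # illustrative subset

def dominance_scores(reviews, reference_freq, smoothing=1.0):
    """Score terms that are frequent in the reviews but rare in a balanced corpus.

    `reference_freq` maps a term to its frequency in the reference corpus;
    terms unseen there receive only the smoothing mass, boosting their score.
    """
    counts = Counter(
        word
        for review in reviews
        for word in review.lower().split()
        if word not in STOPWORDS
    )
    return {
        term: count / (reference_freq.get(term, 0.0) + smoothing)
        for term, count in counts.items()
    }

def virtual_core_lexicon(reviews, reference_freq, k=50):
    """Pick the k most dominant terms as the virtual core review lexicon."""
    scores = dominance_scores(reviews, reference_freq)
    return [t for t, _ in
            sorted(scores.items(), key=lambda p: p[1], reverse=True)[:k]]
```

A domain term like “implants”, absent from the reference table, outranks a generic word like “great” even when both occur equally often in the reviews.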
A good business review presents the potential reader with enough information without being verbose, so each review is given a score using a proprietary subjective equation. The evaluation of review ranking raises some further issues worth mentioning. A good business review is one that helps the user decide whether to work with the business in question; in that sense, the evaluation of reviews is even more problematic. A potential customer is genuinely interested in the business, and we may assume that the way they approach the immense number of available reviews differs from the mental approach of an objective evaluator who has no real interest in the business.
We refer to this problem as the motivation issue. For example, a real buyer may find interest in a poor review if it adds to the information they already have.
Finally, the evaluation procedure for these tasks is rather labor-intensive, so we need an evaluation procedure that balances these constraints.
There are two main approaches to review ranking, neither of which fully overcomes the motivation issue. Given a review, a proprietary formula determines whether a decision could be made based on this review alone, whether the review is good but more information is needed, whether the review conveys some useful though shallow information, or whether the review is simply useless. Because it applies strict rules, this method yields consistent, repeatable judgments.
A length punishment factor is needed in order to penalize reviews that are too short or too long. This design decision was based on our familiarity with “Yelp type” business reviews; the function could be adjusted to punish or favor long reviews according to user preferences or to corpus-specific (review-site-specific) characteristics. In our experiments we assumed that users tend to get annoyed by verbose reviews, so a punishment for excessive length is applied.
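The exact punishment function is proprietary. One plausible shape, shown purely for illustration, is a Gaussian-style multiplicative factor that peaks at an assumed ideal length; the `ideal` and `tolerance` values here are made-up knobs, not TrustMarkScore’s actual parameters.

```python
import math

def length_penalty(n_words, ideal=150, tolerance=100):
    """Multiplicative length factor in (0, 1]: 1.0 at the ideal length,
    decaying for reviews that are much shorter or much longer.

    Raising `tolerance` makes the factor more forgiving of verbose
    reviews, matching the adjustability discussed above.
    """
    return math.exp(-((n_words - ideal) ** 2) / (2.0 * tolerance ** 2))
```

A review at the ideal length keeps its full score, while a ten-word stub or a six-hundred-word essay is discounted.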
Other Factors That Contribute To The Rating System
1.) Business website (Double-confirmation that the website is engaged in business practices as described on the site)
2.) Site loading speed (Confirmation that business is located in jurisdiction as claimed to avoid phishing and fraudulent websites served in frames)
3.) Valid html (w3c validator) (Performed as a service to our clients as part of value added service for search engine optimization purposes)
4.) Valid CSS (w3c validator) (Performed as value added service to ensure that customer experience is consistent across multiple platforms)
5.) Social Networking Profiles (Facebook, Twitter, Google Plus) (Rudimentary web presence check as value added service and cross-referencing purposes)
6.) Site Alexa ranking, Alexa inbound (Used to confirm claims or other references match the historical record)
7.) Google PR, Site rank (Value added routine for the purposes of search engine optimization)
8.) Site SEO Check
(A general review of code quality, provided as a value-added service to our clients, comprising the itemized checks below)
• Site back links
• Meta tag analysis
• Title Tag
• Meta Description
• Meta Keywords
• Images optimized on website page (An informal consistency check for search engine optimization and site validation purposes)
• Site Link Check (Broken links and External links) (Value added service, as broken links or non-functioning site sections cause credibility problems with customers)
• Sitemap.xml (For search engine optimization purposes)
• Robots.txt (For search engine optimization purposes)
• H1 and H2 Tag (For search engine optimization purposes)
• Nested table Check (To ensure a stable customer experience on site, cross platform compatibility and search engine optimization purposes)
• HTML Page Size (Confirmation that uncompressed HTML size is under the average web page size of 33 kb otherwise customers could experience outages)
• Google Backlinks (Confirmation of online presence and for search engine optimization purposes)
• Google Indexing (Confirmation of online presence and for search engine optimization purposes)
• Yahoo Backlinks (Confirmation of online presence and for search engine optimization purposes)
• DMOZ Directory (Confirmation of online presence and for search engine optimization purposes)
• Website press releases found (Confirmation of online presence and for search engine optimization purposes)
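A few of the on-page checks above (page size, title tag, meta description, H1) can be approximated with a naive scan of the fetched HTML. This sketch uses substring matching rather than a real HTML parser and is only illustrative; the 33 kb threshold is the average-page-size figure cited in the list.

```python
def basic_page_checks(html: str):
    """Run a few of the listed on-page checks against a page's raw HTML.

    Returns a dict mapping check name to pass/fail. Tag detection is a
    deliberately naive substring scan, not a real HTML parse.
    """
    lower = html.lower()
    return {
        "size_under_33kb": len(html.encode("utf-8")) < 33 * 1024,
        "has_title_tag": "<title" in lower,
        "has_meta_description": 'name="description"' in lower,
        "has_h1": "<h1" in lower,
    }
```

In practice a production checker would parse the DOM and fetch robots.txt and sitemap.xml separately; this sketch only covers the static checks that can be read off one response body.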
Business reviews vary greatly in style and can discuss a business from many perspectives and in many different contexts. We presented the TrustMarkScore algorithm for information mining from business reviews. TrustMarkScore identifies a compact lexicon (the virtual core review) that captures the most prominent features of a review along with rare but significant features. The algorithm is robust and performs well on different business verticals.
The algorithm is fully unsupervised, avoiding the annotation bottleneck typically encountered in similar tasks. The simplicity of TrustMarkScore enables easy understanding of the output and an effortless parameter configuration to match the personal preferences of different users.
Accuracy was the core objective in generating a reliable rating or review score for our clients and their customers alike, and we believe the resulting TrustMarkScore rating system compares favorably with other spider and ranking systems.
All Content Copyright 2013 - TrustMarkScore - All Rights Reserved