Monday, June 20, 2016

Sentiment Analysis and Topic Detection in R using Microsoft Cognitive Services

In our previous post, we covered {mscsweblm4r}, a R package available on CRAN that wraps the Microsoft Cognitive Services Web Language Model REST API. In today's post, we'll go over {mscstexta4r}, a new R package that gives us access to another silo of Microsoft's NLP technology, its Text Analytics API.

What does {mscstexta4r}  do?

Per Microsoft's website, the Microsoft Cognitive Services Text Analytics REST API is a suite of text analytics web services built with Azure Machine Learning to analyze unstructured text. By exposing R bindings for this API, {mscstexta4r} allows the following operations:

  • Sentiment analysis - Is a sentence or document generally positive or negative?
  • Topic detection - What's being discussed across a list of documents/reviews/articles?
  • Language detection - What language is a document written in?
  • Key talking points extraction - What's being discussed in a single document?

Sentiment analysis

The API returns a numeric score between 0 and 1. Scores close to 1 indicate positive sentiment and scores close to 0 indicate negative sentiment. Sentiment score is generated using classification techniques. The input features of the classifier include n-grams, features generated from part-of-speech tags, and word embeddings. English, French, Spanish and Portuguese text are supported.

Sentiment analysis is often applied to product and business reviews (Amazon, Yelp, TripAdvisor, etc.) for marketing/customer service purposes. It is also increasingly used in fintech for stock prediction using Twitter opinion mining, general stock market behavior prediction, etc.

Topic detection

This API returns the detected topics for a list of submitted text records. A topic is identified with a key phrase, which can be one or more related words. This API requires a minimum of 100 text records to be submitted, but is designed to detect topics across hundreds to thousands of records. The API is designed to work well for short, human-written text such as reviews and user feedback. English is the only language supported at this time.

Topic detection/mining can certainly help identify predominant themes across vast quantities of reviews but it can also be used to detect and index similar documents in large corpora, or cluster documents by their inferred topic.

Language detection

This API returns the detected language and a numeric score between 0 and 1. Scores close to 1 indicate 100% certainty that the identified language is correct. A total of 120 languages are supported.

Extraction of key talking points

This API returns a list of strings denoting the key talking points in the input text. English, German, Spanish, and Japanese text are supported.

How do I install this package?

First things first: before you can use the {mscstexta4r} R package, you will need to have a valid account with Microsoft Cognitive Services. Once you have an account, Microsoft will provide you with a (free) API key listed under your subscriptions. After you've configured {mscstexta4r} with your API key, as explained in the Reference Manual on CRAN, you will be able to call the Text Analytics REST API from R, up to your maximum number of transactions per month and per minute -- again, for free.

You can install the latest stable version of {mscstexta4r} from CRAN as follows:

install.packages("mscstexta4r")

After following the Reference Manual instructions regarding API key configuration, you'll be able to use the package with:


library(mscstexta4r)
textaInit()

How do I use {mscstexta4r}?

With this package, as with {mscsweblm4r}, we've tried to hide as much of the complexity associated with RESTful API HTTP calls as possible. The following example demonstrates how trivial it is to use the sentiment analysis API from R:


docsText <- c(
  "Loved the food, service and atmosphere! We'll definitely be back.",
  "Very good food, reasonable prices, excellent service.",
  "It was a great restaurant.",
  "If steak is what you want, this is the place.",
  "The atmosphere is pretty bad but the food is quite good.",
  "The food is quite good but the atmosphere is pretty bad.",
  "The food wasn't very good.",
  "I'm not sure I would come back to this restaurant.",
  "While the food was good the service was a disappointment.",
  "I was very disappointed with both the service and my entree."
)
docsLanguage <- rep("en", length(docsText))

tryCatch({

  # Perform sentiment analysis
  textaSentiment(
    documents = docsText,    # Input sentences or documents
    languages = docsLanguage
    # "en"(English, default)|"es"(Spanish)|"fr"(French)|"pt"(Portuguese)
)

}, error = function(err) {

  # Print error
  geterrmessage()

})
#> texta [https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment]
#> 
#> --------------------------------------
#>              text               score 
#> ------------------------------ -------
#>  Loved the food, service and   0.9847 
#>  atmosphere! We'll definitely         
#>            be back.                   
#> 
#>   Very good food, reasonable   0.9831 
#>   prices, excellent service.          
#> 
#>   It was a great restaurant.   0.9306 
#> 
#>   If steak is what you want,   0.8014 
#>       this is the place.              
#> 
#>  The atmosphere is pretty bad  0.4998 
#>  but the food is quite good.          
#> 
#> The food is quite good but the  0.475 
#>   atmosphere is pretty bad.           
#> 
#>   The food wasn't very good.   0.1877 
#> 
#> I'm not sure I would come back 0.2857 
#>      to this restaurant.              
#> 
#>  While the food was good the   0.08727
#> service was a disappointment.         
#> 
#>  I was very disappointed with  0.01877
#>    both the service and my            
#>            entree.                    
#> --------------------------------------

It doesn't get any simpler than this. The results are conveniently formatted as a dataframe, as with all the other {mscstexta4r} functions. We've already covered in detail why it is important to use tryCatch() in our previous post, so we won't go over it again, except to remind you that HTTP requests over a network and the Internet can fail. tryCatch(), in our opinion, offers the best mechanism to handle those failures.

Synchronous vs asynchronous execution

All but one core text analytics functions execute exclusively in synchronous mode. textaDetectTopics() is the only function that can be executed either synchronously or asynchronously. Why? Because topic detection is typically a "batch" operation meant to be performed on thousands of related documents (product reviews, research articles, etc.).

What's the difference?

When textaDetectTopics() executes synchronously, you must wait for it to finish before you can move on to the next task. When textaDetectTopics() executes asynchronously, you can move on to something else before topic detection has completed.

When to run which mode

If you're performing topic detection in batch mode (from an R script), we recommend using the textaDetectTopics() in synchronous mode, in which case, again, it will return only after topic detection has completed.

If you need to operate the same R console while topic detection is being performed by the Microsoft Cognitive services servers, you should call textaDetectTopics() in asynchronous mode and then call textaDetectTopicsStatus() periodically yourself, until the Microsoft Cognitive Services server complete topic detection and results become available.

For additional details -- and plenty of sample code! -- please take a look at the Vignette included with the package or review the README in the package's GitHub repo.

How fast is {mscstexta4r}?

The answer to that question depends on many factors, including the kind of Internet service provider you are using. In our case, with this package and {mscsweblm4r}, we've been very impressed by the low latency of most HTTP calls, despite our less-than-stellar ISP.

This applies to sentiment analysis, language detection, and key talking points extraction, for which we get results almost immediately on a small number of short documents. With topic detection, the API requires a minimum of 100 documents. In our experience, topic detection often took between 7 and 9 minutes on batches of 100 typical Yelp-size reviews (if there is such a thing).

If you'd like to get a feel for the speed of execution of the {mscstexta4r} package without having to write any code for it, you may want to try MSCSShiny, the web app we created to test/demo our packages. You can try it live on shinyapps.io, or download it from GitHub. Note that if the web app on shinyapps.io doesn't show up, its monthly max usage for the free tier has already been exceeded. If the app shows up but doesn't give you results, it is because there are either too many people trying to use it at the same time, or the Microsoft Cognitive Services rate limiters have kicked in. For a fair assessment, I highly recommend that you install the web app from GitHub, set it up to use your personal API key, and run it on your own machine inside the wonderful RStudio.

Have fun!

Links

{mscstexta4r} on CRAN, on GitHub
MSCSShiny on shinyapps.io, on GitHub

All Microsoft Cognitive Services components are Copyright (c) Microsoft.