Tuesday, May 24, 2016

Using the Microsoft Cognitive Services Web Language Model REST API from R

Microsoft Cognitive Services -- formerly known as Project Oxford -- are a set of APIs, SDKs and services that developers can use to add AI features to their applications. Those features include emotion and video detection; facial, speech and vision recognition; and speech and language understanding.

The Microsoft Cognitive Services website provides several code samples that illustrate how to use their APIs from C#, Java, JavaScript, ObjC, PHP, Python, Ruby, and... you guessed it -- if you want to test drive their service from R, you're pretty much on your own.

To experiment with Microsoft's NLP technology in R, we've developed R bindings for a subset of their work, the Web Language Model API.

What is {mscsweblm4r}?

{mscsweblm4r} is a R package, downloadable from CRAN, that simply wraps the Microsoft Cognitive Services Web Language Model REST API,

Per Microsoft's website, this API uses smoothed Backoff N-gram language models (supporting Markov order up to 5) that were trained on four web-scale American English corpora collected by Bing (web page body, title, anchor and query).

The MSCS Web Language Model REST API supports four lookup operations:
  • Calculate the joint probability that a sequence of words will appear together.
  • Compute the conditional probability that a specific word will follow an existing sequence of words.
  • Get the list of words (completions) most likely to follow a given sequence of words.
  • Insert spaces into a string of words adjoined together without any spaces (hashtags, URLs, etc.).

How do I get started?

To use the {mscsweblm4r} package, you **MUST** have a valid account with Microsoft Cognitive Services. Once you have an account, Microsoft will provide you with a (free) API key. The key will be listed under your subscriptions.

At the time of this writing, we're allowed 100,000 transactions per month, 1,000 per minute, more than enough for us to evaluate the technology -- for free...

After you've configured {mscsweblm4r} with your API key (see the Reference manual on CRAN for details), you will be able to call the Web Language Model REST API from R, up to your maximum number of transactions per month and per minute.

How hard is it to use?
{mscsweblm4r} tries to make it as simple as possible. Like most other published packages, you can install it from CRAN with:

install.packages("mscsweblm4r")

After you've followed the Reference Manual instructions regarding API key configuration, you'll be ready to use the package with:

library(mscsweblm4r)
weblmInit()

Here's a trivial example that allows you to get the list of word candidates most likely to follow a very common sentence beginning -- "how are you"


tryCatch({

  # Generate next words
  weblmGenerateNextWords(
    precedingWords = "how are you",  # ASCII only
    modelToUse = "title",            # "title"|"anchor"|"query"(default)|"body"
    orderOfNgram = 4L,               # 1L|2L|3L|4L|5L(default)
    maxNumOfCandidatesReturned = 5L  # Default: 5L
  )

}, error = function(err) {

  # Print error
  geterrmessage()

})
#> weblm [https://api.projectoxford.ai/text/weblm/v1.0/generateNextWords?model=title&words=how%20are%20you&order=4&maxNumOfCandidatesReturned=5]
#>
#> ---------------------
#>  word    probability
#> ------- -------------
#>  doing     -1.105
#>
#>   in       -1.239
#>
#> feeling    -1.249
#>
#>  going     -1.378
#>
#>  today      -1.43
#> ---------------------

No surprise there -- "how are you doing" is a common expression, indeed. What's with the negative probability, though? In the world of N-gram probabilistic language models, using log10(probabilities) is a common strategy to avoid numeric underflows when multiplying very small numbers (n-gram probabilities) together -- see page 22 of Dan Jurafsky's PDF on N-gram Language Models, here.

What matters more, here, is that weblmGenerateNextWords()'s results are conveniently formatted as a data.frame, as is the case for all {mscsweblm4r} core functions.

Why do I need to use tryCatch()?

The Web Language Model API is a RESTful API. HTTP requests over a network and the Internet can fail. Because of congestion, because the web site might be down for maintenance, because you decided to enjoy your cup of Starbucks coffee and work in the courtyard and your wireless connection suddenly dropped, etc. There are many possible points of failure.

The API call can also fail if you've exhausted your call volume quota or are exceeding the API calls rate limit. Unfortunately, Microsoft Cognitive Services does not expose an API you can query to check if you're about to exceed your quota, for instance. The only way you'll know is by looking at the error code returned after an API call has failed.

Therefore, you must write your R code with failure in mind. And for that, our preferred way is to use tryCatch(). Its mechanism may appear a bit daunting at first, but it is well documented.

We've also included many examples in the Reference manual that illustrate how to use it with other {mscsweblm4r} functions.

How fast is {mscsweblm4r}?

The internet service provider we use (and that shall rename nameless) isn't exactly known for its performance, to say the least. Despite this, we've often been floored by how low the latency was for most HTTP requests.

Yes, this is purely anecdotal, and your experience may turn out differently from ours. What we're hoping, though, is that some of you will be interested in using our package as a building block for your own evaluation framework. If you do, please share your results with us. We will gladly make room for them on the package's GitHub page for the NLP community at large to review.

Can I try a live demo?

Sure, you can. To validate our package, we developed a Shiny web app called MSCSShiny. The source code is available here. We've also published the application to shinyapps.io.

After our free shinyapps.io credits for this app expire, if you still want to try the web application, you'll be able to download the code from GitHub and run it locally, or push it to your favorite publishing platform. In all cases, you will have to configure it to use your own Web LM API key.

Enjoy!

Links

{mscsweblm4r} on CRAN

{mscsweblm4r} on GitHub

MSCSShiny on GitHub

MSCSShiny on shinyapps.io

All Microsoft Cognitive Services components are Copyright © Microsoft.

No comments:

Post a Comment