Channel: Search Results for “shiny” – R-bloggers

How I Use Vagrant and Docker in Consultancy Projects


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Doug Ashton – Data Scientist, UK

Just like you I like to try out all the latest tech. If there’s a new feature in Shiny then I’ll download the latest version without thinking. I’ve currently got 4 versions of R on my laptop, 270 packages, 2 versions of Java, and a number of other open source tools. While being on the cutting edge is part of my job, this conflicts with the need for strict audit and reproducibility requirements that we have for project work.

One problem with R is that due to the fast changing nature of CRAN it can be difficult to gain a consistent combination of packages across your team and production servers. The R community has responded to this problem with a number of noteworthy packages for managing package libraries, such as packrat, checkpoint, switchr and our own pkgsnap. Another approach is to use the MRAN mirror to freeze CRAN to a particular date.
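
To make that concrete, here is a minimal sketch of the MRAN/checkpoint approach (the snapshot date and package name are purely illustrative, not ones used by Mango):

# Option 1: point the CRAN repo at a frozen MRAN snapshot
options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2015-12-01"))
install.packages("dplyr")  # installs the package version available on that date

# Option 2: let the checkpoint package scan the project and install
# every required package as it existed on a given date
library(checkpoint)
checkpoint("2015-12-01")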

A bigger problem is how R interacts with the various system dependencies you have installed. At Mango this is why we use continuous integration and unit testing to make sure our results are reproducible on dedicated build servers. Even this can leave you scratching your head when tests don’t match.

All this led us to look for a better way of working. We needed an environment that was easily reproducible, and more in line with the production environment we are deploying to. We’ve already been using Docker for some time so this was the natural choice.

Docker

As described in a previous post, Docker is designed to provide an isolated, portable and repeatable wrapper around your applications. We use this in a number of ways:

1. Reproducible environments

Each project can run inside its own container, completely sandboxed from the rest of your system. We have a number of base images, each built on specific R versions and provisioned with standard sets of packages (using our pkgsnap package) and RStudio Server. Each project can build on one of these images with any specific package dependencies. The recipe to build this image is stored in a Dockerfile that can be saved in the project directory. An example project Dockerfile is shown in this demonstration.

2. System dependencies

If there are system dependencies such as database connections or external libraries, then building an image with these installed makes it much easier to distribute the project to others. This also makes Docker a great way of trying a new technology without the pain of installing it on your system. For example the excellent Jupyter/all-spark-notebook has everything you need to get started with Spark from R, Python or Scala.

3. Scalability

Once you’re used to working in containers it can significantly lower the barrier to scaling up the compute power when needed. Your container will work just the same on your laptop or on a 32-core EC2 instance. You just spin up a node, pull the image and deploy your application. Multiple containers from the same image can be spawned across a grid in seconds, and a small-scale Spark cluster can be swapped out for a much larger one.

Vagrant

For larger software development projects we also use Vagrant as a tool for reproducible development environments. As described in an earlier post, Vagrant is a set of command line tools for managing virtual machines (VMs). It gives each project a dedicated VM that is consistent across the development team, while adding only a small file to version control.

More resources

I recently gave a presentation on this topic at the LondonR user group. You can find the slides here. Some example Docker images are available on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


How to Run A Shiny App in the Cloud Using Tutum, Digital Ocean and Docker Containers


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

Via RBloggers, I spotted this post on Deploying Your Very Own Shiny Server. I’ve been toying with the idea of running some of my own Shiny apps, so that post provided a useful prompt, though way too involved for me;-)

So here’s what seems to me to be an easier, rather more pointy-clicky, wire-things-together way of doing it using Docker containers (though it might not seem that much easier to you the first time through!). The recipe includes: github, Dockerhub, Tutum and Digital Ocean.

To begin with, I created a minimal Shiny app to allow the user to select a CSV file, upload it to the app and display it. The ui.R and server.R files, along with whatever else you need, should be placed into an application directory, for example shiny_demo within a project directory, which I’m confusingly also calling shiny_demo (I should have called it something else to make it a bit clearer – for example, shiny_demo_project).
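
A minimal sketch of that kind of app (my own illustration of what such ui.R and server.R files might contain, not the author’s exact code):

# ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("CSV upload demo"),
  sidebarLayout(
    sidebarPanel(fileInput("file1", "Choose a CSV file")),
    mainPanel(tableOutput("contents"))
  )
))

# server.R
library(shiny)
shinyServer(function(input, output) {
  output$contents <- renderTable({
    if (is.null(input$file1)) return(NULL)   # nothing uploaded yet
    read.csv(input$file1$datapath)           # read and display the uploaded CSV
  })
})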

The shiny server comes from a prebuilt docker container on dockerhub – rocker/shiny.

This shiny server can run several Shiny applications, though I only want to run one: shiny_demo.

I’m going to put my application into its own container. This container will use the rocker/shiny container as a base, and simply copy my application folder into the shiny server folder from which applications are served. My Dockerfile is really simple and contains just two lines – it looks like this and goes into a file called Dockerfile in the project directory:

FROM rocker/shiny
ADD shiny_demo /srv/shiny-server/shiny_demo

The ADD command simply copies the contents of the child directory into a similarly named directory in the container’s /srv/shiny-server/ directory. You could add as many applications as you want to the server, as long as each is in its own directory.


For example, if I have a second application, shiny_other_demo, I can add it to my container using:

ADD shiny_other_demo /srv/shiny-server/shiny_other_demo

The next thing I need to do is check my shiny_demo project into Github. (I don’t have a how-to on this, unfortunately…) In fact, I’ve checked my project in as part of another repository (docker-containers).


The next step is to build a container image on DockerHub. If I create an account and log in to DockerHub, I can link my Github account to it.

I can then create an Automated Build that will build a container image from my Github repository. First, identify the repository on my linked Github account and name the image:


Then add the path to the project directory that contains the Dockerfile for the image you’re interested in:


Click on Trigger to build the image the first time. In the future, every time I update that folder in the repository, the container image will be rebuilt to include the updates.

So now I have a Docker container image on Dockerhub that contains the Shiny server from the rocker/shiny image and a copy of my shiny application files.

Now I need to go to Tutum (also part of the Docker empire), which is an application for launching containers on a range of cloud services. If you link your Digital Ocean account to Tutum, you can use Tutum to launch Docker containers from Dockerhub on a Digital Ocean droplet.

Within tutum, you’ll need to create a new node cluster on Digital Ocean:

(Notwithstanding the below, I generally go for a single 4GB node…)

Now we need to create a service from a container image:

I can find the container image that I previously built on Dockerhub and want to deploy on the cluster:


Select the image and then configure it – you may want to rename it, for example. One thing you definitely need to do though is tick to publish the port – this will make the shiny server port visible on the web.


Create and deploy the service. When the container is built, and has started running, you’ll be told where you can find it.


Note that if you click on the link to the running container, the default URL starts with tcp:// which you’ll need to change to http://. The port will be dynamically allocated unless you specified a particular port mapping on the service creation page.

To view your shiny app, simply add the name of the folder the application is in to the URL.

When you’ve finished running the app, you may want to shut the container down – and more importantly perhaps, switch the Digital Ocean droplet off so you don’t continue paying for it!


As I said at the start, the first time round seems quite complicated. After all, you need to: create the Shiny app, check it into Github, build a container image on Dockerhub, create a node cluster on Digital Ocean from within Tutum, and then create and deploy the service on it.

(Actually, you can miss out the dockerhub steps, and instead link your github account to your tutum account and do the automated build from the github files within tutum: Tutum automatic image builds from GitHub repositories. The service can then be launched by finding the container image in your tutum repository)

However, once you do have your project files in github, you can then easily update them and easily launch them on Digital Ocean. In fact, you can make it even easier by adding a deploy to tutum button to a project README.md file in Github.

See also: How to run RStudio on Digital Ocean via Tutum and How to run OpenRefine on Digital Ocean via Tutum.

PS to test the container locally, I launch a docker terminal from Kitematic, cd into the project folder, and run something like:

docker build -t psychemedia/shinydemo .
docker run --name shinydemo -i -t psychemedia/shinydemo

I can then set the port map and find a link to the server from within Kitematic.

To leave a comment for the author, please follow the link and comment on their blog: OUseful.Info, the blog... » Rstats.


How to Learn R


There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.

That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.

Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience. So if you have suggestions for improvement please email tal.galili@gmail.com with your feedback.

Learning Path

Getting started:  The basics of R

Setting up your machine

R packages

Importing your data into R

Data Manipulation

Data Visualization

Data Science & Machine Learning with R

Reporting Results in R

Next steps

Getting started:  The basics of R


The best way to learn R is by doing. In case you are just getting started with R, this free Introduction to R tutorial by DataCamp is a great resource, as is its successor Intermediate R programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises, with helpful hints along the way so you don’t get stuck.

Another free online interactive learning tutorial for R is available on O’Reilly’s Code School website, called try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R and (ii) selecting a course from the course library. If you want to start right away without needing to install anything, you can also choose the online version of swirl.

There are also some very good MOOCs available on edX and Coursera that teach you the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8-hour course that focuses on the fundamentals and basic syntax of R. At Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!

If you instead prefer to learn R via a written tutorial or book there is plenty of choice. There is the introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.

Setting up your machine

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.

Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).


Next to RStudio you also have Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface you can have a look at R Commander (aka Rcmdr), or Deducer.

R packages


R packages are the fuel that drive the growth and popularity of R. R packages are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you will first have to install it. Some packages, like the base package, are automatically installed when you install R. Other packages, like for example the ggplot2 package, won’t come with the bundled R installation but need to be installed.

Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R, using the install.packages function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as, for example, TimeSeries.

Next to CRAN you also have Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.
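
For example (the package and repository names below are just illustrations):

# install a package from CRAN and load it
install.packages("ggplot2")
library(ggplot2)

# install a package straight from a GitHub repository with devtools
install.packages("devtools")
devtools::install_github("hadley/readr")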

Finding a package can be hard, but luckily you can easily search packages from CRAN, github and bioconductor using Rdocumentation, inside-R, or you can have a look at this quick list of useful R packages.

To end, once you start working with R, you’ll quickly find out that R package dependencies can cause a lot of headaches. Once you get confronted with that issue, make sure to check out packrat (see video tutorial) or checkpoint. When you need to update R and you are using Windows, you can use the updateR() function from the installr package.

Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases and web data.


Getting different types of data into R often requires a different approach. To learn more in general on how to get different data types into R you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.

  • Flat files are typically simple text files that contain tabular data. The standard distribution of R provides functionality to import these flat files into R as a data frame with functions such as read.table() and read.csv() from the utils package. Specific R packages for importing flat files are readr, a fast and very easy to use package that is less verbose than utils and multiple times faster (more information), and data.table, whose fread() function is great for importing and munging data into R (a short comparison follows this list).
  • Software packages such as SAS, STATA and SPSS use and produce their own file types. The haven package by Hadley Wickham can deal with importing SAS, STATA and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, STATA and SPSS files but also more exotic formats like Systat and Weka, and it can export data again to various formats. (Tip: if you’re switching from SAS, SPSS or STATA to R, check out Bob Muenchen’s tutorial (subscription required).)
  • The packages used to connect to and import from a relational database depend on the type of database you want to connect to. Suppose you want to connect to a MySQL database: you will need the RMySQL package. Others are, for example, the RPostgreSQL and ROracle packages. The R functions you can then use to access and manipulate the database are specified in another R package called DBI.
  • If you want to harvest web data using R you need to connect R to resources online using APIs or through scraping with packages like rvest. To get started with all of this, there is this great resource freely available on the blog of Rolf Fredheim.
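
To illustrate the flat-file options above (the file name is hypothetical):

# base R (utils)
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# readr: same idea, faster and with friendlier defaults
library(readr)
flights <- read_csv("flights.csv")

# data.table: fread() auto-detects separators and column types
library(data.table)
flights <- fread("flights.csv")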

Data Manipulation

Turning your raw data into well structured data is important for robust analysis, and to make data suitable for processing. R has many built-in functions for data processing, but they are not always that easy to use. Luckily, there are some great packages that can help you:

  • The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
  • If you want to do string manipulation, you should learn about the stringr package. The vignette is very understandable, and full of useful examples to get you started.
  • dplyr is a great package when working with data-frame-like objects (in memory and out of memory). It combines speed with a very intuitive syntax. To learn more on dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet (a short example of the dplyr syntax follows this list).
  • When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time. Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference.
  • Chances are you will find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier to work with. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
  • Base R has limited functionality to handle time series data. Fortunately, there are packages like zoo, xts and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages, and how to work with time series data in R.
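
As a short example of the dplyr syntax referred to above, using the built-in mtcars data:

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%             # keep only 4-cylinder cars
  group_by(gear) %>%               # group them by number of gears
  summarise(avg_mpg = mean(mpg))   # average fuel economy per group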

If you want to have a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or see the Data Wrangling with R video by RStudio. In case you run into troubles with handling your data frames, check 15 easy solutions to your data frame problems.

Data Visualization

One of the things that makes R such a great tool is its data visualization capabilities. For performing visualizations in R, ggplot2 is probably the most well known package and a must learn for beginners! You can find all relevant information to get you started with ggplot2 on http://ggplot2.org/ and make sure to check out the cheatsheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see tutorial (subscription required)), googleVis to interface with google charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have some issues with plotting your data this post might help you out.
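
A tiny ggplot2 example (again using the built-in mtcars data) to give a flavour of the syntax:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +    # scatter plot of weight against fuel economy
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")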

In R there is also a whole task view dedicated to handling spatial data, with packages that allow you to create beautiful maps.


To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps. Alternatively you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.

You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph or map like a glove. If you want to achieve this for your visualizations as well, then dive into the RColorBrewer package and ColorBrewer.

One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? Just watch this video).
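
For instance, a leaflet map takes only a few lines (the coordinates are an arbitrary example):

library(leaflet)

leaflet() %>%
  addTiles() %>%                              # OpenStreetMap background tiles
  addMarkers(lng = -0.1278, lat = 51.5074,    # example marker: central London
             popup = "Hello, htmlwidgets!")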

If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:

Alternatively, if you prefer a good read:

Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart and randomForest. Luckily, there are some great learning resources for these packages and machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for some step-by-step tutorials that guide you through a real life example there is the Kaggle Machine Learning course or you can have a look at Wiekvoet’s blog.
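
As a small taste of the kind of model these packages fit, here is randomForest on the built-in iris data:

library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)  # classify iris species
print(fit)                  # out-of-bag error estimate and confusion matrix
predict(fit, head(iris))    # predicted classes for the first few rows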

Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in HTML, Word, pdf, ioslides, etc. format. You can even create interactive R Markdown documents using Shiny. This 4 hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.
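
Rendering an R Markdown document from the console is a one-liner (the file name is hypothetical):

library(rmarkdown)

# knit report.Rmd and convert it to a standalone HTML document
render("report.Rmd", output_format = "html_document")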

Next to R markdown, you should also make sure to check out  Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or Javascript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the essentials of Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.
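
A minimal Shiny app is only a handful of lines, for example:

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal sample")  # redraws when the slider moves
  })
}

shinyApp(ui, server)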


Next steps

Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).
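
For example, Rcpp lets you compile a small C++ function inline and call it from R:

library(Rcpp)

# compile a C++ function and expose it to the R session
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
    return total;
  }
')

sumSquares(c(1, 2, 3))  # returns 14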

After spending some time writing R code (and you became an R-addict), you’ll reach a point that you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it you need to read R packages, an upcoming book by Hadley Wickham that is already available for free on the web.

If you want to start learning on the inner workings of R and improve your understanding of it, the best way to get you started is by reading Advanced R.

Finally, come visit us again at R-bloggers.com to read of the latest news and tutorials from bloggers of the R community.

Phyllotaxis By Shiny


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Antonio, you don’t know what empathy is! (Cecilia, my beautiful wife)

Spirals are nice. In the wake of my previous post I have built a Shiny app to explore the patterns generated by changing the angle, shape and number of points of Fermat’s spiral equation. You can obtain an almost infinite number of images. This is just an example:
(example image: phyllotaxis pattern)

I like thinking in imaginary flowers. This is why I called this experiment Phyllotaxis.
More examples:

(gallery of further example images)

Just one comment about the code: I wrote the Shiny app in a single R file, as this guy suggested to me some time ago in response to this post.

This is the code. Do your own imaginary flowers:

library(shiny)
library(ggplot2)

CreatePlot = function (ang=pi*(3-sqrt(5)), nob=150, siz=15, alp=0.8, sha=16, col="black", bac="white") {
  ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180), aes(x=r*cos(t), y=r*sin(t)))+
    geom_point(colour=col, alpha=alp, size=siz, shape=sha)+
    scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    theme(legend.position="none",
          panel.background = element_rect(fill=bac),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
}

shinyApp(
  ui = fluidPage(
    titlePanel("Phyllotaxis by Shiny"),
    fluidRow(
      column(3,
        wellPanel(
          selectInput("col", label = "Colour of points:", choices = colors(), selected = "black"),
          selectInput("bac", label = "Background colour:", choices = colors(), selected = "white"),
          selectInput("sha", label = "Shape of points:",
                      choices = list("Empty squares" = 0, "Empty circles" = 1, "Empty triangles"=2,
                                     "Crosses" = 3, "Blades"=4, "Empty diamonds"=5,
                                     "Inverted empty triangles"=6, "Bladed squares"=7,
                                     "Asterisks"=8, "Crossed diamonds"=9, "Crossed circles"=10,
                                     "Stars"=11, "Cubes"=12, "Bladed circles"=13,
                                     "Filled squares" = 15, "Filled circles" = 16, "Filled triangles"=17,
                                     "Filled diamonds"=18), selected = 16),
          sliderInput("ang", label = "Angle (degrees):", min = 0, max = 360, value = 180*(3-sqrt(5)), step = .05),
          sliderInput("nob", label = "Number of points:", min = 1, max = 1500, value = 60, step = 1),
          sliderInput("siz", label = "Size of points:", min = 1, max = 60, value = 10, step = 1),
          sliderInput("alp", label = "Transparency:", min = 0, max = 1, value = .5, step = .01)
        )
      ),
      mainPanel(
        plotOutput("Phyllotaxis")
      )
    )
  ),
  server = function(input, output) {
    output$Phyllotaxis = renderPlot({
      CreatePlot(ang=input$ang, nob=input$nob, siz=input$siz, alp=input$alp, sha=as.numeric(input$sha), col=input$col, bac=input$bac)
    }, height = 650, width = 650)
  }
)

To leave a comment for the author, please follow the link and comment on their blog: Ripples.


GRUPO: Shiny App For Benchmarking Pubmed Publication Output


(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service.

The What

GRUPO (Gauging Research University Publication Output) is a Shiny app that provides side-by-side benchmarking of American research university publication activity.

The How

The code behind the app is written in R, and leverages the NCBI Eutils API via the rentrez package interface.
The methodology is fairly simple:
  1. Build the search query in Pubmed syntax based on user input parameters.
  2. Extract total number of articles from results.
  3. Output a visualization of the total counts for both selected institutions.
  4. Extract unique article identifiers from results.
  5. Output the number of article identifiers that match (i.e. “collaborations”) between the two selected institutions.

Build Query

The syntax for searching Pubmed relies on MEDLINE tags and boolean operators. You can peek into how to use the keywords and build these kinds of queries with the Pubmed Advanced Search Builder.
GRUPO builds its queries based on two fields in particular: “Affiliation” and “Date.” Because this search term will have to be built multiple times (at least twice to compare results for two institutions) I wrote a helper function called build_query():
# use %Y/%m/%d (e.g. 1999/02/14) date format for startDate and endDate arguments

build_query = function(institution, startDate, endDate) {

  if (grepl("-", institution)) {
    split_name = strsplit(institution, split="-")
    search_term = paste(split_name[[1]][1], '[Affiliation]',
                        ' AND ',
                        split_name[[1]][2],
                        '[Affiliation]',
                        ' AND ',
                        startDate,
                        '[PDAT] : ',
                        endDate,
                        '[PDAT]',
                        sep='')
    search_term = gsub("-", "/", search_term)
  } else {
    search_term = paste(institution,
                        '[Affiliation]',
                        ' AND ',
                        startDate,
                        '[PDAT] : ',
                        endDate,
                        '[PDAT]',
                        sep='')
    search_term = gsub("-", "/", search_term)
  }

  return(search_term)
}
The if/else logic in there accommodates cases like “University of North Carolina-Chapel Hill”, which otherwise wouldn’t search properly in the affiliation field. This method does depend on the institution name having its specific locale separated by a - symbol. In other words, if you passed in “University of Colorado/Boulder” you’d be stuck.
So by using this function for the University of Virginia from January 1, 2014 to January 1, 2015 you’d get the following term:
University of Virginia[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
And for University of Texas-Austin over the same dates you get the following term:
University of Texas[Affiliation] AND Austin[Affiliation] AND 2014/01/01[PDAT] : 2015/01/01[PDAT]
The advantage of using this function in a Shiny app is that you can pass the institution names and dates dynamically. Users enter the input parameters for which date range and institutions to search via the widgets in the ui.R script.
For the app to work, there has to be one date picker widget and two text inputs (one for each of the two institutions) in the ui.R script. The corresponding server.R script would have a reactive element wrapped around the following:
search_term = build_query(institution = input$institution1, startDate = input$dates[1], endDate = input$dates[2])
search_term2 = build_query(institution = input$institution2, startDate = input$dates[1], endDate = input$dates[2])

Run Query

With the query built, you can run the search in Pubmed. The entrez_search() function from the rentrez package lets us get the information we want. This function returns four elements:
  • ids (unique Pubmed identifiers for each article in the result list)
  • count (total number of results)
  • retmax (maximum number of results that could have been returned)
  • file (the actual XML record containing the values above)
The following code returns total articles for each of two different searches:
affiliation_search = entrez_search("pubmed", search_term, retmax = 99999)
affiliation_search2 = entrez_search("pubmed", search_term2, retmax = 99999)

total_articles = as.numeric(affiliation_search$count)
total_articles2 = as.numeric(affiliation_search2$count)

Plot Results

The code above lives in the server.R script and is the functional workhorse for the app. But to adequately represent the benchmarking, GRUPO needed some kind of plot.
We can combine the total articles for each institution with the institution names, which we used to build the search terms. The result is a tiny (2 x 2) data frame of “Institution” and “Total.Articles” variables. Nothing fancy. But it does the trick.
With a data frame in hand, we can load it into ggplot2 and do some very simple barplotting:
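
A sketch of what that barplot code might look like (my reconstruction, not GRUPO’s actual server.R code; it simply pairs the two institution names with the two article counts computed above):

library(ggplot2)

results <- data.frame(
  Institution    = c(input$institution1, input$institution2),
  Total.Articles = c(total_articles, total_articles2)
)

ggplot(results, aes(x = Institution, y = Total.Articles, fill = Institution)) +
  geom_bar(stat = "identity") +      # one bar per institution
  theme(legend.position = "none")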

Output Collaborations

Although the primary function of GRUPO is side-by-side benchmarking, it does have at least one other feature so far.
The inclusion of the “ids” object in the query result makes it possible to do something else. You can compare how many of the article identifiers match between two queries. That should represent the number of “collaborations” (i.e. how many of the publications share authorship) between individuals at the two institutions.
To get the total number of collaborations, we can do a simple calculation of length on the vector of intersections between the two search results:
collaboration_count = length(intersect(affiliation_search$ids, affiliation_search2$ids))
By placing the search call inside a reactive element within Shiny, GRUPO can store the results (“count” and “ids”) rather than repeating the query for each purpose.
NB This approach to assessing collaboration counts is spurious when considering articles published before October 2013, which was when the National Library of Medicine (NLM) began including affiliation tags for all authors.

The Next Steps

What’s next? There are a number of potential new features for GRUPO. It’s worth pointing out that a discussion of these possibilities will likely highlight some of the limitations of the app as it exists now.
For example, it would be advantageous to include other “research output” data sources. GRUPO currently only accounts for publications indexed in Pubmed. That’s a fairly one-dimensional representation of scholarly activities. Information about publications indexed elsewhere, funding awarded or altmetric indicators isn’t accounted for.
And neither is any information about the institutions. While all of them are considered to have very high research activity one could argue that some are “apples” and some are “oranges” based on discrepancies in budgets, number of faculty members, student body size, etc. A more thorough benchmarking tool might model research universities based on additional administrative data, and restrict comparisons to “similar” institutions.
So GRUPO is still a work in progress. But it’s a solid example of a Shiny app that effectively leverages an API as its primary data source. Feel free to post a comment if you have any feedback or questions.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.


R / Shiny User Survey


(This article was first published on R-Chart, and kindly contributed to R-bloggers)

I have been using R for a few years and have been amazed at the variety of people who use the language.  R has an unusually diverse and creative community.   If you are an R user, I would really like to learn a bit more about how you use R and Shiny technologies.  If you use R but don’t use Shiny – that’s OK – please fill out the survey as well!

The survey is on google docs.  No login required:  http://goo.gl/forms/TtA71XdsUc

Please forward this link on to others who use R.  More data is better!

I plan on providing an analysis of the results in an upcoming post.  Thanks for participating!

To leave a comment for the author, please follow the link and comment on their blog: R-Chart.


RTutor: Public Procurement Auctions: Design, Outcomes and Adaption Costs


(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

As an economist, I consider public procurement of construction projects a very interesting topic. While several firms may initially bid for a project and propose prices, the news in Germany are regularly filled with construction projects whose costs went far beyond those that were initially agreed upon. For example, Wikipedia states that when the city of Hamburg initially signed the contracts for its new concert house, the "Elbphilharmonie", payments of 114 million euros were agreed upon. In 2013, Hamburg's mayor stated that expected payments were already 789 million euros. Given that requirements for the construction are quite often adjusted after contracts are signed, the design of an intelligent mechanism for procurement and ex-post compensation is by no means trivial.

Frederik Collin has created a very nice interactive RTutor problem set that allows you to explore the design and outcomes of procurement auctions for constructions and repairs of Californian highways. It was created as part of his Master's thesis at Ulm University and is based on the article

"Bidding for Incomplete Contracts: An Empirical Analysis of Adaption Costs", by Patrick Bajari, Stephanie Houghton and Steven Tadelis, American Economic Review, 2014

The article and problem set are based on an awesome data set that contains engineer estimates, detailed itemized bids, background data on bidders, and information on ex-post adjustments and compensations of all Californian highway procurement auctions for several years.

In the interactive problem set, you first learn, with a selected example auction, details about the auction design that uses engineer estimates and itemized bids. You then examine bidding behavior, e.g. how markups across auctions depend on firm characteristics and competitive pressure. You also study to what extent firms systematically skew their bids to exploit systematic mistakes in official engineer estimates. Finally, the data on ex-post changes and negotiated compensations is studied. It is estimated that, possibly due to high transaction and haggling costs, firms anticipate ex-post adaptions to be considerably more costly than what they are compensated for.

The problem set incorporates some novel RTutor features, such as quizzes and output of data frames as HTML tables by default…


Like previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked and you can get hints on how to proceed.


To install the problem set locally, follow the instructions here:

https://github.com/Fcolli/RTutorProcurementAuction

There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours of total usage time per month, so it may be greyed out when you click on it.)

https://fcolli.shinyapps.io/RTutorProcurementAuction

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts.


IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.

(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the wayside, and I suspect the Jupyter container (IPython notebook server, as was) probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-(

Anyway, the IBM Data Scientist Workbench now sits alongside things like KMi’s longstanding KMi Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes, I know there are cost implications… but, erm, sponsorship, cough… get-started tokens then PAYG, cough…

It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:


The toolbar also suggests that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is on the cards.

The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…


So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)

See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.

To leave a comment for the author, please follow the link and comment on their blog: OUseful.Info, the blog... » Rstats.


Marathon Races Shiny App


(This article was first published on More or Less Numbers, and kindly contributed to R-bloggers)
About a year ago I posted about men’s and women’s marathon (and longer) distance races from the Arrs.net dataset.  In the meantime, Shiny development and the open source announcement of plot.ly have brought data visualization to the next level.  As an avid (at least former) runner, exploring marathon data is interesting at both the personal and “data science” (or is that personal too?) levels.  Thus, I finished a Shiny app that explores this dataset from 2014.  Unfortunately, 2015 data is not being updated for one reason or another, but 2014 provides a lot of observations about marathon and longer distance races.

Click the link below to access the app.

www.datavaapps.shinyapps.io/ARRS_dashboard

The values can be toggled between months of 2014, and a searchable table of all the data sits below the graph. You will notice many of the points are small ultra-marathons around the world. Plot.ly provides nice interactive graph controls, which appear when hovering over the upper right corner of the graph.

Thanks to RStudio for all their work on Shiny and to Plot.ly for their plotly package and charting library.

To leave a comment for the author, please follow the link and comment on their blog: More or Less Numbers.


6 benefits of learning Tableau (a BI tool for interactive visualization)


Guest post by Kirill Eremenko

Source: www.tableau.com

I teach Tableau through Udemy, and would like to offer you the following six points to help you decide if Tableau is a tool you should learn:

(If you decide you are interested, readers of R-bloggers may take the course for $15 instead of $300 until December 24th, just click here to make use of this offer)

1. Quickly create interactive plots

Volume, variety, and velocity, right? Today the 3V’s not only define Big Data, but also accurately summarize the projects being thrown at data scientists.

There are lots of them; every business problem is unique and they are coming at you at incredible speeds. I’ve had situations where 2-3 stakeholders came to me with multiple project requests on a daily basis!

So how do you deal with this onslaught of work and produce a great deliverable every time? One way is to get very good at ggplot2, shiny, htmlwidgets, dygraphs, googleVis & co. and hope that your pre-built templates fit the next project that comes your way so you can save some precious development time.

Another way is to use Tableau’s drag-n-drop interface to build many (beautiful) visuals in minutes. The interface can handle endless variations and helps tackle just about any project thrown your way with ease.


When I first start a project what I really want is to SEE my data. Ever since I started using Tableau, the first thing I do on a new project is throw all my data into this magic box. Drag-drop and I can see the trends, drag-drop and there are the anomalies, drag-drop and hmmm, that’s interesting, let me drill into that further… You get the point.

2. Build interactive dashboards using a GUI

With Tableau you can build interactive dashboards to empower your clients. It’s so easy that this has become my default option.

Now when somebody comes to me after project delivery and asks, “Can you do a very quick adjustment for me? Pretty pleeeeeease.”  I point them to their dashboard and say “You wanted this? Here you go. It’s interactive, so make as many changes as you like!” (like building a Shiny app using a GUI interface)

The best part is that dashboards can be deployed at enterprise level (Tableau Server) and can be viewed and interrogated on a laptop, tablet, and even mobile. Managing executives of your company are going to be your new best friends! It’s self-Service Analytics at its finest.


Source: www.tableau.com

3. Connects to R

You can perform basic calculations and even run some simple stats in Tableau itself. But if that’s not enough and heavy artillery analytics is required, simply run your models in R, import results into Tableau and visualize away! Need to leverage R computations in real-time? Not a problem! Tableau has in-built support for R via Rserve.
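
On the R side, the setup is essentially just starting an Rserve instance that Tableau can connect to (a minimal sketch; the calculated field shown in the comment is an illustrative example of Tableau syntax, not part of this post):

# start an Rserve instance for Tableau to talk to
install.packages("Rserve")   # one-time install
library(Rserve)
Rserve()                     # listens on port 6311 by default

# Tableau then calls R from a calculated field, e.g. (Tableau syntax, not R):
# SCRIPT_REAL("mean(.arg1)", SUM([Sales]))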

These programs complement each other well and this allows you to harness the power of each for a great end result.

4. Growth

For the third time, Tableau has been named a leader in the Magic Quadrant for Business Intelligence and Analytics Platforms report by Gartner:


Source: Gartner (February 2015)

This doesn’t come as a surprise. With incredible year-on-year growth and record adoption rates globally, Tableau is setting itself up for long-term success. It’s completely expected that the company delivers one of the best analytics and visualization platforms out there.

And when you join the Tableau community, you will set yourself up for long-term success too. Akin to R followers, Tableau fans are extremely passionate about the tool. They are there to help each other and their numbers are growing rapidly.

In fact, Tableau is becoming so popular that many organizations require Tableau on your resume to even apply for their data science positions. I won’t be surprised if in 5 years this will be the norm. Just check out the growth for Tableau search terms on Google Trends:


Source: Google Trends

5. Short learning curve

Tableau is extremely easy to learn. It’s such an intuitive tool that you can pick it up on the fly. With the right type of training, in less than 7 hours you will be seamlessly creating fully interactive MIS dashboards.


6. Pricing

The Professional Version of Tableau is priced with enterprises in mind: $1,999 + maintenance fee per license. But here’s the good news: Tableau has a completely free version of their software called Tableau Public.

With Tableau Public you cannot connect to as many Data Sources as with Tableau Professional and all visualizations have to be saved on a public server. Apart from that, Tableau Public is capable of producing the same incredible visualizations and dashboards as Tableau Professional, making it a great solution for learning the software.

Are you interested in learning Tableau with me?

Tableau is coming, and if you envision a successful career as a data scientist, you will inevitably run into it. Businesses are requiring it as a prerequisite for hiring, and clients are demanding the level of simplicity and interactivity that can be produced with this amazing tool.

If you are interested in learning Tableau you can join the five-star professional Udemy training in Tableau via this exclusive invitation: Click here to join the Tableau training. This is a Tableau A-Z training which will take you step-by-step from your first data connection to being fluent with fully interactive MIS dashboards in under 7 hours. The course is extremely hands-on, with ample practice exercises, case studies, and quizzes. Go from novice to expert in less than a single day and start wow-ing your clients with amazing visualizations, all with less work and effort than traditional data analysis methods.

Here is what some of the students say about this training:

The course is well organized, concise and effectively manages to delve into the introduction of Tableau, making the analysis more fun. Happy to analyze! -Jigar Shah 

This course is an excellent introduction to Tableau in general and provides a framework to learn it in more detail. If you have experience with other analysis/visualization tools, the format, pace, and examples make learning the basics elements of this tool much easier. -Gregg Pruitt 

This course on Tableau will get you completely excited about the powerful potential of your everyday reports. A life-changer. -Irving Weiss

Want to learn more? Click here and enroll in the Tableau class (readers of R-bloggers may take the course for $15 instead of $300 until December 24th, just click here to make use of this offer)

See you in class,

Kirill Eremenko

R ahp package on github


(This article was first published on ipub » R, and kindly contributed to R-bloggers)

AHP lets you analyse complex decision making problems. We have recently released the initial version of the R ahp package on github: gluc/ahp.

What is AHP?

The Analytic Hierarchy Process is a decision making framework developed by Thomas Saaty. Read this entry on Wikipedia for more information.

There is commercial software available to use this methodology for complex decision making problems. In R, there are a few packages that help with the calculation part. However, there has not been a framework to model entire AHP problems. This is the goal of the ahp package.

How to get started

For more information, see the package vignette, either using vignette("car-example"), or on rpubs. That vignette models the well-known AHP example, which is, for example, explained here.

To install the package and read the vignette, you want to do this:

devtools::install_github("gluc/ahp", build_vignettes = TRUE)
vignette("car-example", package = "ahp")

Modeling Analytic Hierarchy Process problems

Once the package is installed, you can run the sample yourself, like so:

library(ahp)
ahpFile <- system.file("extdata", "car.ahp", package="ahp")
carAhp <- LoadFile(ahpFile)
Calculate(carAhp)

This illustrates the basic workflow with this package, which is:

  1. specify your ahp problem in an ahp file
  2. load an ahp file, using LoadFile
  3. calculate model, using Calculate
  4. output model analysis, either using GetDataFrame or using ShowTable

ahp File Format

The entire ahp problem is specified in a single file, in YAML format. For an example, see here.

Analysis

GetDataFrame

There are two options for the analysis. The first one prints to the console:

GetDataFrame(carAhp)

What you will see is something like this:

Weight Odyssey Accord Sedan  CR-V Accord Hybrid Element Pilot Consistency
1  Buy Car                    100.0%   21.8%        21.6% 16.3%         15.1%   14.7% 10.6%        7.4%
2   ¦--Cost                    51.0%    5.7%        12.3% 11.7%          5.8%   12.5%  3.0%        1.5%
3   ¦   ¦--Purchase Price      24.9%    2.3%         6.1%  6.1%          0.6%    9.1%  0.6%        6.8%
4   ¦   ¦--Fuel Cost           12.8%    2.0%         2.4%  2.1%          2.7%    1.9%  1.7%        0.0%
5   ¦   ¦--Maintenance Cost     5.1%    0.3%         1.8%  0.5%          1.6%    0.4%  0.4%        2.3%
6   ¦   °--Resale Value         8.2%    1.1%         1.9%  2.9%          0.9%    1.1%  0.3%        3.2%
7   ¦--Safety                  23.4%   10.2%         5.1%  0.8%          5.1%    0.5%  1.8%        8.1%
8   ¦--Style                    4.1%    0.3%         1.5%  0.6%          1.5%    0.1%  0.2%       10.2%
9   °--Capacity                21.5%    5.7%         2.8%  3.1%          2.8%    1.5%  5.6%        0.0%
10      ¦--Cargo Capacity       3.6%    0.8%         0.3%  0.7%          0.3%    0.7%  0.7%        0.4%
11      °--Passenger Capacity  17.9%    4.9%         2.4%  2.4%          2.4%    0.8%  4.9%        0.0%

The Odyssey comes out first, but only slightly better than the Accord Sedan.

ShowTable

The ShowTable method displays the same analysis as an html table, using color codes:

ShowTable(carAhp)

(screenshot: ShowTable output)

Here, it’s easy to see that the Odyssey is more expensive than the Accord Sedan; however, Safety and Passenger Capacity more than make up for this.

Also note the exclamation mark at the Style Consistency: A consistency ratio of above 10% is considered inconsistent, and we might want to review the style preferences in the ahp file.

As a side note: This table is generated with the fantastic formattable package.

Feedback and Future Developments

This is an early version, and thus feedback is more than welcome. You can ask questions and open issues directly on the github issues page. Also, I’d appreciate it if you star the package on github if you like it.

The plan is to add a few test cases, error handling, and submit it to CRAN. A future development I am currently considering is to add a simple Shiny app to the package, letting you specify and tweak your file directly in your web browser, and showing the analysis also in your Browser.

The post R ahp package on github appeared first on ipub.

To leave a comment for the author, please follow the link and comment on their blog: ipub » R.


Analyzing “Twitter faces” in R with Microsoft Project Oxford


(This article was first published on Longhow Lam's Blog » R, and kindly contributed to R-bloggers)

Introduction

In my previous blog post I used the Microsoft Translator API in my BonAppetit Shiny app to recommend restaurants to tourists. I’m getting a little bit addicted to the Microsoft APIs; they can be fun to use :-). In this blog post I will briefly describe some of the Project Oxford APIs from Microsoft.

The APIs can be called from within R, and if you combine them with other APIs, for example Twitter’s, then interesting “Twitter face” analyses can be done. See my “TweetFace” Shiny app to analyse faces that can be found on Twitter.

Project Oxford

The APIs of Project Oxford can be categorized into:

  • Computer Vision,
  • Face,
  • Video,
  • Speech and
  • Language.

The free tier subscription provides 5000 API calls per month (with a rate limit of 20 calls per minute). I focused my experiments on the Computer Vision and Face APIs; a lot of functionality is available to analyze images, for example categorization of images, adult content detection, OCR, face recognition, gender analysis, age estimation and emotion detection.

Calling the APIs from R

The httr package provides very convenient functions to call the Microsoft APIs. You need to sign up first and obtain a key. Let’s do a simple test on Angelina Jolie using the face detect API.

[Image: Angelina Jolie, picture link]

library(httr)

faceURL = "https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair"
img.url = 'http://www.buro247.com/images/Angelina-Jolie-2.jpg'

faceKEY = '123456789101112131415'

mybody = list(url = img.url)

faceResponse = POST(
  url = faceURL, 
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)
faceResponse
Response [https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair]
Date: 2015-12-16 10:13
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.27 kB

If the call was successful a “Status: 200” is returned and the response object is filled with interesting information. The API returns the information as JSON which is parsed by R into nested lists.


AngelinaFace = content(faceResponse)[[1]]
names(AngelinaFace)
[1] "faceId"  "faceRectangle" "faceLandmarks" "faceAttributes"

AngelinaFace$faceAttributes
$gender
[1] "female"

$age
[1] 32.6

$facialHair
$facialHair$moustache
[1] 0

$facialHair$beard
[1] 0

$facialHair$sideburns
[1] 0

Well, the API recognized the gender and that there is no facial hair :-), but her age is underestimated: Angelina is 40, not 32.6! Let’s look at emotions. The emotion API has its own key and URL.


URL.emoface = 'https://api.projectoxford.ai/emotion/v1.0/recognize'

emotionKEY = 'ABCDEF123456789101112131415'

mybody = list(url = img.url)

faceEMO = POST(
  url = URL.emoface,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = emotionKEY)),
  body = mybody,
  encode = 'json'
)
faceEMO
AngelinaEmotions = content(faceEMO)[[1]]
AngelinaEmotions$scores
$anger
[1] 4.573111e-05

$contempt
[1] 0.001244121

$disgust
[1] 0.0001096572

$fear
[1] 1.256477e-06

$happiness
[1] 0.0004313129

$neutral
[1] 0.9977798

$sadness
[1] 0.0003823086

$surprise
[1] 5.75276e-06

A fairly neutral face. Let’s test some other Angelina faces.

[Image: more Angelina Jolie pictures]

Find similar faces

A nice piece of functionality of the API is finding similar faces. First, a list of faces needs to be created; then, with a ‘query face’, you can search for similar-looking faces in that list. Let’s look at the sexiest actresses.


## Scrape the image URLs of the actresses
library(rvest)

linksactresses = 'http://www.imdb.com/list/ls050128191/'

out = read_html(linksactresses)
images = html_nodes(out, '.zero-z-index')
imglinks = html_nodes(out, xpath = "//img[@class='zero-z-index']/@src") %>% html_text()

## additional information, the name of the actress
imgalts = html_nodes(out, xpath = "//img[@class='zero-z-index']/@alt") %>% html_text()

Create an empty list by calling the facelist API. You should specify a facelistId, which is placed as a request parameter after the facelist URL. My facelistId is “listofsexyactresses”, as shown in the code below.

### create an id and name for the face list
URL.face = "https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses"

mybody = list(name = 'top 100 of sexy actresses')

faceLIST = PUT(
  url = URL.face,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)
faceLIST
Response [https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses]
Date: 2015-12-17 15:10
Status: 200
Content-Type: application/json; charset=utf-8
Size: 108 B

Now fill the list with images. The API allows you to provide user data with each image; this can be handy for storing names or other info. For one image this works as follows:

i=1
userdata = imgalts[i]
linkie = imglinks[i]
face.uri = paste(
  'https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses/persistedFaces?userData=',
  userdata,
  sep = ""  # concatenate directly, matching the persistedFaces URL shown below
)
face.uri = URLencode(face.uri)
mybody = list(url = linkie )

faceLISTadd = POST(
  url = face.uri,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)
faceLISTadd
print(content(faceLISTadd))
Response [https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses/persistedFaces?userData=Image%20of%20Naomi%20Watts]
Date: 2015-12-17 15:58
Status: 200
Content-Type: application/json; charset=utf-8
Size: 58 B

$persistedFaceId
[1] '32fa4d1c-da68-45fd-9818-19a10beea1c2'

## status 200 is OK

Just loop over the 100 faces to complete the face list (a sketch of such a loop is below). With the list of images we can now perform a query with a new ‘query face’. Two steps are needed: first, call the face detect API to obtain a face ID. I am going to use the image of Angelina, but a different one than the image on IMDB.
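A minimal sketch of such a loop (reusing imglinks, imgalts and faceKEY from the code above; this is just an illustration, not necessarily the exact loop I used) could look like this:

## sketch: add every scraped image to the face list
for (i in seq_along(imglinks)) {
  face.uri <- URLencode(paste0(
    'https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses/persistedFaces?userData=',
    imgalts[i]
  ))
  faceLISTadd <- POST(
    url = face.uri,
    content_type('application/json'),
    add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
    body = list(url = imglinks[i]),
    encode = 'json'
  )
  Sys.sleep(3)  # stay below the free tier rate limit of 20 calls per minute
}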


faceDetectURL = 'https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair'
img.url = 'http://a.dilcdn.com/bl/wp-content/uploads/sites/8/2009/06/angelinaangry002.jpg'

mybody = list(url = img.url)

faceRESO = POST(
  url = faceDetectURL,
  content_type('application/json'), add_headers(.headers =  c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)
faceRESO
fID = content(faceRESO)[[1]]$faceId

With the face ID, query the face list with the “find similar” API. There is a confidence of almost 60%.


sim.URI = 'https://api.projectoxford.ai/face/v1.0/findsimilars'

mybody = list(faceID = fID, faceListID = 'listofsexyactresses' )

faceSIM = POST(
  url = sim.URI,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)
faceSIM
yy = content(faceSIM)
yy
[[1]]
[[1]]$persistedFaceId
[1] "6b4ff942-b216-4817-9739-3653a467a594"

[[1]]$confidence
[1] 0.5980769

The picture below shows some other matches.

[Image: other faces matched by the find similar API]

Conclusion

The APIs of Microsoft’s Project Oxford provide nice functionality for computer vision and face analysis. It’s fun to use them; see my ‘TweetFace’ Shiny app to analyse images on Twitter.

Cheers,

Longhow

 

To leave a comment for the author, please follow the link and comment on their blog: Longhow Lam's Blog » R.


Shiny https: Securing Shiny Open Source with SSL


(This article was first published on ipub » R, and kindly contributed to R-bloggers)

As described in my Shiny overview post, there are different versions of Shiny server. Among other limitations, the open source flavor does not come with built-in support for https and user access control. In this post, we explain how you can nevertheless turn your Shiny Open Source server into a Shiny https server.

This tutorial builds on previous tutorials, namely:

Setting up an AWS instance for R

Installing Shiny Server on AWS

In a future post, we will explain how you can secure Shiny Server Open Source with user/password access.

This tutorial builds on Amazon AWS, but it is easy to adapt it to other cloud services or a local machine.

Https, what’s that?

Https is a protocol that encrypts your communication with a web server. For a one-minute definition, see here. Https can be useful for two things:

  1. So nobody can read the communication between you and the web server
  2. So you can be sure that you are really really talking to the desired web server, and not to a fake (a so-called man-in-the-middle)

What is an SSL certificate?

An SSL certificate binds together a domain name, a publisher of data or services, and a cryptographic key. In simple terms, it makes sure that when you connect to www.gmail.com you are really connected with Google, and not with somebody who pretends to be Google.

An SSL certificate is required for any https communication. Typically, you would buy an SSL certificate from a certification authority like VeriSign, Comodo, digicert, or many others. They come in various flavors and prices, and your browser typically reacts differently based on the strength of the certificate (e.g. by displaying an orange lock, a green lock, or a warning page).

For the sake of this tutorial, we will create our own SSL certificate. It will make sure that the communication to our Shiny server is encrypted. However, most browsers will display a warning when you access your Shiny app, because no certification authority has checked your identity. Feel free, however, to replace the SSL certificate with a commercially obtained one.

Why would I want to have a Shiny https server?

Without a password-protected Shiny server, there are no secrets, really. So, encrypting the communication does not seem to be overly important, right?

However, making sure that your users are indeed talking to you might be important. For example, consider the case of a finance researcher that publishes on a regular basis a widely used index on company data. If people are using this data e.g. for trading, they want to be sure that the source of the data is indeed the researcher, and not an ill-natured man in the middle that wants to influence the markets to his benefit.

Shiny https (based on Shiny Server Open Source) vs. Shiny Pro?

If you are working for a company and manage to convince your boss to buy a license of Shiny Pro, by all means do that. It is a fine product and gives you advantages that go beyond securing the communication with https. The same is true for a subscription to shinyapps.io.

However, if you do not have access to these financial resources, e.g. because you’re in academia or open-source development, and if you are only interested in securing your connection, then this step-by-step guide is for you.

Architecture

This guide uses Apache and Amazon AWS. There are other options to achieve the same thing, but to keep the guide short we do not list them. With a bit of googling, you should be able to adapt this to other scenarios.

Our set-up will look like this:

Architecture

The numbers correspond to the configuration steps we’ll follow in this guide. Specifically:

  1. Set up an AWS EC2 Ubuntu instance. If you haven’t done so, this tutorial tells you how.
  2. Set up Shiny Server Open Source: After this step, you should be able to check your setup with a regular http configuration. How to get there is explained in this tutorial.
  3. Set up AWS Firewall to only allow connections to our https port. This will block direct http access to the Shiny Server
  4. Install an SSL Certificate: Here, we’ll install a free, self-generated certificate. But if you want, you can install a bought SSL certificate that makes sure the user will not get any warnings
  5. Install Apache, which will manage the incoming https connections,
  6. Configure Apache as a reverse proxy, i.e. to translate/forward incoming https connections to the http Shiny Server

If you know your way around AWS and Linux, you’ll be able to finish the entire set-up in about 15 minutes. If this is all news to you, count on spending one or two hours until everything is working properly.

Step-by-step guide

1. Create AWS EC2 Ubuntu instance

Again, see here.

2. Install Shiny Server

If you haven’t done so, check out this post.

3. Block http by configuring firewall

Log into the AWS management console and go to EC2. If you don’t know what the security group of your instance is, go to Instances and select your instance. In the bottom part, you’ll find the Security Group. Click on it. This will get you to the Security Groups. Now, do two things:

  1. in Inbound Rules, remove the 3838 custom rule we had open to access Shiny server over http
  2. add an HTTPS rule

Your security group should look similar to this:

Firewall Settings

AWS Security Group settings: Open the HTTPS port for our Shiny https server

Now, try to access your Shiny Server by typing either

http://ec2-52-59-246-209.eu-central-1.compute.amazonaws.com:3838/

or

http://ec2-52-59-246-209.eu-central-1.compute.amazonaws.com/

Replace the Public IP of your instance, of course. You should get an error page for both cases.

4. Install an SSL certificate

Here, we will create our own SSL certificate. However, for real-world cases, you should instead install a commercially bought SSL certificate. Most commercial CAs provide extensive help on how to install their certificates.

A second point of importance: In a real world scenario, you would secure a domain name that you own. Note, however, that AWS will assign a new IP address when you stop and re-start your instance. So, if you intend to go beyond just trying it out once, then make sure you at least reserve an elastic IP, so you can keep using your certificate even if you need to restart your instance. An elastic IP is an IP address reserved for you, for use on AWS. As IP addresses are scarce, AWS charges you for not using them.

SSH into your instance and perform the following steps.

First, switch to the root user, which avoids typing sudo all the time:

sudo -i

Next, we generate a key:

openssl genrsa -out /etc/ssl/private/apache.key 2048

Finally, we create our SSL certificate, using the key we have just created. Type:

openssl req -new -x509 -key /etc/ssl/private/apache.key -days 365 -sha256 -out /etc/ssl/certs/apache.crt

This will ask you a few questions. The only crucial part is the Common Name: here you need to enter the public DNS name or the public IP of your AWS instance. Again, note that normally you would enter a domain name that you own, e.g. ‘shiny.ipub.com’ in my case. If you are just trying things out, enter the public DNS of your instance.

5. Install Apache

apt-get install apache2
aptitude install -y build-essential
aptitude install -y libapache2-mod-proxy-html libxml2-dev

You should now have apache installed. To install the ssl and proxy modules in apache, run the following command:

a2enmod

This will open a dialog that asks you which modules you would like to install. Type the following:

ssl proxy proxy_ajp proxy_http rewrite deflate headers proxy_balancer proxy_connect proxy_html

6. Configure the Reverse Proxy

The last thing to do is to configure Apache to forward https calls to http port 3838. Here, we do this globally, but if you want to use your Apache server for something else too, you will want to do some reading up on Apache configuration.

Type:

nano /etc/apache2/sites-enabled/000-default.conf

This will open a simple text editor.

Your config file should look like this:

<VirtualHost *:*>
 SSLEngine on
 SSLCertificateFile /etc/ssl/certs/apache.crt
 SSLCertificateKeyFile /etc/ssl/private/apache.key

 ProxyPreserveHost On
 ProxyPass / http://0.0.0.0:3838/
 ProxyPassReverse / http://0.0.0.0:3838/

 ServerName localhost
</VirtualHost>

This does two things:

  1. it turns on SSL for our apache server, pointing to the certificate previously created.
  2. It forwards all incoming https calls to http port 3838, the default Shiny server port

Save by hitting Ctrl+O and Enter.

Finally, you need to restart apache:

service apache2 restart

Test your Shiny https server

Try connecting to your Shiny server by typing:

https://ec2-52-59-246-209.eu-central-1.compute.amazonaws.com/

Of course, you need to replace the public dns name with the one of your instance.
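You can also run a quick check from R with httr (just an illustrative sketch; ssl_verifypeer is disabled only because our certificate is self-signed):

library(httr)

resp <- GET("https://ec2-52-59-246-209.eu-central-1.compute.amazonaws.com/",
            config(ssl_verifypeer = FALSE))  # self-signed certificate, so skip verification
status_code(resp)  # 200 means Apache is forwarding to Shiny Server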

Remember that, with a self-generated SSL certificate, most browsers display a warning. If you insist on proceeding, however, you will see something like this:

And, if you click on the little lock on the left of the address bar, then you’ll see that your communication is encrypted, though the identity is not trusted:

And here we go, you have your Shiny https server! That’s all you need to add encryption to your Shiny server.

In a future post, we’ll add users and passwords to our Shiny https server.

 

The post Shiny https: Securing Shiny Open Source with SSL appeared first on ipub.

To leave a comment for the author, please follow the link and comment on their blog: ipub » R.


Analyzing networks of characters in ‘Love Actually’


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here).

Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This got me wondering how we could visualize the connections quantitatively, based on how often characters share scenes. So last night, while my family was watching the movie, I loaded up RStudio, downloaded a transcript, and started analyzing.

Parsing

It’s easy to use R to parse the raw script into a data frame, using a combination of dplyr, stringr, and tidyr. (For legal reasons I don’t want to host the script file myself, but it’s literally the first Google result for “Love Actually script.” Just copy the .doc contents into a text file called love_actually.txt).

library(dplyr)
library(stringr)
library(tidyr)

raw <- readLines("love_actually.txt")

lines <- data_frame(raw = raw) %>%
    filter(raw != "", !str_detect(raw, "(song)")) %>%
    mutate(is_scene = str_detect(raw, " Scene "),
           scene = cumsum(is_scene)) %>%
    filter(!is_scene) %>%
    separate(raw, c("speaker", "dialogue"), sep = ":", fill = "left") %>%
    group_by(scene, line = cumsum(!is.na(speaker))) %>%
    summarize(speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))

I also set up a CSV file matching characters to their actors, which you can read in separately. (I chose 20 characters that have notable roles in the story).

cast <- read.csv(url("http://varianceexplained.org/files/love_actually_cast.csv"))

lines <- lines %>%
    inner_join(cast) %>%
    mutate(character = paste0(speaker, " (", actor, ")"))

Now we have a tidy data frame with one row per line, along with columns describing the scene number and characters:

[Screenshot: the first rows of the lines data frame]

From here it’s easy to count the lines-per-scene-per-character, and to turn it into a binary speaker-by-scene matrix.

by_speaker_scene <- lines %>%
    count(scene, character)

by_speaker_scene
## Source: local data frame [162 x 3]
## Groups: scene [?]
## 
##    scene                character     n
##    (int)                    (chr) (int)
## 1      2       Billy (Bill Nighy)     5
## 2      2      Joe (Gregor Fisher)     3
## 3      3      Jamie (Colin Firth)     5
## 4      4     Daniel (Liam Neeson)     3
## 5      4    Karen (Emma Thompson)     6
## 6      5    Colin (Kris Marshall)     4
## 7      6    Jack (Martin Freeman)     2
## 8      6       Judy (Joanna Page)     1
## 9      7    Mark (Andrew Lincoln)     4
## 10     7 Peter (Chiwetel Ejiofor)     4
## ..   ...                      ...   ...
library(reshape2)
speaker_scene_matrix <- by_speaker_scene %>%
    acast(character ~ scene, fun.aggregate = length)

dim(speaker_scene_matrix)
## [1] 20 76

Now we can get to the interesting stuff!

Analysis

Whenever we have a matrix, it’s worth trying to cluster it. Let’s start with hierarchical clustering.[1]

norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)

h <- hclust(dist(norm, method = "manhattan"))

plot(h)

[Plot: hierarchical clustering dendrogram of the characters]

This looks about right! Almost all the romantic pairs are together (Natalia/PM; Aurelia/Jamie, Harry/Karen; Karl/Sarah; Juliet/Peter; Jack/Judy) as are the friends (Colin/Tony; Billy/Joe) and family (Daniel/Sam).

One thing this tree is perfect for is giving an ordering that puts similar characters close together:

ordering <- h$labels[h$order]
ordering
##  [1] "Natalie (Martine McCutcheon)" "PM (Hugh Grant)"             
##  [3] "Aurelia (Lúcia Moniz)"        "Jamie (Colin Firth)"         
##  [5] "Daniel (Liam Neeson)"         "Sam (Thomas Sangster)"       
##  [7] "Jack (Martin Freeman)"        "Judy (Joanna Page)"          
##  [9] "Colin (Kris Marshall)"        "Tony (Abdul Salis)"          
## [11] "Billy (Bill Nighy)"           "Joe (Gregor Fisher)"         
## [13] "Mark (Andrew Lincoln)"        "Juliet (Keira Knightley)"    
## [15] "Peter (Chiwetel Ejiofor)"     "Karl (Rodrigo Santoro)"      
## [17] "Sarah (Laura Linney)"         "Mia (Heike Makatsch)"        
## [19] "Harry (Alan Rickman)"         "Karen (Emma Thompson)"

This ordering can be used to make other graphs more informative. For instance, we can visualize a timeline of all scenes:

scenes <- by_speaker_scene %>%
    filter(n() > 1) %>%        # scenes with > 1 character
    ungroup() %>%
    mutate(scene = as.numeric(factor(scene)),
           character = factor(character, levels = ordering))

library(ggplot2)

ggplot(scenes, aes(scene, character)) +
    geom_point() +
    geom_path(aes(group = scene))

[Plot: timeline of scenes, with characters ordered by the clustering]

If you’ve seen the film as many times as I have (you haven’t), you can stare at this graph and the film’s scenes spring out, like notes engraved in vinyl.

One reason it’s good to lay out raw data like this (as opposed to processed metrics like distances) is that anomalies stand out. For instance, look at the last scene: it’s the “coda” at the airport that includes 15 (!) characters. If we’re going to plot this as a network (and we totally are!) we’ve got to ignore that scene, or else it looks like almost everyone is connected to everyone else.

After that, we can create a cooccurrence matrix (see here) containing how many times two characters share scenes:

non_airport_scenes <- speaker_scene_matrix[, colSums(speaker_scene_matrix) < 10]

cooccur <- non_airport_scenes %*% t(non_airport_scenes)

heatmap(cooccur)

[Plot: heatmap of the character cooccurrence matrix]

This gives us a sense of how the clustering in the above graph occurred. We can then use the igraph package to plot the network.

library(igraph)
g <- graph.adjacency(cooccur, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, edge.width = E(g)$weight)

[Plot: network graph of character cooccurrences]

A few patterns pop out of this visualization. We see that the majority of characters are tightly connected (often by the scenes at the school play, or by Karen (Emma Thompson), who is friends or family to many key characters). But we see Bill Nighy’s plotline occurs almost entirely separate from everyone else, and that five other characters are linked to the main network by only a single thread (Sarah’s conversation with Mark at the wedding).

One interesting aspect of this data is that this network builds over the course of the movie, growing nodes and connections as characters and relationships are introduced. There are a few ways to show this evolving network (such as an animation), but I decided to make it an interactive Shiny app, which lets the user specify the scene and shows the network that the movie has built up to that point.
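The app itself is linked below on GitHub; as an illustration only, the core idea can be sketched in a few lines, reusing the speaker_scene_matrix built above (a simplified sketch, not the app’s actual code):

library(shiny)
library(igraph)

shinyApp(
  ui = fluidPage(
    sliderInput("scene", "Show network up to scene:", min = 1,
                max = ncol(speaker_scene_matrix), value = 10),
    plotOutput("network")
  ),
  server = function(input, output) {
    output$network <- renderPlot({
      # rebuild the cooccurrence matrix using only the scenes seen so far
      m <- speaker_scene_matrix[, 1:input$scene, drop = FALSE]
      cooccur <- m %*% t(m)
      g <- graph.adjacency(cooccur, weighted = TRUE,
                           mode = "undirected", diag = FALSE)
      plot(g, edge.width = E(g)$weight)
    })
  }
)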


[Screenshot: the interactive network Shiny app]

(You can view the code for the Shiny app on GitHub).

Data Actually

Have you heard the complaint that we are “drowning in data”? How about the horror stories about how no one understands statistics, and we need trained statisticians as the “police” to keep people from misinterpreting their methods? It sure makes data science sound like important, dreary work.

Whenever I get gloomy about those topics, I try to spend a little time on silly projects like this, which remind me why I learned statistical programming in the first place. It took minutes to download a movie script and turn it into usable data, and within a few hours, I was able to see the movie in a new way. We’re living in a wonderful world: one with powerful tools like R and Shiny, and one overflowing with resources that are just a Google search away.

Maybe you don’t like ‘Love Actually’; you like Star Wars. Or you like baseball, or you like comparing programming languages. Or you’re interested in dating, or hip hop. Whatever questions you’re interested in, the answers are just a search and a script away. If you look for it, I’ve got a sneaky feeling you’ll find that data actually is all around us.

Footnotes

  1. We made a few important choices in our clustering here. First, we normalized so that the number of scenes for each character adds up to 1: otherwise, we wouldn’t be clustering based on a character’s distribution across scenes so much as the number of scenes they’re in. Secondly, we used Manhattan distance, which for a binary matrix means “how many scenes is one of these characters in that the other isn’t”. Try varying these approaches to see how the clusters change!

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.


How to create a Twitter Sentiment Analysis using R and Shiny


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Every time you release a product or service you want to receive feedback from users, so you know what they like and what they don’t. Sentiment analysis can help you with that. I will show you how to create a simple application in R and Shiny to perform Twitter sentiment analysis in real time. I use RStudio.

We will be able to see if they liked our products or not. Also, we will create a wordcloud to find out why they liked it and why not.

First, I will create a Shiny project. To learn how to create Shiny apps you might read this tutorial by Teja Kodali and another tutorial by Aaron Gowins.

Then, in the ui.R file, I put this code:

shinyUI(fluidPage(
  titlePanel("Sentiment Analysis"),      #Title
  textOutput("currentTime"),             #Here, I show a real time clock
  h4("Tweets:"),                         #Sidebar title
  sidebarLayout(
    sidebarPanel(
      dataTableOutput('tweets_table')    #Here I show the users and the sentiment
    ),
    #Show a plot of the generated distribution
    mainPanel(
      plotOutput("distPlot"),            #Here I will show the bars graph
      sidebarPanel(
        plotOutput("positive_wordcloud") #Cloud for positive words
      ),
      sidebarPanel(
        plotOutput("negative_wordcloud") #Cloud for negative words
      ),
      sidebarPanel(
        plotOutput("neutral_wordcloud")  #Cloud for neutral words
      )))))

Here, I will show a title, the current time, a table with Twitter user names, a bar graph and wordclouds. Also, you have to put in your consumer key and secret (replace xxxxxxxxxx). You will have to create an application on the Twitter Developers site and then extract this info.

Now, I will create the server side:

library(shiny) 
library(tm)
library(wordcloud)
library(twitteR)
shinyServer(function(input, output, session) {
setup_twitter_oauth(consumer_key = "xxxxxxxxxxxx", consumer_secret = "xxxxxxxxxxxx") 
token <- get("oauth_token", twitteR:::oauth_cache) #Save the credentials info
token$cache()
output$currentTime <- renderText({invalidateLater(1000, session) #Here I will show the current time
paste("Current time is: ",Sys.time())})
observe({
invalidateLater(60000,session)
count_positive = 0
count_negative = 0
count_neutral = 0
positive_text <- vector()
negative_text <- vector()
neutral_text <- vector()
vector_users <- vector()
vector_sentiments <- vector()
tweets_result = ""
tweets_result = searchTwitter("word-or-expression-to-evaluate") #Here I use the searchTwitter function to extract the tweets
for (tweet in tweets_result){
print(paste(tweet$screenName, ":", tweet$text))
vector_users <- c(vector_users, as.character(tweet$screenName)); #save the user name
if (grepl("I love it", tweet$text, ignore.case = TRUE) == TRUE | grepl("Wonderful", tweet$text, ignore.case = TRUE) | grepl("Awesome", tweet$text, ignore.case = TRUE)){ #if positive words match...
count_positive = count_positive + 1 # Add the positive counts
vector_sentiments <- c(vector_sentiments, "Positive") #Add the positive sentiment
positive_text <- c(positive_text, as.character(tweet$text)) # Add the positive text
} else if (grepl("Boring", tweet$text, ignore.case = TRUE) | grepl("I'm sleeping", tweet$text, ignore.case = TRUE)) { # Do the same for negatives 
count_negative = count_negative + 1
vector_sentiments <- c(vector_sentiments, "Negative")
negative_text <- c(negative_text, as.character(tweet$text))
} else { #Do the same for neutrals
count_neutral = count_neutral + 1
print("neutral")
vector_sentiments <- c(vector_sentiments, "Neutral")
neutral_text <- c(neutral_text, as.character(tweet$text))
}
}
df_users_sentiment <- data.frame(vector_users, vector_sentiments) 
output$tweets_table = renderDataTable({
      df_users_sentiment
    })
output$distPlot <- renderPlot({
  # simple bar chart of the three sentiment counts (reconstructed)
  barplot(c(count_positive, count_negative, count_neutral),
          names.arg = c("Positive", "Negative", "Neutral"))
})
if (length(positive_text) > 0){
  output$positive_wordcloud <- renderPlot({ wordcloud(paste(positive_text, collapse=" "), min.freq = 0, random.color=TRUE, max.words=100, colors=brewer.pal(8, "Dark2")) })
}
if (length(negative_text) > 0) {
  output$negative_wordcloud <- renderPlot({ wordcloud(paste(negative_text, collapse=" "), min.freq = 0, random.color=TRUE, max.words=100, colors=brewer.pal(8, "Dark2")) })
}
if (length(neutral_text) > 0){
  output$neutral_wordcloud <- renderPlot({ wordcloud(paste(neutral_text, collapse=" "), min.freq = 0, random.color=TRUE, max.words=100, colors=brewer.pal(8, "Dark2")) })
}
})
})

Here is a screenshot of the Shiny app we created:

[Screenshot: the sentiment analysis Shiny app]

It’s really simple code, not complex at all. The purpose is just for testing, and so you can practice the R language. If you have questions just let me know.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


R / Shiny Poll Results


(This article was first published on R-Chart, and kindly contributed to R-bloggers)

A few days ago I posted a poll directed towards R (and Shiny) users. Thank you to all who participated for your time and thoughtful responses. An RMarkdown report (code on GitHub) highlights the results to the easy-to-summarize questions. A few interesting insights:

  • 78% of R users use Windows.  The largest group of Windows users uses Windows alone, followed by Windows and Linux, Windows and Mac, and all three platforms.
  • Shiny’s appeal is evident among those polled, as 1/3 report no experience with web technologies. There is a tech-heavy presence, however, since the remaining 2/3 are conversant with web technologies – almost 20% of those polled could be described as “full-stack developers”.
  • Most of the response to the poll occurred over about 3 days… though responses continued to be taken for several more days.

There are a ton of fascinating insights in the free-form responses. The R community remains a diverse, difficult-to-categorize group of individuals who share a common appreciation for R. Shiny has generated a lot of excitement. New developments for the platform largely seem to line up with the interests of the community, who want to see easier development, more interactivity and additional options for deployment.

I had hoped to do additional analysis (and still might) but figured folks would be interested in the results.  I’d be interested to see what other insights folks might derive from the Raw Data in the Google Spreadsheet.

To leave a comment for the author, please follow the link and comment on their blog: R-Chart.


Our R package roundup


(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

A year in review

It’s that time of the year again where one eats too much and gets in a reflective mood! 2015 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly about which R package was the best/neatest/most fun/most useful/most whatever of this year!

Since we are in a festive mood, we decided we would not fight it out but rather present our top five new R packages: a purely subjective list of packages we (and Chuck Norris) approve of.


But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.

Top Popular CRAN packages

Let’s start with some factual data before we go into our personal favourites of 2015. We’ll pull the titles of the new 2015 R packages from cranberries, and parse the CRAN downloads per day using the cranlogs package.

Using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or, it could have the complication that younger package releases are still on the early “hump” part of the downloads (let’s assume they’ll follow a log-normal (exponential decay) distribution, which most of these things do), thus skewing the data in favour of younger releases. I don’t know, and this is an interesting question I think we’ll tackle in a later blog post…

For now, let’s just assume that average downloads per day is a relatively stable metric to gauge package success with. We’ll grab the packages released using rvest:

library(rvest)
library(magrittr)

berries <- read_html("http://dirk.eddelbuettel.com/cranberries/2015/")
titles <- berries %>% html_nodes("b") %>% html_text
new <- titles[grepl("^New package", titles)] %>% 
  gsub("^New package (.*) with initial .*", "\\1", .) %>% unique

and then loop over these titles with pbapply::pblapply(), pulling each package's CRAN download logs and computing its average downloads per day:

library(cranlogs)
library(pbapply)

logs <- pblapply(new, function(x) {
  down <- cran_downloads(x, from = "2015-01-01")$count 
  if(sum(down) > 0) {
    public <- down[which(down > 0)[1]:length(down)]
  } else {
    public <- 0
  }
  return(data.frame(package = x, sum = sum(down), avg = mean(public)))
})

logs <- do.call(rbind, logs)

With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2015, by average number of daily downloads:

[Plot: top 20 new CRAN packages of 2015, by average daily downloads]

The full code is available on GitHub, of course.
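For a rough idea of what that dplyr/ggplot step looks like, here is a minimal sketch using the logs data frame built above (the exact code is in the GitHub repo):

library(dplyr)
library(ggplot2)

top20 <- logs %>%
  filter(avg > 0) %>%
  arrange(desc(avg)) %>%
  slice(1:20)

ggplot(top20, aes(x = reorder(package, avg), y = avg)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "Average downloads per day",
       title = "Top 20 new CRAN packages of 2015")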

As we can see, the main bias does not come from our choice of ranking metric, but from the fact that some packages are more “under the hood” and are pulled in by many packages as dependencies, thus inflating the download statistics.

The top four packages (rversions, xml2, git2r, praise) are all technical packages. Although I have to say I did not know of praise so far, and it looks like it’s a very fun package, indeed: you can automatically add randomly generated praises to your output! Fun times ahead, I’d say.

Excluding these, the clear winners among “frontline” packages are readxl and readr, both packages by Hadley Wickham dealing with importing data into R. Well-deserved, in our opinion. These are packages nearly everybody working with data will need on a daily basis. Although one hopes that contact with Excel sheets is kept to a minimum to ensure one’s sanity, and thus that readxl is needed less often in daily life!

The next two packages (DiagrammeR and visNetwork) relate to network diagrams, something that seems to be en vogue currently. It seems R is getting some much-needed features on these topics.

plotly is the R package for the recently open-sourced, popular plot.ly JavaScript library for interactive charts. A well-deserved top-ranking entry! We also see packages that build on and improve the ever-popular shiny package (DT and shinydashboard), leaflet dealing with interactive mapping, and packages around Stan, the Bayesian statistical inference language (rstan, StanHeaders).

But now, this blog’s authors’ personal top five of new R packages for 2015:

readr

(safferli’s pick)

readr is our package pick that also made it into the top-downloads ranking above. Small wonder, as it’s written by Hadley and aims to make importing data easier and, especially, more consistent. It is thus immediately useful for most, if not all, R users out there, and it also received a tremendous “fame kickstart” from Hadley’s reputation within the R community. For extremely large datasets I still like to use data.table’s fread() function, but for anything else the new read_* functions make your life considerably easier. They’re faster compared to base R, and no longer having to worry about stringsAsFactors alone is a godsend.

Since the package is written by Hadley, it is not only great but also comes with fantastic documentation. If you’re not using readr currently, you should head over to the package readme and check it out.
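As a tiny illustration (the file name is just a placeholder):

library(readr)

# read_csv() is much faster than read.csv() and never turns strings into factors
sales <- read_csv("sales.csv")  # placeholder file name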

infuser

(Yuki’s pick)

R already has many template engines, but this one is simple yet quite useful if you do your data exploration, visualization and statistics in R and deploy your findings in Python, while reusing the same SQL queries with syntax that is as similar as possible.

Code transition from R to Python is now quick and easy with infuser, like this:

# R
library(infuser)
template <- "SELECT {{var}} FROM {{table}} WHERE month = {{month}}"
query <- infuse(template,var="apple",table="fruits",month=12)
cat(query)
# SELECT apple FROM fruits WHERE month = 12
# Python
template = "SELECT {var} FROM {table} WHERE month = {month}"
query = template.format(var="apple",table="fruits",month=12)
print(query)
# SELECT apple FROM fruits WHERE month = 12

googlesheets

(Kirill’s pick)

googlesheets by Jennifer Bryan finally allows me to output directly to Google Sheets, instead of writing xlsx files and then pushing them (mostly manually) to Google Drive. At our company we use Google Drive as a data communication and storage tool for the management, so outputting data science results to Google Sheets is important. We even have some small reports stored in Google Sheets. The package allows for easy creating, finding, filling, and reading of Google Sheets, and is incredibly simple to use.
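A rough sketch of the typical workflow (the sheet name here is made up):

library(googlesheets)

gs_auth()                                    # one-time browser-based Google authentication
ss <- gs_new("kpi-report")                   # create a new (empty) sheet in Google Drive
ss <- gs_edit_cells(ss, input = head(iris))  # write a data frame into it
gs_read(ss)                                  # read it back as a data frame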

AnomalyDetection

(Kirill’s second pick. He gets to pick two since he is so indecisive)

AnomalyDetection was developed by Twitter’s data scientists and introduced to the open source community in the first week of the year. It is a very handy, beautiful, well-developed tool to find anomalies in data. It is very important for a data scientist to be able to find anomalies in the data fast and reliably, before real damage occurs. The package allows you to get a good first impression of the things going on in your KPIs (Key Performance Indicators) and react quickly. Building alerts with it is a no-brainer if you want to monitor your data and assure data quality.
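Getting a first impression is roughly the example from the package README (the package lives on GitHub, not CRAN):

# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

data(raw_data)  # example time series shipped with the package
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                          direction = 'both', plot = TRUE)
res$plot        # the time series with detected anomalies highlighted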

emoGG

(Jess’s pick)

emoGG definitely falls into the “most whatever” category of R packages of the year. What this package does is fairly simple: it allows you to display emojis in your ggplot2 plots, either as plotting symbols or as a background. Under the hood, it adds a geom_emoji layer to your ggplot2 plots, in which you have to specify one or more emoji codes corresponding to the emojis you wish to plot. emoGG can be used to make visualisations more compelling and make plots convey more meaning, no doubt. But before anything else, it’s fun and a must-have for an avid emoji fan like me.
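A quick taste (the emoji code below is the tulip, 1f337):

# devtools::install_github("dill/emoGG")
library(ggplot2)
library(emoGG)

# scatterplot of the iris data with a tulip emoji as the plotting symbol
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_emoji(emoji = "1f337")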

Our R package roundup was originally published by Kirill Pomogajko at Opiate for the masses on December 30, 2015.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.


Write in-line equations in your Shiny application with MathJax


(This article was first published on SAS and R, and kindly contributed to R-bloggers)

I’ve been working on a Shiny app and wanted to display some math equations. It’s possible to use LaTeX to show math using MathJax, as shown in this example from the makers of Shiny. However, by default, MathJax does not allow in-line equations, because the dollar sign is used so frequently. But I needed to use in-line math in my application. Fortunately, the folks who make MathJax show how to enable the in-line equation mode, and the Shiny documentation shows how to write raw HTML. Here’s how to do it.

R

Here I replicated the code from the official Shiny example linked above. The magic code is inserted into ui.R, just below withMathJax().
## ui.R

library(shiny)

shinyUI(fluidPage(
  title = 'MathJax Examples with in-line equations',
  withMathJax(),
  # section below allows in-line LaTeX via $ in mathjax
  # (the standard MathJax tex2jax configuration)
  tags$div(HTML("<script type='text/x-mathjax-config'>
                MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$']]}});
                </script>")),
  helpText('An irrational number $\\sqrt{2}$
           and a fraction $1-\\frac{1}{2}$'),
  helpText('and a fact about $\\pi$: $\\frac2\\pi = \\frac{\\sqrt2}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt2}}2 \\cdot
           \\frac{\\sqrt{2+\\sqrt{2+\\sqrt2}}}2 \\cdots$'),
  uiOutput('ex1'),
  uiOutput('ex2'),
  uiOutput('ex3'),
  uiOutput('ex4'),
  checkboxInput('ex5_visible', 'Show Example 5', FALSE),
  uiOutput('ex5')
))



## server.R
library(shiny)

shinyServer(function(input, output, session) {
  output$ex1 <- renderUI({
    withMathJax(helpText('Dynamic output 1: $\\alpha^2$'))
  })
  output$ex2 <- renderUI({
    withMathJax(
      helpText('and output 2 $3^2+4^2=5^2$'),
      helpText('and output 3 $\\sin^2(\\theta)+\\cos^2(\\theta)=1$')
    )
  })
  output$ex3 <- renderUI({
    withMathJax(
      helpText('The busy Cauchy distribution
               $\\frac{1}{\\pi\\gamma\\,\\left[1 +
               \\left(\\frac{x-x_0}{\\gamma}\\right)^2\\right]}\\!$'))
  })
  output$ex4 <- renderUI({
    invalidateLater(5000, session)
    x <- round(rcauchy(1), 3)
    withMathJax(sprintf("If $X$ is a Cauchy random variable, then
                        $P(X \\leq %.03f ) = %.03f$", x, pcauchy(x)))
  })
  output$ex5 <- renderUI({
    if (!input$ex5_visible) return()
    withMathJax(
      helpText('You do not see me initially: $e^{i \\pi} + 1 = 0$')
    )
  })
})

Give it a try (or check out the Shiny app at https://r.amherst.edu/apps/nhorton/mathjax/)! One caveat is that the other means of in-line display, as shown in the official example, doesn’t work when the MathJax HTML is inserted as above.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on their blog: SAS and R.


Explorable, multi-tabbed reports in R and Shiny


(This article was first published on Tom Hopper » RStats, and kindly contributed to R-bloggers)

Matt Parker recently showed us how to create multi-tab reports with R and jQuery UI. His example was absurdly easy to reproduce; it was a great blog post.

I have been teaching myself Shiny in fits and starts, and I decided to attempt to reproduce Matt’s jQuery UI example in Shiny. You can play with the app on shinyapps.io, and the complete project is up on Github. The rest of this post walks through how I built the Shiny app.

[Screenshot: the multi-tab Shiny app showing the all-states plot]

It’s a demo

The result demonstrates a few Shiny and ggplot2 techniques that will be useful in other projects, including:

  • Creating tabbed reports in Shiny, with different interactive controls or widgets associated with each tab;
  • Combining different ggplot2 scale changes in a single legend;
  • Sorting a data frame so that categorical labels in a legend are ordered to match the position of numerical data on a plot;
  • Borrowing from Matt’s work,
    • Summarizing and plotting data using dplyr and ggplot2;
    • Limiting display of categories in a graph legend to the top n (selectable by the user), with remaining values listed as “other;”
    • Coloring only the top n categories on a graph, and making all other categories gray;
    • Changing line weight for the top n categories on a graph, and making all other lines thinner.

Obtaining the data

As with Matt’s original report, the data can be downloaded from the CDC WONDER database by selecting “Data Request” under “current cases.”

To get the same data that I’ve used, group results by “state” and by “year,” check “incidence rate per 100,000” and, near the bottom, “export results.” Uncheck “show totals,” then submit the request. This will download a .txt tab-delimited data file, which in this app I read in using read_tsv() from the readr package.

Looking at Matt’s example, his “top 5” states look suspiciously like the most populous states. He’s used total count of cases, which will be biased toward more populous states and doesn’t tell us anything interesting. When examining occurrences—whether disease, crime or defects—we have to look at the rates rather than total counts; we can only make meaningful comparisons and make useful decisions from an examination of the rates.

Setup

As always, we need to load our libraries into R. For this example, I use readr, dplyr, ggplot2 and RColorBrewer.
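In code, the setup chunk is simply (shiny itself is loaded as well, since everything runs inside a Shiny app):

library(shiny)
library(readr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)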

The UI

The app generates three graphs: a national total that calculates national rates from the state values; a combined state graph that highlights the top n states, where the user chooses n; and a graph that displays individual state data, where the user can select the state to view. Each goes on its own tab.

ui.R contains the code to create a tabset panel with three tab panels.

tabsetPanel(
  tabPanel("National", fluidRow(plotOutput("nationPlot"))),
  tabPanel("By State",
           fluidRow(plotOutput("statePlot"),
                    wellPanel(
                      sliderInput(inputId = "nlabels",
                                  label = "Top n States:",
                                  min = 1,
                                  max = 10,
                                  value = 6,
                                  step = 1)
                    )
           )
  ),
  tabPanel("State Lookup",
           fluidRow(plotOutput("iStatePlot"),
                    wellPanel(
                      htmlOutput("selectState"))
           )
  )
)

Each panel contains a fluidRow element to ensure consistent alignment of graphs across tabs, and on tabs where I want both a graph and controls, fluidRow() is used to add the controls below the graph. The controls are placed inside a wellPanel() so that they are visually distinct from the graph.

Because I wanted to populate a selection menu (selectInput()) from the data frame, I created the selection menu in server.R and then displayed it in the third tab panel set using the htmlOutput() function.

The graphs

The first two graphs are very similar to Matt’s example. For the national rates, the only change is the use of rates rather than counts.

df_tb <- read_tsv("../data/OTIS 2013 TB Data.txt", n_max = 1069, col_types = "-ciiii?di")

df_tb %>%
  group_by(Year) %>%
  summarise(n_cases = sum(Count), pop = sum(Population), us_rate = (n_cases / pop * 100000)) %>%
  ggplot(aes(x = Year, y = us_rate)) +
  geom_line() +
  labs(x = "Year Reported",
       y = "TB Cases per 100,000 residents",
       title = "Reported Active Tuberculosis Cases in the U.S.") +
  theme_minimal()

The main trick, here, is the use of dplyr to summarize the data across states. Since we can’t just sum or average rates to get the combined rate, we have to sum all of the state counts and populations for each year, and add another column for the calculated national rate.

To create a graph that highlights the top n states, we generate a data frame with one variable, State, that contains the top n states. This is, again, almost a direct copy of Matt’s code with changes to make the graph interactive within Shiny. This code goes inside of the shinyServer() block so that it will update when the user selects a different value for n. Instead of hard-coding n, there’s a Shiny input slider named nlabels. With a list of the top n states ordered by rate of TB cases, df_tb is updated with a new field containing the top n state names and “Other” for all other states.

top_states <- df_tb %>%
      filter(Year == 2013) %>%
      arrange(desc(Rate)) %>%
      slice(1:input$nlabels) %>%
      select(State)

df_tb$top_state <- factor(df_tb$State, levels = c(top_states$State, "Other"))
df_tb$top_state[is.na(df_tb$top_state)] <- "Other"

The plot is generated from the newly-organized data frame. Where Matt’s example has separate legends for line weight (size) and color, I’ve had ggplot2 combine these into a single legend by passing the same value to the “guide =” argument in the scale_XXX_manual() calls. The colors and line sizes also have to be updated dynamically for the selected n.

    df_tb %>%
      ggplot() +
      labs(x = "Year reported",
           y = "TB Cases per 100,000 residents",
           title = "Reported Active Tuberculosis Cases in the U.S.") +
      theme_minimal() +
      geom_line(aes(x = Year, y = Rate, group = State, colour = top_state, size = top_state)) +
      scale_colour_manual(values = c(brewer.pal(n = input$nlabels, "Paired"), "grey"), guide = guide_legend(title = "State")) +
      scale_size_manual(values = c(rep(1,input$nlabels), 0.5), guide = guide_legend(title = "State"))

The last graph is nearly a copy of the national totals graph, except that it is filtered for the state selected in the drop-down menu control. The menu is a selectInput() control.

renderUI({
    selectInput(inputId = "state", label = "Which state?", choices = unique(df_tb$State), selected = "Alabama", multiple = FALSE)
  })

With a state selected, the data is filtered by the selected state and TB rates are plotted.

df_tb %>%
  filter(State == input$state) %>%
  ggplot() +
  labs(x = "Year reported",
       y = "TB Cases per 100,000 residents",
       title = "Reported Active Tuberculosis Cases in the U.S.") +
  theme_minimal() +
  geom_line(aes(x = Year, y = Rate))

Wrap up

I want to thank Matt Parker for his original example. It was well-written, clear and easy to reproduce.

To leave a comment for the author, please follow the link and comment on their blog: Tom Hopper » RStats.


The R-Podcast Episode 15: Introduction to Shiny


(This article was first published on The R-Podcast (Podcast), and kindly contributed to R-bloggers)

Just in time for the new year is a new episode of the R-Podcast! I give a brief introduction to the Shiny package for creating web applications using R code, provide some of my tips and tricks I have learned (sometimes the hard way) when creating applications, and point to excellent resources and example apps in the community that show the immense potential at your fingertips. You will see that r-podcast.org has gotten a major overhaul, and as a consequence the RSS feeds have changed slightly. Be sure to check out the Subscribe page for the updated feeds, but all of the previous episodes have been migrated successfully. As always you can provide your feedback in multiple ways:

  • New Feature: Provide a comment on this episode post directly (powered by the Disqus commenting system)
  • Email the show at thercast[at]gmail.com
  • Use the new Contact Form directly on the site.
  • Leave a voicemail at +1-269-849-9780

Happy New Year and I hope you enjoy the episode!

Direct Download: [mp3 format] [ogg format]

Episode 15 Show Notes

r-podcast.org gets a face lift!

  • Now powered by the awesome Nikola static site generator. Able to write all content using markdown!
  • Potential to use R-Markdown for future content! See Edward Borasky’s excellent tutorial: http://www.znmeb.mobi/stories/blogging-with-rstudio-and-nikola
  • Shout out to Roberto and the rest of the Nikola contributors for helping me fix some key migration issues! Still a few tweaks to go, pardon the dust as I continue to make improvements.
  • Now with SSL support via the Let’s Encrypt initiative, and the certificate is absolutely free!

My shiny development tips

  • Start with the excellent Shiny development portal by RStudio as well as recent webinars
  • Also check Dean Attali’s great tutorial on his blog
  • Shiny UI: Make sure to not have any missing commas or too many commas!
  • On top of the official shiny app gallery, also check out the shiny user showcase as well as showmeshiny.com for great examples.
  • Many shiny functions (such as reactive) allow you to supply R code enclosed in {} as the first parameter. As when writing a regular R function, make sure that you explicitly call the desired result object at the end or use a return call (see the small sketch after this list).
  • Using the sidebar layout is good for apps with a few UI controls and output containers, but my complex apps benefit from the flexibility offered by the grid layout system. See the layout article for more details.
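For instance, a tiny self-contained sketch of that reactive point:

library(shiny)

ui <- fluidPage(
  selectInput("cyl", "Cylinders:", choices = sort(unique(mtcars$cyl))),
  tableOutput("tbl")
)

server <- function(input, output) {
  # the value of a reactive() is whatever its code block evaluates to last
  filtered <- reactive({
    df <- subset(mtcars, cyl == input$cyl)
    df  # make the result the final expression (or use return(df))
  })
  output$tbl <- renderTable(filtered())
}

shinyApp(ui, server)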

Apps that helped me learn the power of Shiny

Keeping up with the Shiny community

New features to watch

R Community Roundup

Building Widgets blog by Kent Russell: Great showcase of converting many different javascript libraries for use in R, many of which are a great fit for Shiny.

Package Pick

News

ggplot2 version 2.0.0 released!
  • “Perhaps the biggest news in this release is that ggplot2 now has an official extension mechanism. This means that others can now easily create their own stats, geoms and positions, and provide them in other packages. This should allow the ggplot2 community to flourish, even as less development work happens in ggplot2 itself. See vignette(“extending-ggplot2”) for details.”
  • Additional details can be seen in the release notes

Feedback

  • Leave a comment on this episode’s post
  • Email the show: thercast[at]gmail.com
  • Use the R-Podcast contact page
  • Leave a voicemail at +1-269-849-9780

Music Credits

To leave a comment for the author, please follow the link and comment on their blog: The R-Podcast (Podcast).
