Channel: Search Results for “shiny”– R-bloggers

Paging Widget for Shiny Apps

(This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers)

In my last post I described how I built a shiny application called “DFaceR”
that used Chernoff Faces to plot multidimensional data. To improve application
response time during plotting, I needed to split large datasets into more
manageable “pages” to be plotted.

Rather than take the path of least resistance and use either numericInput or
sliderInput widgets that come with shiny to interact with paginated data, I
wanted nice page number and prev/next buttons like on a dataTables.js table.

In this post, I describe how I built a custom shiny widget called pager-ui
to achieve this.


Prior to starting, I did some research (via Google) for any preexisting shiny
paging solutions. However, nothing matched what I wanted, so I set forth and
built my own solution using the following framework:

  • a couple of hidden numeric inputs to store the current page and total number
    of pages
  • jquery event handlers bound to change events on the above numeric inputs
  • a javascript function to render page number buttons based on the currently
    selected page and the total number of pages

The solution I deployed with DFaceR used a template javascript file that was updated
with the containing pager-ui element id when the widget was added to ui.R.
Reading, reacting to, and updating the widget required directly accessing the
hidden numeric inputs in server.R.

While this worked, it was not the formal way to build a custom input widget as
described by this shiny developer article.
Thankfully, only a little extra work was needed to build it the correct way.

Custom widget components

According to the developer article, a custom input requires the following:

  • javascript code to
    • define the widget’s behavior (i.e. jQuery event handlers)
    • register an input binding object with shiny
  • an R function to add the HTML for the widget to ui.R

Widget layout

For reference, the (simplified) pager widget HTML is:

<div class="pager-ui">
<!-- Input fields that Shiny will react to -->
<div class="hidden">
<input type="number" class="page-current" value="1" />
<input type="number" class="pages-total" value="10" />
</div>

<div class="page-buttons">
<!-- Everything in here is dynamically rendered via javascript -->

<span class="btn-group">
<button class="btn btn-default page-prev-button">Prev</button>
<button class="btn btn-default page-next-button">Next</button>
</span>

<span class="btn-group">
<!-- button for the current page has class btn-info -->
<button class="btn btn-info page-num-button" data-page-num="1">1</button>

<!-- other buttons have class btn-default -->
<button class="btn btn-default page-num-button" data-page-num="2">2</button>

<!-- ... rest of page num buttons ... -->
</span>

</div><!-- /.page-buttons -->
</div><!-- /.pager-ui -->

As a personal preference, I’ve put the prev and next buttons together in
their own btn-group because it keeps their positions on the page consistent,
rather than having them jump around depending on how many page buttons are rendered.

Also note that the page-num buttons encode their respective page numbers in
data-page-num attributes.

Everything in the div.page-buttons element is rendered dynamically via javascript.

The javascript …

As is probably the case with most widgets for shiny, most of the code for
pager-ui is written in javascript. This is understandable since much of the
user facing interactivity happens in a web browser.

Making the widget behave

The pager widget has the following behaviors:

  • in/de-crease the current page number with next/previous button clicks
  • set the current page number with page number button clicks
  • rerender the buttons as needed when the current page number changes
  • rerender the buttons as needed when the total number of pages changes

To keep things a little DRY, I use a simple object for accessing the specific
pager-ui element being used:

PagerUI = function(target, locate, direction) {
  var me = this;

  me.root = null;
  me.page_current = null;
  me.pages_total = null;

  if (typeof locate !== 'undefined' && locate) {
    if (direction === 'child') {
      me.root = $(target).find(".pager-ui").first();
    } else {
      // default direction is to search parents of target
      me.root = $(target)
        .parentsUntil(".pager-ui")
        .parent(); // traverse to the root pager-ui node
    }
  } else {
    // pager-ui node is explicitly specified
    me.root = $(target);
  }

  if (me.root) {
    me.page_current = me.root.find(".page-current");
    me.pages_total = me.root.find(".pages-total");
  }

  return(me);
};

This takes a selector or jQuery object in target, which either specifies the
pager-ui container directly or provides a starting point (e.g. a child button)
from which to search for the container.

This keeps the event handler callbacks relatively short and easier to maintain.
In total, there are five event handlers which map directly to the behaviors
described above, all of them delegated to the document node of the DOM.

First, one to handle clicks from page-number buttons:

// delegate a click event handler for pager page-number buttons
$(document).on("click", "button.page-num-button", function(event) {
  var $btn = $(event.target);
  var page_num = $btn.data("page-num");
  var $pager = new PagerUI($btn, true);

  $pager
    .page_current
    .val(page_num)
    .trigger("change");
});

Next, a couple to handle clicks from previous and next buttons:

$(document).on("click", "button.page-prev-button", function(event) {
var $pager = new PagerUI(event.target, true);

var page_current = parseInt($pager.page_current.val());

if (page_current > 1) {
$pager
.page_current
.val(page_current-1)
.trigger('change');
}

});

$(document).on("click", "button.page-next-button", function(event) {
  var $pager = new PagerUI(event.target, true);

  var page_current = parseInt($pager.page_current.val());
  var pages_total = parseInt($pager.pages_total.val());

  if (page_current < pages_total) {
    $pager
      .page_current
      .val(page_current + 1)
      .trigger('change');
  }
});

Finally, a couple handlers to catch change events on the hidden numeric fields
and rerender the widget:

// delegate a change event handler for pages-total to draw the page buttons
$(document).on("change", "input.pages-total", function(event) {
  var $pager = new PagerUI(event.target, true);
  pagerui_render($pager.root);
});

// delegate a change event handler for page-current to draw the page buttons
$(document).on("change", "input.page-current", function(event) {
  var $pager = new PagerUI(event.target, true);
  pagerui_render($pager.root);
});

Rendering is done via the pagerui_render() function. It is pretty long, so
check out the source (linked below) for the full details; a rough sketch of the
page-window logic follows the list. In a nutshell it:

  • renders all of the page-number buttons needed for the following cases, using
    ... spacer buttons when necessary:

    • current page is within the first 3 pages
    • current page is within the last 3 pages
    • current page is somewhere in the middle
  • sets the enabled state of the prev and next buttons depending on the currently
    selected page (e.g. the prev button is disabled if the current page is 1).
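
To make the three cases concrete, here is a rough sketch of the windowing logic in R (the actual widget does this in javascript; the function name, window size, and details below are mine, not the source’s):

# sketch: given the current page and total pages, which button labels to draw
page_labels <- function(page_current, pages_total, window = 3) {
  # few enough pages: just show them all
  if (pages_total <= 2 * window) {
    return(as.character(seq_len(pages_total)))
  }
  if (page_current <= window) {
    # current page is within the first pages
    c(as.character(1:(window + 1)), "...", as.character(pages_total))
  } else if (page_current > pages_total - window) {
    # current page is within the last pages
    c("1", "...", as.character((pages_total - window):pages_total))
  } else {
    # current page is somewhere in the middle
    c("1", "...",
      as.character((page_current - 1):(page_current + 1)),
      "...", as.character(pages_total))
  }
}

page_labels(5, 10)
# "1" "..." "4" "5" "6" "..." "10"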

Shiny registration

To fully tie the widget to shiny it needs to be “registered”. This basically
provides a standard interface between the widget and shiny’s core javascript
framework via an input binding object.

The shiny input binding for pager-ui is:

var pageruiInputBinding = new Shiny.InputBinding();
$.extend(pageruiInputBinding, {
  find: function(scope) {
    return( $(scope).find(".pager-ui") );
  },
  // getId: function(el) {},
  getValue: function(el) {
    return({
      page_current: parseInt($(el).find(".page-current").val()),
      pages_total: parseInt($(el).find(".pages-total").val())
    });
  },
  setValue: function(el, value) {
    $(el).find(".page-current").val(value.page_current);
    $(el).find(".pages-total").val(value.pages_total);
  },
  subscribe: function(el, callback) {
    $(el).on("change.pageruiInputBinding", function(e) {
      callback(true);
    });
    $(el).on("click.pageruiInputBinding", function(e) {
      callback(true);
    });
  },
  unsubscribe: function(el) {
    $(el).off(".pageruiInputBinding");
  },
  getRatePolicy: function() {
    return("debounce");
  },

  /**
   * The following two methods are not covered in the developer article, but
   * are documented in the comments in input_binding.js
   */

  initialize: function(el) {
    // called when document is ready using initial values defined in ui.R
    pagerui_render(el);
  },
  receiveMessage: function(el, data) {
    // This is used for receiving messages that tell the input object to do
    // things, such as setting values (including min, max, and others).
    if (data.page_current) {
      $(el).find(".page-current")
        .val(data.page_current)
        .trigger('change');
    }

    if (data.pages_total) {
      $(el).find(".pages-total")
        .val(data.pages_total)
        .trigger('change');
    }
  }
});

Shiny.inputBindings
  .register(pageruiInputBinding, "oddhypothesis.pageruiInputBinding");

Let’s break this apart …

There are nine methods defined in the interface. The first seven are documented
in the developer article,

  • find: locate the widget and return a jQuery object reference to it
  • getValue: return the widget’s value (can be JSON if complex)
  • setValue: not used
  • subscribe: binds event callbacks to the widget, optionally specifying use of
    a rate policy. Note the use of jQuery event namespacing
  • unsubscribe: unbinds event callbacks on the widget – again using jQuery event
    namespacing
  • getRatePolicy: specifies the rate policy to use – either “throttle” or
    “debounce”

and are pretty close to the boilerplate examples with only a few custom changes.

First, the getValue method returns a JSON object with two properties
(page_current and pages_total):

getValue: function(el) {
  return({
    page_current: parseInt($(el).find(".page-current").val()),
    pages_total: parseInt($(el).find(".pages-total").val())
  });
}

This means that when the widget is accessed in server.R via the input object,
it will return a list() with the following structure:

List of 2
 $ page_current: int 1
 $ pages_total : int 4
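
For example, a minimal way to read that value on the server side (this observer is illustrative, not taken from the app):

# server.R (sketch): input$pager is a list with page_current and pages_total
observe({
  ps <- input$pager
  if (!is.null(ps)) {
    cat("showing page", ps$page_current, "of", ps$pages_total, "\n")
  }
})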

Second, event callbacks are subscribed with a “debounce” rate policy:

subscribe: function(el, callback) {
  $(el).on("change.pageruiInputBinding", function(e) {
    callback(true);
  });
  $(el).on("click.pageruiInputBinding", function(e) {
    callback(true);
  });
},
getRatePolicy: function() {
  return("debounce");
}

This prevents excessive callback executions, and subsequent weird behavior, if
the prev and next buttons are clicked too rapidly.

The last two methods of the input binding,

  • initialize
  • receiveMessage

are ones that I added based on documentation I found in shiny’s
source code for input bindings.

The initialize method is called when the document is ready, which I found
necessary to, well, initialize the widget with default values. For this widget,
all that needs to happen is for it to be rendered for the first time.

initialize: function(el) {
  // called when document is ready using initial values defined in ui.R
  pagerui_render(el);
}

The receiveMessage method is used to communicate with the widget from server.R.
In most cases, this will send a data update, but one could imagine other useful
messages that could be sent.

receiveMessage: function(el, data) {
  // This is used for receiving messages that tell the input object to do
  // things, such as setting values (including min, max, and others).
  if (data.page_current) {
    $(el).find(".page-current")
      .val(data.page_current)
      .trigger('change');
  }

  if (data.pages_total) {
    $(el).find(".pages-total")
      .val(data.pages_total)
      .trigger('change');
  }
}

To finish up the input binding, it is registered with:

Shiny.inputBindings
  .register(pageruiInputBinding, "oddhypothesis.pageruiInputBinding");

For good measure, I placed all of the above javascript in an
immediately invoked function expression:

(function(){
// ... code ...
}());

to ensure that I didn’t inadvertently overwrite any variables in the global scope.

All of the above javascript lives in one file that is placed in:

<app>
|- ui.R
|- server.R
|- global.R
+- www/
   +- js/
      +- input_binding_pager-ui.js    <-- here

The R code …

Compared to the javascript code, the R code is fairly simple. There are two
functions:

  • pageruiInput(): to put the widget in the layout, used in ui.R
  • updatePageruiInput(): to update the widget with new data, used in server.R

Generating the HTML for the widget

The widget requires two javascript files:

  • input_binding_pager-ui.js containing all the behavior and shiny input binding code
  • underscore-min.js, a dependency of pagerui_render()

These files only need to be referenced in the app once, regardless of how many
pager-ui widgets are used. Therefore, they are added using singleton() in
the R code:

tagList(
  singleton(
    tags$head(
      tags$script(src = 'js/underscore-min.js'),
      tags$script(src = 'js/input_binding_pager-ui.js')
    )
  ),

  # ... rest of html generation code ...
)

The rest of the HTML generation code follows the layout specified earlier with
special considerations for making the numeric input field ids unique by propagating
the pager-ui id, and setting default numeric values.

pageruiInput = function(inputId, page_current = NULL, pages_total = NULL) {
  # construct the pager-ui framework
  tagList(
    singleton(
      tags$head(
        tags$script(src = 'js/underscore-min.js'),
        tags$script(src = 'js/input_binding_pager-ui.js')
      )
    ),

    # root pager-ui node
    div(
      id = inputId,
      class = 'pager-ui',

      # container for hidden numeric fields
      div(
        class = 'hidden',

        # numeric input to store current page
        tags$input(
          id = paste(inputId, 'page_current', sep='__'),
          class = 'page-current',
          type = 'number',
          value = ifelse(!is.null(page_current), page_current, 1),
          min = 1,
          max = ifelse(!is.null(pages_total), pages_total, 1)
        ),

        # numeric input to store total pages
        tags$input(
          id = paste(inputId, 'pages_total', sep='__'),
          class = 'pages-total',
          type = 'number',
          value = ifelse(!is.null(pages_total), pages_total, 0),
          min = 0,
          max = ifelse(!is.null(pages_total), pages_total, 0)
        )
      ),

      # container for pager button groups
      div(
        class = 'page-buttons',

        # prev/next buttons
        span(
          class = 'page-button-group-prev-next btn-group',
          tags$button(
            id = paste(inputId, 'page-prev-button', sep='__'),
            class = 'page-prev-button btn btn-default',
            'Prev'
          ),
          tags$button(
            id = paste(inputId, 'page-next-button', sep='__'),
            class = 'page-next-button btn btn-default',
            'Next'
          )
        ),

        # page number buttons
        # dynamically generated via javascript
        span(
          class = 'page-button-group-numbers btn-group'
        )
      )
    )
  )
}

To update the widget from server.R there is an updatePageruiInput() function
whose body was effectively copied from other update*() functions that are used
for other inputs (notably text and numeric inputs).

updatePageruiInput = function(
  session, inputId, page_current = NULL, pages_total = NULL) {

  message = shiny:::dropNulls(list(
    page_current = shiny:::formatNoSci(page_current),
    pages_total = shiny:::formatNoSci(pages_total)
  ))

  session$sendInputMessage(inputId, message)
}

Thus, to add a pager-ui widget to a shiny ui:

# ui.R
shinyUI(pageWithSidebar(
  headerPanel(...),
  sidebarPanel(...),
  mainPanel(
    pageruiInput(inputId = 'pager', page_current = 1, pages_total = 1),
    ...
  )
))

On the server side, the value from the widget is accessed by:

# server.R
shinyServer(function(input, output, session) {
  # ...

  pager_state = reactive({
    input$pager
  })

  # ...
})

which will return a list with two elements page_current and pages_total.
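
For instance, here is a sketch of how that state might drive the actual paging; df and page_size are hypothetical stand-ins, not names from the app:

# server.R (sketch): slice a data frame df into pages of page_size rows
page_size <- 10

paged_data <- reactive({
  ps <- input$pager
  if (is.null(ps)) return(NULL)
  first <- (ps$page_current - 1) * page_size + 1
  last  <- min(ps$page_current * page_size, nrow(df))
  df[first:last, , drop = FALSE]
})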

To update the widget from server.R simply call updatePageruiInput() as needed:

# server.R
shinyServer(function(input, output, session) {
  # ...

  observeEvent(
    eventExpr = {
      input$btn_update_page
    },
    handlerExpr = {
      new_page = # ... code to determine new page ...
      updatePageruiInput(session, 'pager', page_current = new_page)
    }
  )

  # ...
})

See for yourself

The source code for a demo shiny app that uses this widget, containing all the
code needed to add it to other apps, is available on Github.

Happy paging.

Written with StackEdit.

20 (!) new R jobs (for 2015-10-21)

This is the bi-monthly R-bloggers post (for 2015-10-21) for new R Jobs.

To get your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

Job seekers: please follow the links below to learn more and apply for your job of interest (if you go to R-users.com you may see all the R jobs that are currently available)

  1. Temporary
    Statistical Analyst (a 4 week period starting 11/6) @ New Jersey
    Kalibrate – Posted by Kalibrate
    Florham Park
    New Jersey, United States
    19 Oct2015
  2. Full-Time
    R Developer @ Seattle or Santa Clarita (over 100K/year)
    Holland America Line – Posted by kparsons
    Seattle
    Washington, United States
    19 Oct2015
  3. Full-Time
    R Ninja @ London
    London Morgannwg – Posted by robert@lonmor.com
    London
    England, United Kingdom
    19 Oct2015
  4. Part-Time
    Data Science Course Mentor – Part Time/Flexible
    SlideRule – Posted by parul
    Anywhere
    16 Oct2015
  5. Full-Time
    Customer Success Representative for RStudio @ Boston
    RStudio – Posted by jclemens1
    Boston
    Massachusetts, United States
    15 Oct2015
  6. Full-Time
    Data scientist/developer @ London
    Prism Financial Products – Posted by prismfp
    London
    England, United Kingdom
    14 Oct2015
  7. Full-Time
    Principal Data Analyst @ Minnesota (over 100K/year)
    Great River Energy – Posted by GRE001
    Maple Grove
    Minnesota, United States
    13 Oct2015
  8. Temporary
    R Developer Fluent in Shiny and ggvis ($100 for ~2 hours gig)
    Simit Patel – Posted by simitpatel
    Anywhere
    12 Oct2015
  9. Full-Time
    Data Scientist @Zürich
    Eos Commerce Ag – Posted by nkolster
    Zürich
    Zürich, Switzerland
    12 Oct2015
  10. Full-Time
    Big Data Analyst @ England
    Hartree Centre – STFC – Posted by gendrin
    England
    United Kingdom
    12 Oct2015
  11. Freelance
    Co-founder to develop a user-friendly Currency Future Options
    Peejay Lal – Posted by optionhk
    Anywhere
    11 Oct2015
  12. Part-Time
    Data Scientist ( @ Israel)
    IoTM Solutions Ltd. – Posted by Globalconect
    Ramat Hasharon
    Tel Aviv District, Israel
    11 Oct2015
  13. Internship
    Internship in the OECD – Statistics for Development (@ Paris)
    OECD/PARIS21 – Posted by tk375
    Paris
    Île-de-France, France
    10 Oct2015
  14. Full-Time
    Stanford Law School Research Fellow
    Stanford Law School – Posted by slsfellow15
    Stanford
    California, United States
    10 Oct2015
  15. Full-Time
    Data Scientist @ London
    Data Reply UK – Posted by crn
    London
    England, United Kingdom
    9 Oct2015
  16. Full-Time
    Data Analyst @ Pittsburgh, Pennsylvania
    Niche.com – Posted by NicheLaura
    Pittsburgh
    Pennsylvania, United States
    9 Oct2015
  17. Full-Time
    Data Engineer @ London
    Data Reply UK – Posted by crn
    London
    England, United Kingdom
    8 Oct2015
  18. Full-Time
    Web Programmer in Milan, Italy
    Quantide – Posted by nicola.sturaro
    Legnano
    Lombardia, Italy
    7 Oct2015
  19. Full-Time
    Statistical Consulting, Teaching and Coordination (@ Leipzig Germany)
    Helmholtz Centre for Environmental Research (UFZ), Leipzig, Germany – Posted by cdislich
    Leipzig
    Sachsen, Germany
    7 Oct2015
  20. Full-Time
    Bioinformatician (@ Mechelen)
    Bluebee – Posted by Karabaliev
    Mechelen
    Vlaanderen, Belgium
    6 Oct2015

 


(you may also look at previous R jobs posts).

Shiny

(This article was first published on ipub » R, and kindly contributed to R-bloggers)

No code for once. This is an independent, introductory article about RStudio’s Shiny web application framework for R. This could be useful for people who don’t know what Shiny is. Or for those of you who want to find out when to use Shiny.

 What does Shiny do?

Shiny is a fine product. It allows you to make your data and research accessible to a non-R audience. This is useful in many situations. For example, if you are a pharma researcher, you can provide your decision makers with the summary data, research and reports they need. Or, if you are an academic researcher, you can publish your research online, in a reproducible manner.

Also, Shiny brings interactivity to R, letting you add selectors, sliders, and other input widgets to your screens.

If you have never seen a Shiny app, make sure to check out the Shiny Gallery.

What is Shiny?

Shiny is an ecosystem, which comprises:

  1. shiny, an R package
  2. Shiny Server, an application Server (in different flavors)

The first one, the R package, itself contains a minimalistic application server as well. So, to get started, all you need is the R package.

How do I build a Shiny app?

No worries, this post contains no code. But there are plenty of introductory examples on the web. For example, I can highly recommend the RStudio Shiny tutorial and this page: http://shiny.rstudio.com/articles/.

How difficult is Shiny?

Even if you are a seasoned R developer, some things will look unfamiliar. Shiny implements a form of event handlers, called reactive elements. These work great and contribute a lot to the Shiny magic. However, reactivity is not a concept common in R, where program flow is typically sequential. If you want to know more about reactivity, see here.

But even if you are a seasoned web developer, Shiny might not be a piece of cake. R syntax was not built for web development, and making the curly braces match is sometimes bloodcurdling.

Also, Shiny relies heavily on JavaScript internally. And even though they tried to hide it as much as possible, this often shines through (I like this pun ;-). So, for anything that goes a bit beyond the very basic examples, be prepared to learn a bit of JavaScript.

Having said this, I promise that anyone who has mastered the 16 parameters of the ols function is intellectually fit to learn Shiny. The API is clean and consistent, the documentation is pretty good, and lots of examples will help you get started quickly.

Alternatives to Shiny

Shiny is the right tool if:

  1. you want to make a single app accessible to a wider audience
  2. if you want to add interactivity to your R code (though there are other ways to do this, e.g. directly in the viewer in RStudio)

Shiny might not be the best choice if:

  1. you are a seasoned web developer
  2. if you want to build an entire web application with multiple connected screens, etc.

If either one is true, a best-of-breed approach might be better:

  1. build the client, server, and database access in RoR (or php, or node.js / AngularJS, or meteor or whatever other web framework you fancy)
  2. build the quantitative code in R, and access it in a stateless fashion through an OpenCPU or rApache API or Azure ML.

Finally, in certain situations you might fancy a hybrid approach:

  1. Build the basic structure of your app in a dedicated web framework (e.g. RoR)
  2. integrate specific Shiny apps in your web application, e.g. using iFrames or by directly reverse proxying through Apache

For more details on these options, refer to this jenunderwood blogpost.

Shiny Server Licenses

Academics and corporations have different budgets, and sometimes different needs. For this reason, RStudio, the company behind Shiny, decided to offer Shiny with different licenses:

  1. an open source version
  2. a commercial version, called Shiny Pro: This version is also targeted at self-hosters. It adds a few enterprise features (such as https, authentication, scalability) and support, starting at roughly 10’000 USD per year.
  3. a hosted cloud solution, called shinyapps.io: With this offering, you do not need to install and manage your own server, worry about backups, scaling out, etc. There are different subscriptions available, starting with the free subscription.

Each has its use cases. Here is my recommendation:

  1. Beginner: if you are new to Shiny, you won’t need any of the above. The bare Shiny R package, available from CRAN will get you started immediately.
  2. Casual User: if you have very few public apps (less than 5), and want to share them with very few people (one at the time), who use them very rarely (less than 25 hours per month), then shinyapps.io FREE is a very convenient solution. Don’t use this in a blog post (as I have done), or your application stops working after a day because you’ll be over the 25 hours limit quickly.
  3. Company: if you work in a company, and if you rely on Shiny for your business, you’ll most certainly want Shiny Server Professional. It lets you keep your apps and data private, in-house, while guaranteeing continuity through RStudio support.
  4. Researcher: if you have many public apps and/or many users, then the Shiny Server Open Source is probably the right choice. It’s not that hard to install it, and hosting on e.g. AWS or at your institution can be very cheap.

The paying shinyapps.io offerings probably also have their use case. But even if you don’t have a lot of hosting experience, deploying Shiny Server Open Source is not very difficult. And even if it does not offer https and password protection out of the box, these can be achieved e.g. with an Apache proxy, as I’ve done before in a few projects.


Free Webinar: Make a Census Explorer with Shiny

(This article was first published on AriLamstein.com » R, and kindly contributed to R-bloggers)

Next Friday at noon PDT I will be hosting a free webinar titled Make a Census Explorer with Shiny. You can register for it here. Space is limited, so if you are interested then I recommend that you reserve your space today.

The content will be very similar to the workshop that I ran at the San Francisco R-Ladies group in July. You can learn more about that workshop here. A lot of people have requested more information about that presentation, so I thought that a free webinar might be a good way to help people. I hope to see you there!

 




Controlling a Remote R Session from a Local One

(This article was first published on librestats » R, and kindly contributed to R-bloggers)

Say you have an Amazon EC2 instance running and you want to be able to control your R session running there from your local R session. At heart, this is not a new idea for the R community. You can already control remote R sessions easily with Shiny or RStudio server, for instance. Well now you can also try the experimental remoter package, available on github. So while this isn’t really tackling an unsolved problem, I think this approach, for better or worse, is unique.

You use it basically just like it says in the readme. The basic idea is to ssh to your remote and start up a server (I recommend using tmux for this, but you could just use fork with something like Rscript -e "remoter::server()" &). Once that’s done, you connect from your local R session with remoter::client("my.remote.address"). This ignores the usual issues, like port forwarding, which of course are relevant.
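
Putting the readme’s two calls side by side (the address below is a placeholder, not a real host):

# on the remote machine, ideally inside a tmux session:
#   Rscript -e "remoter::server()"

# then, from the local R session:
remoter::client("my.remote.address")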

The package uses ZeroMQ by way of pbdZMQ to handle communication in a Request/Reply client/server pattern. There’s also a custom R REPL to automatically handle intercepting the local command, broadcasting it, executing it, and communicating back the return. The REPL design is kind of complicated, because it was actually designed for another purpose entirely, namely managing multiple batch MPI servers from a single client in the pbdCS package. remoter is just pbdCS with all the MPI/multiple server stuff stripped out.

I don’t know if I’d be willing to use remoter in any kind of production environment; it’s more a cool proof of concept than anything really. On the other hand, if you’re into using R on supercomputers, you might want to keep an eye on pbdCS and friends.

Statistical Graphics and Visualization course materials

(This article was first published on Civil Statistician » R, and kindly contributed to R-bloggers)

I’ve just finished teaching the Fall 2015 session of 36-721, Statistical Graphics and Visualization. Again, it is a half-semester course designed primarily for students in the MSP program (Masters of Statistical Practice) in the CMU statistics department. I’m pleased that we also had a large number of students from other departments taking this as an elective.

For software we used mostly R (base graphics, ggplot2, and Shiny). But we also spent some time on Tableau, Inkscape, D3, and GGobi.

We covered a LOT of ground. At each point I tried to hammer home the importance of legible, comprehensible graphics that respect human visual perception.

Remaking pie charts is a rite of passage for statistical graphics students

My course materials are below. Not all the slides are designed to stand alone, but I have no time to remake them right now. I’ll post some reflections separately.

Download all materials as a ZIP file (38 MB), or browse individual files:

Please note:

  • The examples, papers, blogs and researchers linked here are just scratching the surface. I meant no offense to anyone left out. I’ve simply tried to link to blogs, Twitter, and researchers’ websites that are actively updated.
  • I have tried my best to include attribution, citations, and links for all images (besides my own) in the lecture slides. Same for datasets in the R code. Wherever I use scans from a book, I have contacted the authors and do so with their approval (Alberto Cairo, Di Cook, Mark Monmonier, Colin Ware, & Robin Williams). However, if you are the creator or copyright holder of any images here and want them removed or the attribution revised, please let me know and I will comply.
  • Most of the cited books have an Amazon Associates link. If you follow these links and buy something during that visit, I get a small advertising fee (in the form of an Amazon gift card). Each year so far, these fees have totaled under $100 a year. I just spend it on more dataviz books :)

Modelled Territorial Authority GDP for New Zealand

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

A big project at my work that I was involved with over the past year was the production of modelled estimates of gross domestic product at the District and City level. There are official statistics at the Regional Council level (16 of them in New Zealand), so the aim of this project was to get that extra bit of granularity that makes analysis more meaningful.

Publication was on Wednesday 14 October 2015. Writing about the project here would mix up my work and personal personas, but here are some links:

  • Main page including links to everything else such as the methodology, summary document, and web app
  • Interactive web tool, built with R, Shiny and ggvis
  • Source code on GitHub. Unfortunately you can’t run this, as it depends on databases only available in the MBIE environment. It’s published to make the methodology more transparent.

The day after MBIE published these, New Zealand Minister for Economic Development the Hon Steven Joyce launched the Regional Economic Activity Report, which includes some slices of the Modelled Territorial Authority GDP plus many, many other datasets at the Regional and Territorial Authority level. It has an interactive web tool that is very fancy indeed, and a mobile app that gives simpler but still powerful access to the same data. This was another big project at my workplace, with the team I used to manage cleaning and tidying data for a database of over 100 data series that was used under the hood for all the products.

Shiny Developer Conference | Stanford University | January 2016

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

In the three years since we launched Shiny, our focus has been on helping people get started with Shiny. But there’s a huge difference between using Shiny and using it well, and we want to start getting serious about helping people use Shiny most effectively. It’s the difference between having apps that merely work, and apps that are performant, robust, and maintainable.

That’s why RStudio is thrilled to announce the first ever Shiny Developer Conference, to be held at Stanford University on January 30-31, 2016, three months from today. We’ll skip past the basics, and dig into principles and practices that will simultaneously simplify and improve the robustness of your code. We’ll introduce you to some brand new tools we’ve created to help you build ever larger and more complex apps. And we’ll show you what to do if things go wrong.

Check out the agenda to see the complete lineup of speakers and talks.

We’re capping the conference at just 90 people, so if you’d like to level up your Shiny skills, register now at http://shiny2016.eventbrite.com.

Hope to see you there!


Note that this conference is intended for R users who are already comfortable writing Shiny apps. We won’t cover the basics of Shiny app creation at all. If you’re looking to get started with Shiny, please see our tutorial.

11 (new) R jobs from around the world (for 2015-11-02)

This is the bi-monthly R-bloggers post (for 2015-11-02) for new R Jobs.

To post your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

Job seekers: please follow the links below to learn more and apply for your job of interest (if you go to R-users.com you may see all the R jobs that are currently available)

  1. Full-Time
    Operations Analyst @ Durham/England
    Northumbrian Water Group – Posted by Herve Vicente
    Durham
    England, United Kingdom
    2 Nov2015
  2. Full-Time
    Senior Data Scientist for Booking.com @ Amsterdam
    Booking.com – Posted by Booking.com
    Amsterdam
    Noord-Holland, Netherlands
    2 Nov2015
  3. Full-Time
    Data Scientist for Booking.com @ Amsterdam
    Booking.com – Posted by Booking.com
    Amsterdam
    Noord-Holland, Netherlands
    2 Nov2015
  4. Full-Time
    Quantitative Analyst/Full Stack Engineer @ London (over 100K/year)
    lekaninfinite
    London
    England, United Kingdom
    2 Nov2015
  5. Full-Time
    Data Scientist – Biotechnology/Biostatistics ( @ Washington)
    Origent Data Sciences – Posted by mkeymer@origent.com
    Washington
    District of Columbia, United States
    29 Oct2015
  6. Full-Time
    Data Science Lead @ Missouri
    Monsanto – Posted by Casey
    Saint Louis
    Missouri, United States
    29 Oct2015
  7. Freelance
    R Shiny/ R Markdown Developer
    Iowa Soybean Association – Posted by ISA Analytics
    Anywhere
    29 Oct2015
  8. Full-Time
    Advanced Clinical Analytics Consultant @ Columbia
    Anthem, Inc. for Resolution Health, Inc. – Posted by STATETR
    Columbia
    Maryland, United States
    28 Oct2015
  9. Full-Time
    Data Scientist for VisionMobile
    VisionMobile Ltd – Posted by visionmobile
    Anywhere
    27 Oct2015
  10. Full-Time
    Research Associate / Data Scientist @ Germany
    Hochschule Osnabrück – Posted by gerhi
    Osnabrück
    Niedersachsen, Germany
    23 Oct2015
  11. Full-Time
    Statistical Learning Scientist @ Seattle
    Axio Research, LLC – Posted by dnadave
    Seattle
    Washington, United States
    22 Oct2015

 

 


(you may also look at previous R jobs posts).

What’s the probability that a significant p-value indicates a true effect?

(This article was first published on Nicebread » R, and kindly contributed to R-bloggers)

If the p-value is < .05, then the probability of falsely rejecting the null hypothesis is < 5%, right? That means, a maximum of 5% of all significant results is a false-positive (that’s what we control with the α rate).

Well, no. As you will see in a minute, the “false discovery rate” (aka. false-positive rate), which indicates the probability that a significant p-value actually is a false-positive, usually is much higher than 5%.

A common misconception about p-values

Oakes (1986) asked the following question to students and senior scientists:

You have a p-value of .01. Is the following statement true, or false?

You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

The answer is “false” (you will learn why it’s false below). But 86% of all professors and lecturers in the sample who were teaching statistics (!) answered this question erroneously with “true”. Gigerenzer, Krauss, and Vitouch replicated this result in 2000 in a German sample (here, the “statistics lecturer” category had 73% wrong). Hence, it is a widespread error to confuse the p-value with the false discovery rate.

The False Discovery Rate (FDR) and the Positive Predictive Value (PPV)

To answer the question “What’s the probability that a significant p-value indicates a true effect?”, we have to look at the positive predictive value (PPV) of a significant p-value. The PPV indicates the proportion of significant p-values which indicate a real effect amongst all significant p-values. Put in other words: Given that a p-value is significant: What is the probability (in a frequentist sense) that it stems from a real effect?

(The false discovery rate simply is 1-PPV: the probability that a significant p-value stems from a population with null effect).

That is, we are interested in a conditional probability Prob(effect is real | p-value is significant).
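
In formula form (this is the standard computation from Ioannidis, 2005, cited below; here $\pi$ is the prior proportion of true hypotheses, $1-\beta$ the power, and $\alpha$ the significance level; the p-hacking adjustment discussed later is left out):

$$\mathrm{PPV} = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}, \qquad \mathrm{FDR} = 1 - \mathrm{PPV}$$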
Inspired by Colquhoun (2014), one can visualize this conditional probability in the form of a tree diagram (see below). Let’s assume we carry out 1000 experiments for 1000 different research questions. We now have to make a couple of prior assumptions (which you can make differently in the app we provide below). For now, we assume that 30% of all studies have a real effect and the statistical test used has a power of 35% with an α level set to 5%. That is, of the 1000 experiments, 300 investigate a real effect, and 700 a null effect. Of the 300 true effects, 0.35*300 = 105 are detected, the remaining 195 effects are non-significant false-negatives. On the other branch of 700 null effects, 0.05*700 = 35 p-values are significant by chance (false positives) and 665 are non-significant (true negatives).

This path is visualized here (completely inspired by Colquhoun, 2014):

[Tree diagram: 1000 experiments split into 300 true and 700 null effects, then into significant and non-significant results]

 

Now we can compute the false discovery rate (FDR): 35 of (35+105) = 140 significant p-values actually come from a null effect. That means, 35/140 = 25% of all significant p-values do not indicate a real effect! That is much more than the alleged 5% level.
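
These numbers can be checked directly in R:

prior <- 0.30   # proportion of investigated hypotheses that are true
power <- 0.35
alpha <- 0.05

ppv <- (power * prior) / (power * prior + alpha * (1 - prior))
1 - ppv   # FDR = 0.25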

An interactive app

Together with Michael Zehetleitner I developed an interactive app that computes and visualizes these numbers. For the computations, you have to choose 4 parameters.

Let’s go through the settings!

 

Some of our investigated hypotheses are actually true, and some are false. As a first parameter, we have to estimate what proportion of our investigated hypotheses is actually true.

Now, what is a good setting for the a priori proportion of true hypotheses? It’s certainly not near 100% – in this case only trivial and obvious research questions would be investigated, which is obviously not the case. On the other hand, the rate can definitely drop close to zero. For example, in pharmaceutical drug development “only one in every 5,000 compounds that makes it through lead development to the stage of pre-clinical development becomes an approved drug” (Wikipedia). Here, only 0.02% of all investigated hypotheses are true.

Furthermore, the number depends on the field – some fields are highly speculative and risky (i.e., they have a low prior probability), some fields are more cumulative and work mostly on variations of established effects (i.e., in these fields a higher prior probability can be expected).

But given that many journals in psychology exert a selection pressure towards novel, surprising, and counter-intuitive results (which a priori have a low probability of being true), I guess that the proportion is typically lower than 50%. My personal grand average gut estimate is around 25%.

(Also see this comment and this reply for a discussion about this estimate).

 

That’s easy. The default α level usually is 5%, but you can play with the impact of stricter levels on the FDR!

 

The average power in psychology has been estimated at 35% (Bakker, van Dijk, & Wicherts, 2012). A median estimate for neuroscience is only 21% (Button et al., 2013). Even worse, both estimates can be expected to be inflated, as they are based on the average published effect size, which almost certainly is overestimated due to the significance filter (Ioannidis, 2008). Hence, the average true power is most likely smaller. Let’s assume an estimate of 25%.

 

Finally, let’s add some realism to the computations. We know that researchers employ “researchers degrees of freedom”, aka. questionable research practices, to optimize their p-value, and to push a “nearly significant result” across the magic boundary. How many reported significant p-values would not have been significant without p-hacking? That is hard to tell, and probably also field dependent. Let’s assume that 15% of all studies are p-hacked, intentionally or unintentionally.

When these values are defined, the app computes the FDR and PPV and shows a visualization:

[Screenshot: the app’s FDR/PPV visualization with these settings]

With these settings, only 39% of all significant studies are actually true!

Wait – what was the success rate of the Reproducibility Project: Psychology? 36% of replication projects found a significant effect in a direct replication attempt. Just a coincidence? Maybe. Maybe not.

The formulas to compute the FDR and PPV are based on Ioannidis (2005: “Why most published research findings are false“). A related, but different approach was proposed by David Colquhoun in his paper “An investigation of the false discovery rate and the misinterpretation of p-values” [open access]. He asks: “How should one interpret the observation of, say, p=0.047 in a single experiment?”. The Ioannidis approach implemented in the app, in contrast, asks: “What is the FDR in a set of studies with p <= .05 and a certain power, etc.?”. Both approaches make sense, but answer different questions.

See also Daniel Lakens’ blog post about the same topic, and the interesting discussion below it.


Literacy in India – A deepR dive

(This article was first published on Giga thoughts ... » R, and kindly contributed to R-bloggers)

You can do magic!
You can have anything,
That you desire
Magic…
You can do magic – song by America (1982)

That is exactly how I feel when I write code in R. A few lines of R and, lo and behold, hundreds of rows and columns are magically transformed into easily understandable graphs, regression curves or choropleth maps. (By the way, the song is really cool! Listen to it if you have not heard it before). You really can do magic with R.

In this post I do a deep dive into literacy in India. The dataset, taken from the Open Government Data (OGD) platform India, is based on the 2001 census. Though the data is a little dated, it is extremely rich with literacy details across different age groups and all Indian states. The data includes the total number of persons/males/females who are in primary, middle, matric, college, technical diploma, non-technical diploma and so on. In fact the data also includes the educational background of people in the districts of each state. I slice and dice the data across multiple parameters. I have created an interactive Shiny app which will provide very detailed visualizations based on the parameters chosen.

Do try out my interactive Shiny app : IndiaLiteracy

The entire code for this app is on GitHub. Feel free to download/clone/fork/modify or enhance the code – literacyInIndia

For analyzing such a rich data set as the Census data of 2001, I create 4 tabs (a minimal UI sketch follows the list):
1) State Literacy
2) Educational Levels vs Age
3) India Literacy and
4) District Literacy
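
As a rough sketch, such a layout can be assembled with shiny’s tabsetPanel(); the output names and overall structure here are my assumptions, not the app’s actual code:

# ui.R (sketch): a four-tab layout like the one described above
library(shiny)

shinyUI(fluidPage(
  tabsetPanel(
    tabPanel("State Literacy", plotOutput("statePlot")),
    tabPanel("Educational Levels vs Age", plotOutput("agePlot")),
    tabPanel("India Literacy", plotOutput("indiaMap")),
    tabPanel("District Literacy", plotOutput("districtMap"))
  )
))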

Here are the details of these 4 tabs in my Shiny app

A) State Literacy
This tab provides the age-wise distribution of people (Persons/Males/Females) who attend educational institutions. This is shown as a barplot, which also includes the national average. The plot below is for the whole of India.

Next, the distribution of females attending educational institutions in the state of Haryana is shown, again with the national average included. As can be seen, there are options for (Total/Urban/Rural) against (Persons/Males/Females), and for whether these people attend educational institutions, are illiterate, or are literate.

I also have another option under ‘Who’, namely ‘All’, which will plot the age-wise distribution of males/females/persons in urban/rural areas or the entire state.

B. Educational Institutions vs Age plot

This plot displays the educational institutions attended by people in a particular age group. For example, in the state of Orissa for the 18-year age group, we can see that there are persons in (Primary, Matric, Higher Secondary, Non-Technical Diploma and Technical Diploma). The bar length for each color is the percentage of the total persons at that level of education.

C. Literacy across India
This tab plots a choropleth map for a region (Urban+Rural, Urban, Rural), Who (Persons, Males, Females) and the literacy level (attending educational institutions, primary, higher secondary, matric etc.) across the whole of India.

D. Literacy within a state
This tab plots a choropleth map of literacy in the districts of a state. A sample plot for Karnataka is shown below.

E. Key observations

There is a wealth of insights you can glean by looking at the various charts. Here are a few insights from my initial observations:
1) The literacy in Kerala across ages is higher than the national average, while in Bihar it is less than the national average.

a) Kerala

b) Bihar

2) In Rajasthan, the percentage of males attending educational institutions is higher than the national average, while for females it is less than the national average. The situation is reversed in Chandigarh, where the percentage of females attending educational institutions is higher than both the national average and that of the males.

a) Rajasthan

b) Chandigarh

3) When we look at the number of persons attending educational institutions across India, the north-eastern states lead, with Manipur, Nagaland and Sikkim in the top 3.

We have heard that Kerala is the most literate state. But it looks like Manipur, Nagaland and Sikkim actually edge Kerala out. If we look at the State Literacy chart for Kerala and Manipur this becomes clearer.

a) Kerala

b) Manipur

It can be seen that in Manipur the number of persons attending educational institutions in the age range 13-24 years is much higher than the national average, and much higher than in Kerala.

4) If we take a look at the district-wise literacy for the state of Bihar, we see that literacy is lower in the north-eastern districts.

5) Here is another interesting observation I made: the top 3 states which are most ‘literate with no education’ are i) Rajasthan ii) Madhya Pradesh iii) Chhattisgarh.

While I have included several charts with accompanying explanation, this is largely unnecessary, as most of the charts are self-explanatory.

Do try out the Shiny app and see for yourself the literacy in each state/district/age group educational  level etc – IndiaLiteracy

Feel free to clone/fork my code and make your own enhancements –literacyInIndia

1.  Natural Language Processing: What would Shakespeare say?
2. Introducing cricketr! : An R package to analyze performances of cricketers
3. Revisiting crimes against women in India
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Re-working the Lucy-Richardson Algorithm in OpenCV
6.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
7.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
8. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
9. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
10.  Simulating an Edge Shape in Android

Building Interactive Maps with Leaflet

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Leaflet is a JavaScript library for building interactive maps. RStudio released a package that allows us to build these maps in R! You can do some really cool things with Leaflet, and I will demonstrate a few of those below. Leaflet is compatible with Shiny apps and R Markdown documents.

As mentioned on the RStudio page, the basic steps to create a Leaflet map are:

1. Create a map widget using leaflet()
2. Add layers to the map using addTiles(), addMarkers(), etc.
3. Print the map.

Okay, let’s get started. We will need the leaflet and magrittr packages for this.

Let’s create a basic map centered at San Francisco. First, we create a widget by calling leaflet(). Then, we add tiles to the map using addTiles(). By default, it uses Open Street Map tiles. If you would like to use other tiles, please refer here. We can set the desired longitude and latitude that we want the map to be centered at, and also set the zoom level, using the setView() function. The addMarkers() function allows us to add a marker at our desired location, with its own popup message!

library(leaflet)
library(magrittr)

SFmap <- leaflet() %>% 
  addTiles() %>% 
  setView(-122.42, 37.78, zoom = 13) %>% 
  addMarkers(-122.42, 37.78, popup = 'Bay Area')
SFmap

This gives the following map:
[Map: OpenStreetMap view of San Francisco with a blue marker]
If you click the blue marker, you will see a small popup with the text ‘Bay Area’. There are a variety of options to customize the marker. For example, addCircleMarkers() lets you use circle shaped markers instead of the default ones. You can even add your own icons to use as markers (more info here and here)!

The above map with a circle marker can be built as follows:

SFmap <- leaflet() %>% 
  addTiles() %>% 
  setView(-122.42, 37.78, zoom = 13) %>% 
  addCircleMarkers(-122.42, 37.78, popup = 'Bay Area', radius = 5, color = 'red')

and looks likes this:
[Map: the same view with a red circle marker]

Let’s plot an interactive map showing the location of incidents reported to the San Francisco Police Department. The data is available here. Let’s read in the data and see what it looks like.

SFdata <- read.csv('SFPD_Incidents_-_Current_Year__2015_.csv')
head(SFdata)
  IncidntNum               Category                                              Descript DayOfWeek       Date  Time PdDistrict      Resolution                Address         X        Y
1  150927700               BURGLARY           BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY  Thursday 10/22/2015 23:59    CENTRAL            NONE 900 Block of SUTTER ST -122.4160 37.78823
2  150927700 SEX OFFENSES, FORCIBLE                                        SEXUAL BATTERY  Thursday 10/22/2015 23:59    CENTRAL            NONE 900 Block of SUTTER ST -122.4160 37.78823
3  150926570                ASSAULT                                               BATTERY  Thursday 10/22/2015 23:45   RICHMOND            NONE   100 Block of 19TH AV -122.4787 37.78508
4  150925312        STOLEN PROPERTY STOLEN PROPERTY, POSSESSION WITH KNOWLEDGE, RECEIVING  Thursday 10/22/2015 23:40   SOUTHERN JUVENILE BOOKED    0 Block of SPEAR ST -122.3949 37.79311
5  156262768          LARCENY/THEFT                               PETTY THEFT OF PROPERTY  Thursday 10/22/2015 23:40   SOUTHERN            NONE     MARKET ST / 6TH ST -122.4103 37.78223
6  150925312                ROBBERY                ROBBERY, ARMED WITH A DANGEROUS WEAPON  Thursday 10/22/2015 23:40   SOUTHERN JUVENILE BOOKED    0 Block of SPEAR ST -122.3949 37.79311

(There is another location column at the end which I omitted here as it is just a concatenation of the X and Y columns.)

Now let us look at how many incidents of each category were reported.

table(SFdata$Category)
                      ARSON                     ASSAULT                  BAD CHECKS                     BRIBERY                    BURGLARY          DISORDERLY CONDUCT DRIVING UNDER THE INFLUENCE 
                        255                       10711                          26                          59                        4749                         396                         360 
              DRUG/NARCOTIC                 DRUNKENNESS                EMBEZZLEMENT                   EXTORTION             FAMILY OFFENSES      FORGERY/COUNTERFEITING                       FRAUD 
                       3315                         473                         115                          25                          54                         561                        2523 
                   GAMBLING                  KIDNAPPING               LARCENY/THEFT                 LIQUOR LAWS                   LOITERING              MISSING PERSON                NON-CRIMINAL 
                         22                         286                       34683                         133                          24                        3755                       15360 
             OTHER OFFENSES     PORNOGRAPHY/OBSCENE MAT                PROSTITUTION                     ROBBERY                     RUNAWAY             SECONDARY CODES      SEX OFFENSES, FORCIBLE 
                      16143                           4                         261                        3054                         124                        1676                         687 
 SEX OFFENSES, NON FORCIBLE             STOLEN PROPERTY                     SUICIDE              SUSPICIOUS OCC                        TREA                    TRESPASS                   VANDALISM 
                         15                         777                          64                        4367                           1                        1093                        6309 
              VEHICLE THEFT                    WARRANTS                 WEAPON LAWS 
                       6361                        5378                        1346 

To ensure that the map is plotted quickly, I am going to subset the data to only include observations pertaining to bribery or suicide.

data <- subset(SFdata, Category == 'BRIBERY' | Category == 'SUICIDE')

Now, let us plot all these incidents on the map!

SFMap <- leaflet() %>% 
  addTiles() %>% 
  setView(-122.42, 37.78, zoom = 13) %>% 
  addMarkers(data = data, lng = ~ X, lat = ~ Y, popup = data$Category)
SFMap

The map looks like this:
[Interactive map with markers for the bribery and suicide incidents]
Here, clicking on each marker will give a popup showing whether the incident which occurred at that particular location was bribery or suicide.

A lot of the markers are clumped together rather closely. We can cluster them together by specifying clusterOptions as follows:

SFMap <- leaflet() %>% 
  addTiles() %>% 
  setView(-122.42, 37.78, zoom = 13) %>% 
  addCircleMarkers(data = data, lng = ~ X, lat = ~ Y, radius = 5, 
                   color = ~ ifelse(Category == 'BRIBERY', 'red', 'blue'),
                   clusterOptions = markerClusterOptions())

which will give the following map:
[Interactive map with clustered circle markers]
The number inside each circle is the total number of incidents clustered in that area: clusters with more incidents are colored yellow, and clusters with fewer incidents are colored green. When you click on a cluster, the map automatically zooms into that area and either splits into smaller clusters or shows the individual incidents, depending on the zoom level. The individual circle markers are red for bribery incidents and blue for suicide incidents.
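Since the cluster colors and the red/blue marker colors both carry meaning, it can help to spell them out on the map itself with addLegend(). A minimal sketch, piping in the SFMap object from above:

SFMap %>% 
  addLegend(position = 'bottomright', 
            colors = c('red', 'blue'), 
            labels = c('Bribery', 'Suicide'), 
            title = 'Incident type')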

That brings us to the end of the article. I hope you enjoyed it. If you have any questions/feedback, please feel free to leave a comment or reach out to me on Twitter.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


Plotting Russian AiRstRikes in SyRia


(This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers)

“Who do we think will rise if Assad falls?”
“Do we have a “government in a box” that we think we can fly to Damascus and put into power if the Syrian army collapses, the regime falls and ISIS approaches the capital?”
“Have we forgotten the lesson of “Animal Farm”? When the animals revolt and take over the farm, the pigs wind up in charge.” 
Patrick J. Buchanan

In my new book, “Mastering Machine Learning with R”, I wanted to include geo-spatial mapping in the chapter on cluster analysis.  I actually completed the entire chapter doing a cluster analysis on the Iraq Wikileaks data, plotting the clusters on a map and building a story around developing an intelligence estimate for the Al-Doura Oil Refinery, which I visited on many occasions during my 2009 “sabbatical”.  However, the publisher convinced me that the material was too sensitive for such a book and I totally re-wrote the analysis with a different data set.  I may or may not publish it on this blog at some point, but I want to continue to explore building maps in R.  As luck would have it, I stumbled into a data set showing the locations of Russian airstrikes in Syria at the following site:

http://russia-strikes-syria.silk.co/

The data includes the latitude and longitude of the strikes along with other background information. The what, how and why the data was collected is available here:

https://www.bellingcat.com/news/mena/2015/10/26/what-russias-own-videos-and-maps-reveal-about-who-they-are-bombing-in-syria/

In short, the site tried to independently verify locations, targets, etc., and includes what they claim are the reported versus actual strike locations.  When I pulled the data there were 60 strikes analyzed by the site.  They were unable to determine the locations of 11 of the strikes, so we have 49 data points.

I built the data in Excel and put it in a .csv, which I’ve already loaded.  Here is the structure of the data.

> str(airstrikes)
'data.frame':  120 obs. of  4 variables:
 $ Airstrikes   : chr  "Strike 1" "Strike 10" "Strike 11" "Strike 12" ...
 $ Lat          : chr  "35.687782" "35.725846" "35.734952" "35.719518" ...
 $ Long         : chr  "36.786667" "36.260419" "36.073837" "36.072385" ...
 $ real_reported: chr  "real" "real" "real" "real" ...
> head(airstrikes)
  Airstrikes       Lat      Long real_reported
1   Strike 1 35.687782 36.786667          real
2  Strike 10 35.725846 36.260419          real
3  Strike 11 35.734952 36.073837          real
4  Strike 12 35.719518 36.072385          real
5  Strike 13 35.309074 36.620506          real
6  Strike 14 35.817206 36.124503          real
> tail(airstrikes)
    Airstrikes       Lat      Long real_reported
115  Strike 59 35.644864 36.338568      reported
116   Strike 6 35.740134 36.247029      reported
117  Strike 60  36.09346 37.085198      reported
118   Strike 7 35.702113 36.563525      reported
119   Strike 8 35.822472 36.018779      reported
120   Strike 9 35.725846 36.260419      reported

Since lat and long are character, I need to change them to numeric and also keep a subset of data of the actual/real strike locations.

> airstrikes$Lat = as.numeric(airstrikes$Lat)
Warning message:
NAs introduced by coercion
> airstrikes$Long = as.numeric(airstrikes$Long)
Warning message:
NAs introduced by coercion

> real = subset(airstrikes, airstrikes$real_reported == "real")

I will be using ggmap for this effort, pulling in Google Maps for plotting.

> library(ggmap)
Loading required package: ggplot2
Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
> citation('ggmap')
To cite ggmap in publications, please use:
  D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The
  R Journal, 5(1), 144-161. URL
  http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

The first map will be an overall view of the country with the map type as "terrain".  Note that "satellite", "hybrid" and "roadmap" are also available.

> map1 = ggmap(
   get_googlemap(center="Syria", zoom=7, maptype="terrain"))

With the map created as object "map1", I plot the locations using geom_point().

> map1 + geom_point(
   data = real, aes(x = Long, y = Lat), pch = 19, size = 6, col="red3")


With the exception of what looks like one strike near Ar Raqqah, we can see they are concentrated between Aleppo and Homs with some close to the Turkish border.  Let’s have a closer look at that region.

> map2 = ggmap(
   get_googlemap(center="Ehsim, Syria", zoom=9, maptype="terrain"))

> map2 + geom_point(data = real, aes(x = Long, y = Lat),
  pch = 18, size = 9, col="red2")

East of Ghamam is a large concentration, so let’s zoom in on that area and add the strike number as labels.

> map3 = ggmap(
   get_googlemap(center="Dorien, Syria", zoom=13, maptype="hybrid"))

> map3 + geom_point(
   data = real, aes(x = Long, y = Lat), pch = 18, size = 9, col="red3") +
   geom_text(data=real, aes(x=Long, y=Lat, label=Airstrikes),
   size = 5, vjust = 0, hjust = -0.25, color="white")

The last thing I want to do is focus in on the site for Strike 28.  To do this we will require the lat and long, which we can find with the which() function.

> which(real$Airstrikes == "Strike 28")
[1] 21
> real[21,]
   Airstrikes      Lat     Long real_reported
21  Strike 28 35.68449 36.11946          real

It is now just a simple matter of using those coordinates for calling up the google map.

> map4 = ggmap(
   get_googlemap(center=c(lon=36.11946, lat=35.68449), zoom=17, maptype="satellite"))

> map4 + geom_point(
   data = real, aes(x = Long, y = Lat), pch = 22, size = 12, col="red3") +
   geom_text(data=real, aes(x=Long, y=Lat, label=Airstrikes),
   size = 9, vjust = 0, hjust = -0.25, color="white")

From the looks of it, this seems to be an isolated location, so it was probably some sort of base or logistics center.  If you’re interested, the Russian Ministry of Defense posts videos of these strikes and you can see this one on YouTube.

https://www.youtube.com/watch?v=Ape5grS9MEM
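One more idea before wrapping up: since the full airstrikes data frame also holds the reported locations, you can color the points by the real_reported column to see how far the reported strikes drift from the verified ones. A sketch (colors are assigned in factor-level order, "real" then "reported", and the 11 rows with unknown coordinates are dropped with a warning):

> map1 + geom_point(
   data = airstrikes, aes(x = Long, y = Lat, col = real_reported),
   pch = 19, size = 4) +
   scale_colour_manual(values = c("red3", "yellow"))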

OK, so that is a quick tutorial on using ggmap, a very powerful package.  We’ve just scratched the surface of what it can do.  I will continue to monitor the site for additional data.  Perhaps publish a Shiny app if the data is large and “rich” enough.

Cheers,

CL

To leave a comment for the author, please follow the link and comment on their blog: Fear and Loathing in Data Science.


Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio


(This article was first published on SparkIQ Labs Blog » R, and kindly contributed to R-bloggers)


Introduction

In this blog post, we shall learn how to launch a standalone Spark cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) for analysis of Big Data. This is a continuation of our previous blog post, which showed us how to download Apache Spark and start SparkR locally on Windows OS and RStudio.

We shall use Spark 1.5.1 (released on October 02, 2015), which has a spark-ec2 script that is used to install standalone Spark on AWS EC2.  A nice feature of this spark-ec2 script is that it installs RStudio server as well. This means that you don’t need to install RStudio server separately, so you can start working with your data immediately after Spark is installed.

Prerequisites

  • You should have already downloaded Apache Spark onto your local desktop from the official site. You can find instructions on how to do so in our previous post.
  • You should have an AWS account, created secret access key(s) and downloaded your private key pair as a .pem file. Find instructions on how to create your access keys here and to download your private keys here.
  • We will launch the clusters through Bash shell on Linux. If you are using Windows OS I recommend that you install and use the Cygwin terminal (It provides functionality similar to a Linux distribution on Windows)

Launching Apache Spark on AWS EC2

We shall use the spark-ec2 script, located in Spark’s ec2 directory, to launch, manage and shut down Spark clusters on Amazon EC2. It will set up Spark, HDFS, Tachyon and RStudio on your cluster.

Step 1: Go into the ec2 directory

Change directory into the "ec2" directory. In my case, I downloaded Spark onto my desktop, so I ran this command.

$ cd Desktop/Apache/spark-1.5.1/ec2

[Screenshot: changing into the ec2 directory]

Step 2: Set environment variables

Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key.

$ export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU

$ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123

Step 3: Launch the spark-ec2 script

Launch the cluster by running the following command.

$ ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-east-1 --instance-type=c3.4xlarge -s 2 --copy-aws-credentials launch test-cluster 

[Screenshot: the spark-ec2 launch command running]

Where:

  • --key-pair=<name_of_your_key_pair>, the name of your EC2 key pair
  • --identity-file=<name_of_your_key_pair>.pem, the private key file
  • --region=<the_region_where_key_pair_was_created>
  • --instance-type=<the_instance_you_want>
  • -s N, where N is the number of slave nodes
  • "test-cluster" is the name of the cluster

In case you want to set other options for the launch of your cluster, further instructions can be found on the Spark documentation website.

As I mentioned earlier, this script also installs RStudio server, as can be seen in the figure below.

[Screenshot: RStudio server being installed by the script]

The cluster installation takes about 7 minutes. When it is done, the host address of the master node is displayed at the end of the log message as shown in the figure below. At this point your Spark cluster has been installed successfully and you are ready to start exploring and analyzing your data.

[Screenshot: launch complete; the master node's host address is printed]

Before you continue, you may be curious to see whether your cluster is actually up and running. Simply log into your AWS account and go to the EC2 dashboard. In my case, I have 1 master node and 2 slave/worker nodes in my Spark cluster.

[Screenshot: EC2 dashboard showing one master and two worker nodes]

Use the address displayed at the end of the launch message and access the Spark User Interface (UI) on port 8080. You can also get the host address of your master node by using the “get-master” option in the command below.

$ ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem get-master test-cluster

[Screenshot: Spark UI online on port 8080]

Step 4: Login to your cluster

In the terminal, you can log in to your master node by using the "login" option in the following command.

$ ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem login test-cluster

[Screenshot: logging in to the master node]

Step 5 (Optional): Start the SparkR REPL

Here you can actually start the SparkR REPL by typing the following command.

$ spark/bin/sparkR

[Screenshot: starting the SparkR REPL]

SparkR will be initialized and you should see a welcome message as shown in the Figure below. Here you can actually start working with your data. However most R users, like myself, would like to work in an Integrated Development Environment (IDE) like RStudio. See steps 6 and 7 on how to do so.

[Screenshot: SparkR welcome message]

Step 6: Create user accounts

Use the following command to list all available users on the cluster.

$ cut -d: -f1 /etc/passwd

[Screenshot: list of available user accounts]

You will notice that "rstudio" is one of the available user accounts. You can create other user accounts and passwords for them using these commands.

$ sudo adduser daniel

$ passwd daniel

In my case, I used the "rstudio" user account and changed its password.

[Screenshot: changing the rstudio user's password]

Initializing SparkR Using RStudio

The spark-ec2 script also created a "startSpark.R" script that we shall use to initialize SparkR.

Step 7: Login to RStudio server

Using the username you selected/created and the password you created, log in to RStudio server.

[Screenshot: RStudio server login page]

Step 8: Initialize SparkR

When you log in to RStudio server, you will see the "startSpark.R" script in your files pane (already created for you).

[Screenshot: startSpark.R in the RStudio files pane]

Simply run the "startSpark.R" script to initialize SparkR. This creates a Spark Context and a SQL Context for you.

[Screenshot: SparkR initialized with a Spark Context and SQL Context]
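For reference, the generated script boils down to something like the following. This is a sketch for Spark 1.5.x; the library path and master URL are placeholders that the real script fills in for your cluster:

# load the SparkR package shipped with the cluster (path is a placeholder)
library(SparkR, lib.loc = '/root/spark/R/lib')
# connect to the standalone master and create the two contexts
sc <- sparkR.init(master = 'spark://<master-host>:7077')
sqlContext <- sparkRSQL.init(sc)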

Step 9: Start Working with your Data

Now you are ready to start working with your data.

Here I use a simple example on the "mtcars" dataset to show that you can now run SparkR commands and use the MLlib library to run a simple linear regression model.

[Screenshot: linear regression example on mtcars in RStudio]
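The commands behind that figure are roughly as follows (a sketch; the exact formula in the screenshot may differ). SparkR 1.5 accepts R-style formulas for Gaussian and binomial models via MLlib:

# copy the local mtcars data frame into a Spark DataFrame
df <- createDataFrame(sqlContext, mtcars)
# fit a linear model with MLlib and inspect the coefficients
model <- glm(mpg ~ wt + cyl, data = df, family = 'gaussian')
summary(model)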

You can view the status of your jobs by using the host address of your master and listening on port 4040. This UI also displays the chain of RDD dependencies organized in a Directed Acyclic Graph (DAG), as shown in the figure below.

[Screenshot: DAG of RDD dependencies in the Spark UI]

Final Remarks

The objective of this blog post was to show you how to get started with Spark on AWS EC2 and initialize SparkR using RStudio. In the next blog post we shall look into working with actual "Big" datasets stored in different data stores such as Amazon S3 or MongoDB.

Further Interests: RStudio Shiny + SparkR

I am curious about how to use Shiny with SparkR, and in the next couple of days I will investigate this idea further. The question is: how can one use SparkR to power Shiny applications? If you have any thoughts, please share them in the comments section below and let’s discuss.

 

Filed under: Apache Spark, AWS, Big Data, Data Science, R, RStudio, SparkR Tagged: Apache Spark, AWS, Big Data, Data Science, R, RStudio, SparkR

To leave a comment for the author, please follow the link and comment on their blog: SparkIQ Labs Blog » R.


EARL 2015 in Boston: R Conference write-up by an attendee


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)


Written by Ben Young, an EARL Boston attendee. http://completelyabsorbed.com/

A week ago I flew to Boston, Massachusetts for EARL 2015. This was my first business trip, and as such I was very excited. The conference, speakers, and attendees did not disappoint. Mango Solutions put on a great schedule of workshops and talks. I would recommend attending future EARLs to anyone using R professionally.

Below is a summary of my time at EARL 2015.

Day 0
Workshops

Introduction to Rcpp
Dirk Eddelbuettel
Dirk gave a great introduction to Rcpp from the ground up. I am a novice C++ developer, and I was able to follow along from start to finish. I feel inspired to start working with Rcpp myself. Two ideas are (a) write a package facilitating connection with MIDI devices, to enable producing music from R, and (b) a fast poker evaluator based on a fast C++ evaluator.

Interactive Reporting with RMarkdown and Shiny
Garrett Grolemund
I taught myself Shiny about a year ago. Several months ago I started experimenting with RMarkdown. Garrett gave an enlightening, in-depth workshop on how to combine the two, with a result that is highly efficient and consistent.

Day 1
Talks

Monday morning I walked from my bed and (not) breakfast about a mile from the NERD center. Cambridge is beautiful; I highly recommend exploring.

R in Market Research – Handling 'Wide' (not Big) Data
Shad Thomas – Glass Box Research
A good approach at answering the question “how do you smartly reduce the amount of data without losing meaning?”

Heuristic Methods for Real World Optimization
Brandon Bass – Altenex LLC
Brandon’s talk has me interested in learning more about (a) Particle Swarm Optimization and (b) Evolutionary Algorithms. There was a funny, very meta moment when Brandon binged and googled ‘What is optimization?’. Being in a Microsoft building, it seemed fitting that Bing gave an excellent answer, and Google’s was off base.

How to do Survival Analysis of Health Data in R
Monika Wahi – DethWench Professional Services
I particularly appreciated Monika’s talk, as I do Survival Analysis as part of my work from time to time. Monika’s quote: “I prefer logistic regression, someone’s either dead or alive, and that’s pretty clear, linear regression is kind of waffley.” Her talk highlighted three approaches to survival analysis – parametric, semi parametric, and non parametric. In Monika’s line of work, semi parametric models are frequently used, specifically the Cox model. The non parametric Kaplan Meier is also frequently used.

Predictive Models for Neglected Disease Drug Discovery
Paul Kowalczyk – Syngenta Biotechnology
Paul’s background is in drug design. He led an engaged discussion using Shiny. Paul illustrated techniques in drug development using machine learning techniques such as random forest, SVM, and KNN. I especially appreciated Paul focusing on literate programming – being sure someone can run your code without you in the room, big things don’t need to be explained.

Visualizing Models
Jared Lander – Lander Analytics
Jared generated some beautiful, enlightening graphics. My favorite was the visualization of elastic nets with coefficient paths.

Visualization and Sensitivity Analysis of PK/PD Models in R
Yan Li – Celgene
Yan Li gave a compelling talk advocating for a paradigm shift to model based drug development. The methods usually followed now cost billions of dollars from start to finish for a drug. Yan broke down the processes of drug development, addressing issues, innovative solutions, and more.

Sharing Data between R and non R users
Aimee Gott – Mango Solutions
Aimee’s talk was my favorite of the conference. She talked at length about a solution Mango developed: their client wanted their R users to be able to seamlessly collaborate with their Excel users. She also gave a visual tour of how this solution manifested. Overall a very impressive application.

Customizing R Machine Learning to Your Problem with Caret
Marcos Pereira – Millward Brown
Marcos covered the caret package, customizing the summary function, and customizing the caret models. A great exploration of the package.

Creating Rich Analytic Presentations with the RCloud Framework
Doug Ashton – Mango Solutions
Doug demoed RCloud, a product comparable to the IPython notebook. To me it looks like the perfect toolbox for implementing finely tuned scripts. I’m currently trying to get RCloud running on a remote machine, though admittedly the process is quite difficult.

Day 2
Talks continued

Opening Keynote 1
Richard Pugh – Mango Solutions
Richard gave an inspiring keynote, my favorite note of which is that going out and hiring “unicorns” is not reasonable. Richard showed tools that had been made to assist in scoring employees and prospective hires, and how he used these tools to “build a unicorn.”

Opening Keynote 2
Garrett Grolemund – RStudio
Garrett’s keynote was very interesting. He talked about how his career began as a psychologist, moving on speaking about how the brain processes information, that everything we perceive is inherently flawed. He made a lot of well-placed references to The Matrix. My biggest take away was that a data scientist’s job is to determine what the truth about reality is.

Measuring Brand Ad Effectiveness
Tim Hesterberg – Google
Tim gave a history of consumer surveys, as well as how Google collects, filters, and fits data from their surveys today. Tim kept his talk fresh and interesting by giving a narrative from the side of the sales department, as well as the side of the survey taker.

Performance Attribution for Equity Portfolios
Yang Lu – Hutchin Hill Capital
I spoke with Yang before his talk. He told me he’s been using R since college, that his workplace is very R friendly, and his bosses love R. His talk addressed the question “how do we measure portfolio performance?” Yang’s answer utilized a Brinson model, as well as a regression-based approach.

Using R and Bioconductor in Cancer Genetics and Precision Medicine
Aedin Culhane – Dana-Farber Cancer Institute and Harvard TH Chan School of Public Health
Aedin opened her talk by recalling the Horse Manure Crisis of 1894 as an example of shortsighted modeling: many were panicking about the growing amounts of horse manure, with no end in sight, and the advent of the automobile stopped this problem. Following that, Aedin’s talk explored personalized medicine and genome sequencing (in which R plays a large role, via the Bioconductor library).

Quantitative Portfolio Management with High Frequency Data
Jerzy Pawlowski – NYU Polytechnic School of Engineering
Jerzy showed some methods of implementing portfolio management, and spent some time discussing Garman-Klass, as well as Rogers-Satchell estimators. A bit of this talk was beyond my current level of understanding.

Garbage In, Garbage Out – Automating Data Quality
Rob Weyrauch
Rob described how EarlyWarning, a service for detecting fraud in banks that is wholly owned by 5 major banks, was created.

A declarative DSL for the plotly graphing library in R
Jack Parmer – Plotly
Plotly has recently released their javascript code. Jack showed that plotly can be used in Shiny apps, and that plotly graphs can be easily edited, making them convenient and usable for non-R people.

Deploying predictive models as APIs
Sean Lorenz – Domino Data Labs
Sean demoed his company’s product, which allows predictive models to be called via API. This product is not only cloud-based, but also available as an on-premise release.

Predicting Student Success at Scale: APIs and DSLs for Building and Integrating Many Models
Harlan Harris – Educational Advisory Board
Harlan talked about his company, and some principles that ring true: “the data science team does the data science”, and “use tools that you know to build tools that you’ll use”.

Scaling R for Real-world Business Analytics
Roger Fried – Teradata
Roger gave a demo of Teradata’s AsterR, highlighting its ability to easily perform glm-like operations on billions of rows.

EARL 2015: Boston was a fantastic time, and I learned a lot. I’m motivated to put this new knowledge to work, and plan to write more interesting posts very soon.

Thank you Ben for your blog post, to see some of Ben’s pictures head over to http://completelyabsorbed.com/

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Free Webinar: Learn to Map Unemployment Data in R


(This article was first published on AriLamstein.com » R, and kindly contributed to R-bloggers)

Last month I ran my first webinar (“Make a Census Explorer with Shiny”). About 100 people showed up, and feedback from the participants was great. I also had a lot of fun myself. Because of this, I’ve decided to do one more webinar before my free trial with the webinar service ends. Here are the details:


Title: Map US Unemployment Data with R, Choroplethr and Shiny

When: Thursday, November 19 12:00 PM – 1:00 PM PST

Agenda: In this free webinar I will explain:
-How US Unemployment data is measured and disseminated
-How to access the data in R
-How to map the data in R
-How to map the data with Shiny
-Q&A

Click here to register for the webinar.

Space is limited, so I recommend that you sign up soon to reserve your seat.


Example Analysis Using Unemployment Data

The following code generates a boxplot of US State Unemployment data. Note the dramatic jump between 2008 and 2009.

library(rUnemploymentData)
data(df_state_unemployment)

boxplot(df_state_unemployment[, 2:ncol(df_state_unemployment)],
    main = "US State Unemployment Rates",
    xlab = "Year",
    ylab = "Percent Unemployment")

[Boxplot: US State Unemployment Rates by year]

The boxplot tells us that unemployment rates in US States jumped dramatically between 2008 and 2009. However, it does not tell us how individual states were affected. To do this we can calculate the percent change between the years and create a choropleth map:


library(choroplethr)
library(choroplethrMaps)

df_state_unemployment$value = (df_state_unemployment$"2009" - df_state_unemployment$"2008") / df_state_unemployment$"2008" * 100

state_choropleth(df_state_unemployment, 
    title      = "Percent Change in US State Unemployment\n2008-2009",
    legend     = "% Change",
    num_colors = 1)

[Choropleth map: percent change in US state unemployment, 2008 to 2009]

Here we can see that Utah had the largest jump in unemployment between 2008 and 2009.
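If you would rather confirm that from the data than eyeball the map, one line does it. This assumes the state names live in a column called region, per the choroplethr convention:

df_state_unemployment$region[which.max(df_state_unemployment$value)]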


The post Free Webinar: Learn to Map Unemployment Data in R appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: AriLamstein.com » R.


Coming Soon -The Data Science Radar: The Pursuit of the Unicorn!


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)


By Hannah Evans

“By 2018, the United States will experience a shortage of 190,000 skilled data scientists and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge.” (McKinsey)

The ‘data scientist’: an increasingly valuable yet rare breed of individual, indispensable to any 21st-century organisation that wants to remain competitive. And yet what exactly is a ‘data scientist’? What skills are required to reap the aforementioned “actionable insights” from the abundance of “big data”?

In collaboration with our customers, Mango Solutions has drawn upon industry expertise to establish 6 core attributes of the contemporary ‘data scientist’: Communicator, Visualiser, Modeller, Programmer, Technologist and Data Wrangler.

To measure these skillsets, we have been developing the Data Science Radar; a conceptual framework that allows users to explore these character traits in more detail, to uncover their own data science profile type. An individual with high scores in all 6 character traits is a rare breed of data scientist indeed – the metaphorical ‘unicorn’.

Chief Data Scientist at Mango Solutions, Rich Pugh premiered this concept using a ‘Shiny’ web app to audiences at our annual EARL (Effective Applications of the R Language) conferences in Boston and London this quarter. Rich emphasised how essential the tool has become internally at Mango – “Data science is a broad and multi-faceted domain; as a team of data scientists, it is critical that our skillsets as a group are just as diverse – to match the ever changing boundaries of the unique field we operate in. The Data Science Radar allows us to do just that.”

It provides a visual map of the skillsets within a data science facility – on an individual level, it can be used to understand personal strengths and potential areas for improvement. On an enterprise level, the Data Science Radar could be used to help team leaders in the following ways:

1. As a visual aid to support data scientist recruitment requirements – to highlight where there are gaps in the skillset of an existing team and to identify the skillsets of potential new recruits.

2. To help shape bespoke data science training courses for teams by objectively identifying training requirements of your existing staff.

3. To monitor learning during any long-term data science training programme.

Watch this space… the first iteration of the Data Science Radar is due to launch online in the next couple of weeks. You will be able to create your own radar via the Mango Solutions website, determine your own profile and share it on social media @DSradar #DSradar. Follow us for updates on our progress. How far away will your unicorn be?

In the meantime, why not get in touch to see how Mango Solutions can use the Data Science Radar to help you meet your data science requirements?

Telephone: +44 (0)1249 705 450
Email: training@mango-solutions.com

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Bioenergetics in R Workshop


(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

It was just brought to my attention that there will be a workshop at the upcoming Midwest Fish and Wildlife Conference (Grand Rapids, MI) on the Bioenergetics 4.0 shiny app. The announcement from here (where there is a registration link) is below (I added the links):


Instructors:

  • Dr. David Deslauriers, Post Doctoral Fellow, Department of Biological Sciences, University of Manitoba
  • Dr. Steven R. Chipps, Unit Leader, USGS South Dakota Cooperative Fish & Wildlife Research Unit, Department of Natural Resource Management, South Dakota State University

Bioenergetics models are widely used as a tool in fisheries management and research. Although Fish Bioenergetics 3.0 (Hanson et al. 1997) remains a popular software package, it is now over 18 years old and is incompatible with many new operating systems. Moreover, since Fish Bioenergetics 3.0 was released, the number of published fish bioenergetics models has increased from 33 to 115 models. This workshop will introduce Fish Bioenergetics 4.0, an R-based platform that consists of a graphical user interface application (Shiny by RStudio). Instructors will provide an overview of bioenergetics concepts and applications, and introduce attendees to the new modeling platform. Example exercises and group projects will be covered to aid in navigating the software and to answer basic and applied questions in fish ecology.

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.


17 (new) R jobs from around the world (for 2015-11-16)


This is the bi-monthly R-bloggers post (for 2015-11-16) for new R Jobs.

To post your R job on the next post

Just visit this link and post a new R job to the R community (it’s free and quick).

New R jobs

Job seekers: please follow the links below to learn more and apply for your job of interest:

  1. Full-Time
    BioStatistician / Quantitative Ecologist @ Saint Petersburg, Florida, United States
    Florida Fish and Wildlife Conservation Commission – Posted by ehleone
    Saint Petersburg
    Florida, United States
    16 Nov 2015
  2. Freelance
    Looking for a Freelance R-Developer with Shiny experience
    Neuronworld – Posted by Pal
    Manassas
    Virginia, United States
    15 Nov 2015
  3. Full-Time
    Data Analyst @ New York
    Vericred – Posted by vericred1
    New York
    New York, United States
    13 Nov 2015
  4. Freelance
    R/Shiny App with d3 (small job, quick turnaround, $250 < 4hrs)
    mariereilly
    Anywhere
    13 Nov 2015
  5. Full-Time
    Postdoctoral position (Duke’s River center) in data visualization and ecosystem science @ Durham, North Carolina, United States
    Duke University – Posted by emily.bernhardt
    Durham
    North Carolina, United States
    11 Nov 2015
  6. Full-Time
    Software Engineer for The Computational Biology Program at Oregon Health and Science University @ Portland, Oregon, United States
    Oregon Health & Science University – Posted by takabaya
    Portland
    Oregon, United States
    10 Nov 2015
  7. Full-Time
    Data Scientist @ Israel
    rachel
    South District
    Israel
    10 Nov 2015
  8. Full-Time
    Jr Financial Engineer
    Hull Investments – Posted by robertp
    Chicago
    Illinois, United States
    9 Nov 2015
  9. Full-Time
    Postdoc (for 2 years) @ Beaufort / North Carolina
    NRC – Posted by tsvn
    Beaufort
    North Carolina, United States
    9 Nov 2015
  10. Full-Time
    Junior and Senior Data Scientist @ Medellín / Colombia
    IDATA – inteligencia analítica – Posted by idata_co
    Medellín
    Antioquia, Colombia
    9 Nov 2015
  11. Temporary
    Shiny GUI developer (small gig)
    pete_and_co – Posted by peter.shepard
    Anywhere
    7 Nov 2015
  12. Full-Time
    Senior Research Associate (Statistician) @ Utrecht, Netherlands
    Mapi Group – Posted by akarabis
    Utrecht
    Utrecht, Netherlands
    6 Nov 2015
  13. Full-Time
    R developer @ Boston
    DataRobot – Posted by DataRobot
    Boston
    Massachusetts, United States
    5 Nov 2015
  14. Full-Time
    Data Scientist @ Southfield / Michigan
    w3r Consulting – Posted by dlogan@w3r.com
    Southfield
    Michigan, United States
    5 Nov 2015
  15. Full-Time
    Senior Data Scientist @ Petah Tikva / Israel
    Intel’s Big Data Analytics Group @ Israel – Posted by Leonid.S
    Petah Tikva
    Center District, Israel
    4 Nov 2015
  16. Full-Time
    Data Scientist @ Newton/Massachusetts
    Silent Spring Institute – Posted by janetsilentspring
    Newton
    Massachusetts, United States
    4 Nov 2015
  17. Full-Time
    Data Engineer @ New York
    Harmony Institute – Posted by jamie@harmony-institute.org
    New York
    New York, United States
    2 Nov 2015

 

(In R-users.com you may see all the R jobs that are currently available)


(you may also look at previous R jobs posts).

PubMed search Shiny App using RISmed


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In part one of a series of tutorials, we will develop a Shiny App for performing analysis of academic text from PubMed. There’s no shortage of great tutorials for developing a Shiny App using R, including Shiny’s own tutorial. Here at datascience+ we have a perfect introduction by Teja Kodali and a more in-depth development by J.S. Ramos. Here I will focus on the basics of making PubMed queries using the RISmed package, and to demonstrate how easily you can share any of your R functionality using Shiny. Click here to see the App in action and follow along. In this introductory tutorial, we’ll get a taste of what we can accomplish, try to cover all the basics, and hopefully streamline some potential time-sinks.

About PubMed and RISmed

PubMed is a public query database of journal articles and other literature maintained and made available by the National Institutes of Health. The RISmed package extracts content from the Entrez Programming Utilities (E-Utilities) interface to the PubMed query system and database at NCBI into R. You will find great tutorials to get RISmed up and running by checking out this introduction or this very nice post by some dude named Dave Tang. You can find Stephanie Kovalchik’s terrific RISmed github page here. PubMed is a perfect place to search for scientific and health-related text, and coupled with Natural Language Processing tools in R, we can create powerful Meta-Analyses. Let’s get started!
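Before wiring RISmed into a Shiny app, it is worth running one bare query at the console to see the two-step pattern we will use throughout. A minimal sketch; the keyword and date range here are arbitrary:

library(RISmed)

# step 1: summarize the query (how many records matched)
res <- EUtilsSummary('shiny', type = 'esearch', db = 'pubmed',
                     datetype = 'pdat', mindate = 2010, maxdate = 2015)
QueryCount(res)

# step 2: fetch the matching records themselves
fetch <- EUtilsGet(res, type = 'efetch', db = 'pubmed')
head(ArticleTitle(fetch))  # titles of the first few articles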

Placeholder UI

Creating a shiny app requires the customary server.R and ui.R scripts in an empty directory. We will tackle the user interface head-on in the next post, so for now let’s get up and running with the following:

  
  library(shiny)
  library(shinythemes)
  shinyUI(fluidPage(theme=shinytheme("united"),
                    
                    headerPanel("PubMed Search"),
                    sidebarLayout(
                      sidebarPanel(
                        helpText("Type a word below and search PubMed to find documents that contain that word in the text.
                                 You can even type multiple words. You can search authors, topics, any acronym, etc."),
                        textInput("text", label = h3("Keyord(s)"), value = "carson chow"),
                        helpText("You can specify the start and end dates of your search, use the format YYYY/MM/DD"),
                        textInput("date1", label = h3("From"),value="1990/01/01"),
                        textInput("date2", label = h3("To"),  value = "2015/11/07"),
                        helpText("Now select the output you'd like to see. 
                                 You can see a barplot of articles per year, a wordcloud of the abstract texts, or a table of the top six authors"),
                        actionButton("goButton","PLOT"),
                        actionButton("wordButton","WORDS"),
                        actionButton("authButton","AUTHORS")
                       ),
                      

                      mainPanel(
                        plotOutput("distPlot"),
                        plotOutput("wordPlot"),
                        tableOutput("authList")
                      )
                      )))

We are designing an out-of-the-box Shiny UI with text entry boxes that allow the user to type their keyword(s), and specify a time frame using a start date and end date. We need to provide action buttons for users to select the output of their search instead of running it automatically, otherwise we run the risk of offending the E-Utilities servers with too many searches per second as the user types their query word (and looking glitchy in the meantime).

Act I

We will go through the server.R code in three parts, and demonstrate the RISmed package as we go. Basically we are developing three calls to E-Utilities, one for each action button on the UI. After loading all necessary packages, we activate the distPlot output expression for the user’s keyword when the user clicks the goButton action button, which the user sees on the UI as “PLOT”. RISmed then searches PubMed, and we construct a barplot of documents per year and add a line representing the sum of documents to date containing that keyword:

library(shiny)
library(SnowballC)
library(qdap)
library(ggplot2)
library(RISmed)
library(wordcloud)

shinyServer(function(input, output) {
  word1<-eventReactive(input$goButton, {input$text})
  
  
  output$distPlot <- renderPlot({
    
    d1<-input$date1
    d2<-input$date2
    
    res <- EUtilsSummary(word1(), type="esearch", db="pubmed", datetype='pdat', mindate=d1, maxdate=d2, retmax=500)
    fetch <- EUtilsGet(res, type="efetch", db="pubmed")
    count<-table(YearPubmed(fetch))
    count<-as.data.frame(count)
    names(count)<-c("Year", "Counts")
    num <- data.frame(Year=count$Year, Counts=cumsum(count$Counts)) 
    num$g <- "g"
    names(num) <- c("Year", "Counts", "g")
    q <- qplot(x=Year, y=Counts, data=count, geom="bar", stat="identity")
    q <- q + geom_line(aes(x=Year, y=Counts, group=g), data=num) +
      ggtitle(paste("PubMed articles containing '", word1(), "' ", "= ", max(num$Counts), sep="")) +
      ylab("Number of articles") +
      xlab(paste("Year\nQuery date: ", Sys.time(), sep="")) +
      labs(colour="") +
      theme_bw()
    q
  })

Here is the plot we made:

[Barplot: PubMed articles per year for the query, with a cumulative-total line]

We have created an object res that contains the results of the EUtilsSummary function in RISmed, according to its arguments. In our case we are searching for the user’s keyword using an esearch, specifying the PubMed database (there are others), and we wish to search by date of publication to PubMed within the user’s date range (be sure to use YearPubmed to avoid NAs). The query date is stamped on the plot’s x-axis label using Sys.time(). Then we create another object called fetch, where we store the results of the EUtilsGet function that performs an efetch on our summary. This fetch object is subsettable; the year each article was published to PubMed is accessed by YearPubmed. We use the table function to count the number of documents per year to construct our plot. RISmed is more than just a wrapper for E-Utilities: it provides a seamless query and produces organized output ready for us to analyze.

Act II

In the second section of server.R, we produce a wordcloud of the most frequent terms in the abstracts of the documents that contain the user’s keyword. Again, we will wait to construct the wordcloud until the user clicks wordButton, which is the action button labeled “WORDS” on the UI:

  
  word2<-eventReactive(input$wordButton, {input$text})
  
  output$wordPlot<-renderPlot({
    d1<-input$date1
    d2<-input$date2
    res <- EUtilsSummary(word2(), type="esearch", db="pubmed", datetype='pdat', mindate=d1, maxdate=d2, retmax=500)
    fetch <- EUtilsGet(res, type="efetch", db="pubmed")
    articles<-data.frame('Abstract'=AbstractText(fetch))
    abstracts<-as.character(articles$Abstract)
    abstracts<-paste(abstracts, sep="", collapse="") 
    wordcloud(abstracts, min.freq=10, max.words=70, colors=brewer.pal(7,"Dark2"))
  })

Here is the wordcloud:

[Wordcloud of the most frequent abstract terms]

Recall that in our first search we accessed the publication years using YearPubmed(fetch); now we access the abstracts using AbstractText(fetch). This is the beauty of RISmed: we can access many fields of an EUtilsGet object in our R environment using the approach above. This documentation page, as well as googling RISmed, is a good place to learn more. Shiny then easily allows us to share our findings and add interactivity.
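For instance, the same fetch object answers to many other accessors; a few examples from the Medline class documentation:

PMID(fetch)          # PubMed IDs of the articles
Title(fetch)         # journal titles
ArticleTitle(fetch)  # article titles
YearPubmed(fetch)    # publication years, as used earlier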

Act III

Wordclouds are a fun way to visualize common words, and get a qualitative sense of the words associated with the user’s keyword. Future analyses will require that our outputs be in data frames, so in anticipation we will produce our third output in tabular format. When the user clicks the authButton action button titled “AUTHORS” on the UI, we again pass the user’s keyword to a new EUtilsSummary, and grab the author names using Author(fetch):

  word3<-eventReactive(input$authButton, {input$text})
  
  output$authList<-renderTable({
    d1<-input$date1
    d2<-input$date2
    res <- EUtilsSummary(word3(), type="esearch", db="pubmed", datetype='pdat', mindate=d1, maxdate=d2, retmax=500)
    fetch <- EUtilsGet(res, type="efetch", db="pubmed")
    AuthorList<-Author(fetch)
    LastFirst<-sapply(AuthorList, function(x)paste(x$LastName,x$ForeName))
    auths<-as.data.frame(sort(table(unlist(LastFirst)), dec=TRUE))
    colnames(auths)<-"Count"
    auths <- cbind(Author = rownames(auths), auths)
    rownames(auths) <- NULL
    auths<-head(auths, 6) 
  })

This is how the author table looks:

[Table: the top six authors and their counts]

Note that RISmed stored the author information in a list (of documents) of data frames (of authors), so in this case we used sapply to extract each data frame (watch out for this occasional wrinkle).
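To see that structure for yourself before flattening it, inspect one element of the list (a quick sketch):

AuthorList <- Author(fetch)
str(AuthorList[[1]])  # one data frame of authors per document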

Denouement

Remember that the most important component of our App is its benefit to the user. Here we have created a platform where a user can search a scientist’s name and quickly learn who that individual collaborates with, or collaborated with during a certain timeframe, how many publications that person has on PubMed, when they were added, as well as an idea of what type of research that scientist does. A user can also search drug names or molecule acronyms, and see the names of the researchers who are most associated with that keyword in PubMed. The user could then paste that name into the keyword box and see a wordcloud that might shed light on lab techniques or targets, or suggest that particular person’s field of interest.

Coming soon…

In the next tutorial, we want to lay some groundwork for sharing future analyses by getting more control over our UI. We will see a few easy solutions provided by our good friends at Shiny, as well as walk through the process of creating a fully customized UI with our own html code. Also, check back soon for a link to a brief demonstration of this App for more ideas, and maybe some additional features.

Eventually we will be introducing other R packages that specialize in Natural Language Processing and sophisticated machine learning to peek at how data science is transforming the medical literature landscape. See you then!

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.
