Tuesday, April 19, 2016

Beginner's guide to R: Introduction

Interest in the R statistics language is soaring, especially for data analysis. Get to know this hot programming option in our beginner's guide


R is hot. Whether measured by more than 4,400 add-on packages, the 18,000-plus members of LinkedIn's R group, or the close to 80 R Meetup groups currently in existence, there can be little doubt that interest in the R statistics language, especially for data analysis, is soaring.
Why R? It's free, open source, powerful, and highly extensible. "You have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants," Google's chief economist told the New York Times back in 2009.
Learn to use R: Your hands-on guide
Because it's a programmable environment that uses command-line scripting, you can store a series of complex data-analysis steps in R. That lets you reuse your analysis work on similar data more easily than if you were using a point-and-click interface, notes Hadley Wickham, author of several popular R packages and chief scientist with RStudio.
That also makes it easier for others to validate research results and check your work for errors -- an issue that cropped up in the news recently after an Excel coding error was among several flaws found in an influential economics analysis report known as Reinhart/Rogoff.
The error itself wasn't a surprise, blogs Christopher Gandrud, who earned a doctorate in quantitative research methodology from the London School of Economics. "Despite our best efforts we always will" make errors, he notes. "The problem is that we often use tools and practices that make it difficult to find and correct our mistakes."
Sure, you can easily examine complex formulas on a spreadsheet. But it's not nearly as easy to run multiple data sets through spreadsheet formulas to check results as it is to put several data sets through a script, he explains.
Indeed, the mantra of "Make sure your work is reproducible!" is a common theme among R enthusiasts.
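As a minimal sketch of what "putting several data sets through a script" can look like, the example below stores the analysis steps once as a function and applies it to two data sets. The function name summarize_scores, the score column and both data frames are made up for illustration; they are not from the article.

```r
# Hypothetical helper: the same summary steps, stored once as a function.
summarize_scores <- function(df) {
  c(n = nrow(df), mean = mean(df$score), sd = sd(df$score))
}

# Two made-up data sets with the same layout.
study_a <- data.frame(score = c(10, 12, 14))
study_b <- data.frame(score = c(20, 25, 30, 35))

# Run every data set through the identical steps; one column per data set.
results <- sapply(list(a = study_a, b = study_b), summarize_scores)
print(results)
```

Because the steps live in one place, anyone checking the work can rerun them on another data set with the same layout without retyping anything.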

Who uses R? Relatively high-profile users of R include:

 
Facebook: Used by some within the company for tasks such as analyzing user behavior.
Google: There are more than 500 R users at Google, according to David Smith at Revolution Analytics, doing tasks such as making online advertising more effective.
National Weather Service: Flood forecasts.
Orbitz: Statistical analysis to suggest best hotels to promote to its users.
Trulia: Statistical modeling.
Source: Revolution Analytics
Why not R? Well, R can appear daunting at first. That's often because R syntax is different from that of many other languages, not necessarily because it's any more difficult than others.
"I have written software professionally in perhaps a dozen programming languages, and the hardest language for me to learn has been R," writes consultant John D. Cook in a Web post about R programming for those coming from other languages. "The language is actually fairly simple, but it is unconventional."
And so, this guide. Our aim here isn't R mastery, but giving you a path to start using R for basic data work: Extracting key statistics out of a data set, exploring a data set with basic graphics and reshaping data to make it easier to analyze.
Your first step
To begin using R, head to r-project.org to download and install R for your desktop or laptop. It runs on Windows, OS X and "a wide variety of Unix platforms," but not yet on Android or iOS.
Installing R is actually all you need to get started. However, I'd suggest also installing the free R integrated development environment (IDE) RStudio. It's got useful features you'd expect from a coding platform, such as syntax highlighting and tab for suggested code auto-completion. I also like its four-pane workspace, which better manages multiple R windows for typing commands, storing scripts, viewing command histories, viewing visualizations and more.
Although you don't need the free RStudio IDE to get started, it makes working with R much easier.
The top-left window is where you'll probably do most of your work. That's the R code editor allowing you to create a file with multiple lines of R code -- or open an existing file -- and then run the entire file or portions of it.
Bottom left is the interactive console where you can type in R statements one line at a time. Any lines of code that are run from the editor window also appear in the console.
The top right window shows your workspace, which includes a list of objects currently in memory. There's also a history tab with a list of your prior commands; what's handy there is that you can select one, some or all of those lines of code and one-click to send them either to the console or to whatever file is active in your code editor.
The window at bottom right shows a plot if you've created a data visualization with your R code. There's a history of previous plots and an option to export a plot to an image file or PDF. This window also shows external packages (R extensions) that are available on your system, files in your working directory and help files when called from the console.
Learning the shortcuts
Wickham, the RStudio chief scientist, says these are the three most important keyboard shortcuts in RStudio:
  • Tab is a generic auto-complete function. If you start typing in the console or editor and hit the Tab key, RStudio will suggest functions or file names; simply select the one you want and hit either Tab or enter to accept it.
  • Ctrl-up arrow (Cmd-up arrow on a Mac) is a similar auto-complete tool. Start typing and hit that key combination, and it shows you a list of every command you've typed starting with those keys. Select the one you want and hit Return. This works only in the interactive console, not in the code editor window.
  • Ctrl-Enter (Cmd-Enter on a Mac) takes the current line of code in the editor, sends it to the console and executes it. If you select multiple lines of code in the editor and then hit Ctrl/Cmd-Enter, all of them will run.
For more about RStudio features, including a full list of keyboard shortcuts, head to the online documentation.
Setting your working directory
Change your working directory with the setwd() function, such as:
setwd("~/mydirectory")
Note that R paths use forward slashes, even if you're on a Windows system (a backslash works only when doubled, as in "C:\\Sharon"). For Windows, the command might look something like:
setwd("C:/Sharon/Documents/RProjects")
If you are using RStudio, you can also use the menu to change your working directory under Session > Set Working Directory.
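A small sketch of the same idea that can be run anywhere: it checks the current working directory with getwd(), switches to R's temporary directory and then switches back. Using tempdir() as the target and building the path with file.path() are choices made for this illustration, not something the article prescribes.

```r
old_dir <- getwd()      # remember where we started
setwd(tempdir())        # switch to R's per-session temporary directory
print(getwd())          # confirm the change
setwd(old_dir)          # switch back

# file.path() joins path pieces with forward slashes on every platform.
project_path <- file.path("~", "mydirectory")
print(project_path)
```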
Installing and using packages
Chances are if you're going to be doing, well, pretty much anything in R, you're going to want to take advantage of some of the thousands of add-on packages available for R at CRAN, the Comprehensive R Archive Network. The command for installing a package is:
install.packages("thepackagename")
If you don't want to type the command, in RStudio there's a Packages tab in the lower-right window; click that and you'll see a button to "Install Packages." (There's also a menu command; the location varies depending on your operating system.)
To see which packages are already installed on your system, type:
installed.packages()
Or in RStudio, go to the Packages tab in the lower-right window.
To use a package in your work once it's installed, load it with:
library("thepackagename")
If you'd like to make sure your packages stay up to date, you can run:
update.packages()
That way, you get the latest versions for all your installed packages.
If you no longer need or want a package on your system, use the function:
remove.packages("thepackagename")
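The commands above can be combined so that a script installs a package only when it is missing. This is one possible pattern, not the article's own; the helper name is_installed is hypothetical, and "stats" is used only because it ships with R, so the install step is skipped here.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "stats"  # ships with R, so it should already be present
if (!is_installed(pkg)) {
  install.packages(pkg)  # only runs when the package is missing
}

# character.only = TRUE is needed when the package name is held in a variable.
library(pkg, character.only = TRUE)
```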
Help!
If you want to find out more about a function, you can type a question mark followed by the function name -- one of the rare times parentheses are not required in R, like so:
?functionName
This is a shortcut to the help function, which does use parentheses:
help(functionName)
However, I'm not sure why you'd want to use this as opposed to the shorter ?functionName command.
If you already know what a function does and just want to see formats for using it properly, you can type:
example(functionName)
You'll get a list with examples of the function being used, if there's one available. The arguments (args) function just displays a list of a function's arguments:
args(functionName)
If you want to search through R's help documentation for a specific term, you can use:
help.search("your search term")
That also has a shortcut:
??("my search term")
No parentheses are needed if the search term is a single word without spaces.
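For a concrete case, the sketch below inspects the base function sd(). The args() call is the one described above; formals() is an additional base R function, used here only to get the argument names as a character vector.

```r
# Display the arguments of sd(): x and na.rm.
args(sd)

# The same information as a character vector, handy inside scripts.
arg_names <- names(formals(sd))
print(arg_names)
```

At the interactive console, ?sd or example(sd) would be the natural next step.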
This story, "Beginner's guide to R: Introduction" was originally published by Computerworld.

Sunday, April 10, 2016

Graphics Resources

Graphics Resources:
Blog for R & graphics: http://R-Bloggers.com
Hadley Wickham's web page: http://ggplot2.org
Discussion list devoted to ggplot2:
http://groups.google.com/group/ggplot2

ggplot()

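# Load ggplot2 before calling ggplot()
library(ggplot2)

# Scatterplot of pretest vs. posttest, with a point shape and a linear fit line per gender
ggplot(mydata100, aes(pretest, posttest, shape = gender, linetype = gender)) +
  geom_point(size = 5) +
  geom_smooth(method = "lm")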

# Create your theme
my_white <-theme_bw()+theme(plot.title = element_text(size = rel(3)),
       panel.grid.major.x = element_blank(),
       panel.grid.minor.x = element_blank(),
       panel.grid.major.y = element_blank(),
       panel.grid.minor.y = element_blank())
 
# Plot!
ggplot(mydata100, aes(pretest, posttest, shape = gender, linetype = gender)) +
  geom_point(size = 2) +
  facet_grid(workshop ~ gender) +
  geom_smooth(method = "lm") +
  labs(title = "Combination Plot", x = "Before Workshop", y = "After Workshop") +
  my_white

Saturday, April 9, 2016

aes()

To include the Aesthetics component of the Grammar of Graphics, you add a call to the aes() function as the second argument to ggplot(). The aes() function takes a list of name-value pairs, each mapping an aesthetic (such as x, y, shape or colour) to a variable in the data.
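A minimal sketch of such a mapping, assuming ggplot2 is installed. It uses the built-in mtcars data set rather than anything from these notes, and the choice of variables is arbitrary.

```r
library(ggplot2)

# Map car weight to x, fuel economy to y, and cylinder count to point colour.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# The name-value pairs given to aes() are stored on the plot object.
print(names(p$mapping))
```

Calling print(p) draws the plot itself.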

Wednesday, April 6, 2016

mutate()

The mutate() function adds or changes variables (columns) only; it does not add or remove observations (rows).
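A short sketch, assuming dplyr is installed. The scores data frame and its column names are made up for illustration.

```r
library(dplyr)

# Made-up data: three observations, two variables.
scores <- data.frame(pretest = c(70, 80, 90), posttest = c(75, 85, 88))

# Add a new column computed from existing ones; the number of rows is unchanged.
scores_gain <- mutate(scores, gain = posttest - pretest)
print(scores_gain)
```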

Tuesday, April 5, 2016

Logic Rules and Functions

Logic Operators:

Equals:            ==
Less than:         <
Greater than:      >
Less or equal:     <=
Greater or equal:  >=
Not:               !
Not equal:         !=
And:               &
Or:                |
Value matching:    %in%
Exclusive or:      xor(a, b)
0<=x<=1:           (x>=0)&(x<=1)
NA size:           Just missing (a comparison with NA returns NA)
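The operators above can be tried on a small vector in base R. The values in x are arbitrary and include one NA to show how missing values behave.

```r
x <- c(-0.5, 0.2, 1, NA)

# 0 <= x <= 1 has to be written as two comparisons joined with &.
in_unit <- (x >= 0) & (x <= 1)
print(in_unit)                  # FALSE TRUE TRUE NA

# Value matching and exclusive or.
print(3 %in% c(1, 2, 3))        # TRUE
print(xor(TRUE, FALSE))         # TRUE

# Comparing with NA gives NA, so test for missing values with is.na().
print(NA == 1)                  # NA
print(is.na(x))                 # FALSE FALSE FALSE TRUE
```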

Friday, April 1, 2016

dplyr's select Function

# A data frame businesshours is pre-loaded in the workspace.

# Load the `dplyr` package into the memory.
library("dplyr")

# Use the `select()` function to select all variables starting with the variable "period" until "QR3" and all the variables in between them.
select(businesshours, period:QR3)

# Use the `select()` function to select all variables that contain "o".
select(businesshours, contains("o"))

# Use the `select()` function to select all variables whose names start with "Q".
select(businesshours, starts_with("Q"))

# Use the `select()` function to select all variables with a numeric range from 2 to 4 and starting with "QR".
select(businesshours, num_range("QR", 2:4))

# Use the `select()` function to drop the variables named "QR" followed by a number from 2 to 4, keeping all the others.
select(businesshours, -num_range("QR", 2:4))
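The calls above assume businesshours is already in the workspace. To try them without it, here is a sketch with a made-up stand-in; the real column layout is not shown in these notes, so the columns below are assumptions chosen to exercise each helper. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for businesshours (the column layout is an assumption).
businesshours <- data.frame(
  period   = c("morning", "evening"),
  QR1      = c(1, 2),
  QR2      = c(3, 4),
  QR3      = c(5, 6),
  QR4      = c(7, 8),
  location = c("north", "south")
)

# Column names returned by each selection.
range_cols    <- names(select(businesshours, period:QR3))            # period, QR1, QR2, QR3
contains_cols <- names(select(businesshours, contains("o")))         # period, location
starts_cols   <- names(select(businesshours, starts_with("Q")))      # QR1 to QR4
num_cols      <- names(select(businesshours, num_range("QR", 2:4)))  # QR2, QR3, QR4
dropped_cols  <- names(select(businesshours, -num_range("QR", 2:4))) # period, QR1, location

print(list(range_cols, contains_cols, starts_cols, num_cols, dropped_cols))
```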

Monday, March 28, 2016

Add-ons

R has over 8,500 add-on packages, many containing multiple procedures, so it can do most of the things that SAS and SPSS can do and quite a bit more. The list below focuses on SAS and SPSS products and which of them have counterparts in R. As a result, some categories are extremely broad (e.g. regression) while others are quite narrow (e.g. conjoint analysis). This table does not contain the hundreds of R packages that have no counterparts in the form of SAS or SPSS products. There are many important topics (e.g. mixed models, survival analysis) offered by all three that are not listed because neither SAS Institute nor IBM’s SPSS Company sells a product focused just on that.
Advanced Models
  • SAS/STAT
  • IBM SPSS Advanced Statistics
  • R itself, MASS, many others
Association Analysis
  • SAS Enterprise Miner
  • IBM SPSS Association
  • R: arules, arulesNBMiner, arulesSequences
Basic Statistics
  • Base SAS
  • IBM SPSS Statistics Base
  • R
Bootstrapping
  • SAS/STAT
  • IBM SPSS Bootstrapping
  • R: BootCL, BootPR, boot, bootRes, BootStepAIC, bootspecdens, bootstrap, FRB, gPdtest, meboot, multtest, pvclust, rqmcmb2, scaleboot, simpleboot
Classification Analysis
  • SAS Enterprise Miner
  • IBM SPSS Classification
  • R: rattle; see also Neural Networks and Trees
Conjoint Analysis
  • SAS/STAT: PROC TRANSREG
  • IBM SPSS Conjoint
  • R: homals, psychoR, bayesm
Correspondence Analysis
  • SAS/STAT::PROC CORRESP
  • IBM SPSS Categories
  • R: ade4, cocorresp, FactoMineR, homals (most like SPSS Categories), made4, MASS, psychoR, PTAk, vegan
Data Access
  • SAS/ACCESS
  • SPSS Data Access Pack
  • DBI, foreign, gdata::read.xls, Hmisc::sas.get, SAScii, sasxport.get, RODBC, sas7bdat (best choice for reading SAS files), WriteXLS, xlsReadWrite, XLconnect (best choice for Excel)
Data Collection
  • SAS/FSP
  • IBM SPSS Data Collection Family
  • R: none; MySQL or PostgreSQL are popular among R users for this purpose
Data Mining
  • SAS Enterprise Miner
  • IBM SPSS Modeler (formerly Clementine)
  • arules, FactoMineR, Rattle, Red-R, RWeka link to Weka, various functions
Data Mining, In-database Processing
  • SAS In-Database Initiative with Teradata
  • IBM SPSS Modeler
  • PL/R for PostgreSQL, RODM for Oracle
Data Preparation
  • SAS: Various procedures
  • IBM SPSS Data Preparation, various commands
  • R: these are specific to data error checking: assertr, deducorrect, ensurer (dprep is no longer being maintained); these are more general purpose: dplyr, plyr, reshape, reshape2, sqldf, tidyr, various functions
Developer Tools
  • SAS/AF, SAS/FSP, SAS Integration Technologies, SAS/TOOLKIT
  • IBM SPSS Statistics Developer, IBM SPSS Statistics Programmability Extension
  • R links to most popular compilers, scripting languages, and databases, StatET
Direct Marketing
  • SAS doesn’t have anything like this
  • IBM SPSS Direct Marketing
  • R doesn’t have anything like this
Exact Tests
  • SAS/STAT various procedures
  • IBM SPSS Exact Tests
  • R: coin, elrm, exact2x2, exactLoglinTest, exactmaxsel, and options in many others
Excel Integration
  • SAS Add-in for Microsoft Office, SAS Enterprise BI Server
  • SPSS: none (SPSS Advantage for Excel is discontinued)
  • RExcel
Forecasting
  • SAS/ETS
  • IBM SPSS Forecasting
  • R: over 40 packages that do time series are described in the CRAN Task Views under Time Series
Forecasting, Automated
  • SAS Forecast Server
  • IBM SPSS Forecasting
  • R: forecast
Genetics
  • SAS: JMP Genomics
  • SPSS: None
  • R: Bioconductor
Geographic Information Systems
  • SAS/GIS, SAS/GRAPH
  • SPSS Base
  • R: maps, mapdata, mapproj, GRASS via spgrass6, RColorBrewer, see Spatial in CRAN Task Views
Graphical user interfaces
  • SAS Enterprise Guide, IML Studio, SAS/ASSIST, Analyst, Insight
  • IBM SPSS Statistics Base
  • R: Menus & dialog boxes: Deducer, R Commander
    Data Mining: rattle, Red-R
Graphics, Interactive
  • SAS/IML Studio, SAS/INSIGHT, JMP
  • SPSS: none
  • R: cranvas, rggobi link to GGobi, iPlots, latticist, playwith, TeachingDemos
Graphics, Static
  • SAS/GRAPH
  • SPSS Base, Graphics Production Language
  • R: ggplot2, gplots, graphics, grid, gridBase, hexbin, lattice, plotrix, scatterplot3d, vcd, vioplot, geneplotter, Rgraphics
Graphics, Template Builder
  • SAS: doesn’t use the Grammar of Graphics model that forms the core of IBM SPSS Viz Designer or R’s ggplot2
  • IBM SPSS Viz Designer
  • R: Deducer::Plot Builder
Guided Analytics
  • SAS/LAB
  • SPSS: none
  • R: none
Internet Control
  • SAS/IntrNet
  • SPSS: none
  • R: CGIwithR, Rweb (see also Server Version below)
Matrix/linear Algebra
  • SAS/IML
  • IBM SPSS Matrix
  • R has many matrix functions built in, matlab, Matrix, sparseM
Missing Values Imputation
  • SAS/STAT::PROC MI
  • IBM SPSS Missing Values
  • R: arrayImpute, arrayMissPattern, Amelia, cat, Hmisc::aregImpute, Hmisc::fit.mult.impute, EMV, longitudinalData, mi, mice (similar to SPSS & SAS approach), mitools, mvnmle, SeqKnn, VIM (nice visualization)
Neural Networks
  • SAS Enterprise Miner
  • IBM SPSS Neural Networks, IBM SPSS Modeler
  • R: AMORE, grnnR, neuralnet, nnet, rattle
Operations Research
  • SAS/OR
  • SPSS: none
  • R: glpk, linprog, LowRankQP, TSP

Output Management – this isn’t actually an add-on product but it’s so important that I include it here.
  • SAS: Output Delivery System (ODS)
  • SPSS: Output Management System (OMS)
  • R: this is built into base R, but the dplyr package combined with the broom package makes saving output for further analysis much easier. The older plyr package is slightly more flexible, but much slower. The data.table package is the fastest, though less popular than dplyr.
Power Analysis
  • SAS Power and Sample Size Application, SAS/STAT::PROC POWER, PROC GLMPOWER
  • SPSS: SamplePower
  • R: asypow, powerpkg, pwr, MBESS
Quality Control
  • SAS/QC
  • IBM SPSS Statistics Base
  • R: qcc, spc
Regression Models
  • SAS/STAT
  • IBM SPSS Regression
  • R, Hmisc, lasso, VGAM, pda, rms (replaces Design)
Sampling, Complex
  • SAS/STAT: PROC SURVEYSELECT, SURVEYMEANS, etc.
  • IBM SPSS Complex Samples
  • R: pps, sampfling, sampling, spsurvey, survey
Segmentation Analysis
  • SAS Enterprise Miner
  • IBM Modeler Segmentation
  • R: cluster, rattle, som, see CRAN Task Views under Cluster for over 70 packages
Server Version
  • SAS, SAS Enterprise Miner for your server
  • IBM SPSS Statistics Server, IBM SPSS Modeler Server
  • R for your server, rapache, R(D)COM Server, Rserve, StatET
Structural Equation Modeling
  • SAS/STAT::PROC CALIS
  • SPSS: Amos
  • R: lavaan (can “mimic” Mplus or EQS output), OpenMx, sem
Tables
  • Base SAS, PROC REPORT, PROC SQL, PROC TABULATE, SAS Enterprise Reporter
  • IBM SPSS Custom Tables
  • R: for display, the compareGroups, tables and rreport packages are the most similar. The xtable package converts various tabular types of output to HTML or LaTeX. texreg does a wonderful job of comparing multiple models side-by-side. To create tables for use in further analysis (rather than for display): base::aggregate, Epi::stat.table, plyr, reshape2, base::tapply. The MRCV package handles Multiple Response Categorical Variables (“check all that apply” items on surveys).
Text Analysis/Mining
  • SAS Enterprise Content Categorization, SAS Ontology Management, SAS Sentiment Analysis, SAS Text Miner
  • IBM SPSS Text Analytics, IBM SPSS Text Analysis for Surveys
  • R: corpora, emu, gsubfn, kernlab, KoNLP, koRpus, languageR, lsa, maxent, openNLP, openNLPmodels.en, openNLPmodels.es, RcmdrPlugin.TextMining, RKEA, RQDA, RTextTools, RWeka, Snowball, tautextcat, TextRegression, tm, tm.plugin.dc, tm.plugin.mail, topicmodels, wordcloud, wordnet, zipfR
Trees, Decision, Classification or Regression
  • SAS Enterprise Miner
  • IBM SPSS Decision Trees, IBM SPSS AnswerTree, IBM SPSS Modeler (formerly Clementine)
  • R: ada, adabag, BayesTree, boost, caret, GAMboost, gbev, gbm, maptree, mboost, mvpart, party, pinktoe, quantregForest, rpart, rpart.permutation, randomForest, rattle, tree
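As a sketch of the base R side of the Output Management and Tables entries above, the example below builds summary tables with base::aggregate and base::tapply and saves regression output as a data frame for further analysis. The built-in mtcars data set and the mpg ~ wt model are arbitrary choices for illustration; dplyr and broom, mentioned above, offer a more convenient route but are not used here.

```r
# Table for further analysis: mean mpg per cylinder count, as a data frame.
means_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(means_by_cyl)

# The same summary as a named vector.
print(tapply(mtcars$mpg, mtcars$cyl, mean))

# Saving model output: the coefficient table of a regression as a data frame.
fit <- lm(mpg ~ wt, data = mtcars)
coef_table <- as.data.frame(coef(summary(fit)))
print(coef_table)
```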
My thanks go out to the many people who helped compile this table including: Thomas E. Adams, Liviu Andronic, Jonathan Baron, Roger Bivand, Jason Burke, Patrick Burns, David L. Cassell, Dennis Fisher, Peter Flom, Tal Galili, Chao Gai, Bob Green, Frank E. Harrell Jr., Rob Hyndman, Robert I. Kabacoff, Max Kuhn, Paul Murrell, Yves Rosseel, Charilaos Skiadas, Greg Snow, Antony Unwin, Tobias Verbeke, Kyle Weeks, Graham Williams, and David Winsemius.
All SAS and SPSS product names are registered trademarks of their respective companies.
Copyright 2008, 2009, 2010, 2011, 2012, 2013 Robert A. Muenchen.