5 must-have R programming tools




R, along with Python, is one of the most popular tools for conducting data science. Propelled by a historically strong open-source developer community (R is about 25 years old, older than some data scientists), R is now strongly sought after by employers hiring data scientists. Although R by itself is extremely powerful, there are a few other crucial tools any R user should become familiar with. In no particular order, we have:

1- RStudio

Most R users have probably heard of RStudio. It is by far one of the most popular R tools in existence, and you probably already have it. That doesn't preclude it from inclusion here, though, because RStudio truly is a must-have. The interface conveniently divides into four quadrants that are essential to working efficiently with R: (upper left) your current file, (upper right) your current workspace, containing variables and other objects, (lower left) an R console, and (lower right) a window for documentation, graphics, and files. You can even access Git through RStudio.


RStudio is crucial because it keeps you agile: you always know where you are, by viewing your current file and workspace, and where you are going, by using the console for experimentation or the documentation viewer for understanding functions.

2- lintr


If you come from the world of Python, you've probably heard of linting. Essentially, a linter analyzes your code for style and readability problems. It makes sure you don't produce code that looks like this:

# This is some bad R code
if ( mean(x,na.rm=T)==1) { print("This code is bad"); } # Still bad code because this line is SO long

There are many things wrong with this code. For starters, the line is too long; nobody likes to read code with seemingly endless lines. There are also no spaces after the commas in the mean() call and no spaces around the == operator. Data science is often done hastily, but linting your code is a good reminder to write portable, understandable code. After all, if you can't explain what you are doing or how you are doing it, your data science job is incomplete. lintr is an R package, growing in popularity, that lets you lint your code. Once you install lintr, linting a file is as easy as lint("filename.R").
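As a sketch of what a cleaned-up version of the snippet above looks like, and how you would run lintr over it (the filename here is illustrative, not from the original post):

```r
# install.packages("lintr")  # if not already installed
library(lintr)

# A tidied version of the bad snippet: spaces after commas, spaces
# around operators, TRUE instead of T, and a line of reasonable length.
x <- c(1, 1, NA)
if (mean(x, na.rm = TRUE) == 1) {
  print("This code is good")
}

# Lint a saved script; each remaining style issue is reported
# with its file, line, and a short description.
lint("analysis.R")
```

Running lint() on the original one-liner would flag the long line, the missing whitespace, and the use of T instead of TRUE.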

3- Caret


caret, which you can find on CRAN, is central to a data scientist's toolbox in R. It lets you quickly develop models, set cross-validation methods, and analyze model performance, all in one place. Right out of the box, caret abstracts away the differing interfaces of model-fitting functions from many packages, so you can swiftly fit anything from averaged neural networks to boosted trees. It can even handle parallel processing. Among the models caret supports are AdaBoost, decision trees and random forests, neural networks, stochastic gradient boosting, nearest neighbors, and support vector machines: many of the most commonly used machine learning algorithms.
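As a minimal sketch of that workflow (the built-in iris data, the k-nearest-neighbors method, and the five-fold setting are just illustrative choices, not from the original post):

```r
# install.packages("caret")  # if not already installed
library(caret)

set.seed(42)  # make the cross-validation folds reproducible

# Configure five-fold cross-validation once; the same control
# object can be reused across different models.
ctrl <- trainControl(method = "cv", number = 5)

# Fit a k-nearest-neighbors classifier on the built-in iris data.
# Swapping method = "knn" for another string (e.g. a tree-based
# method) changes the algorithm without changing this code.
fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

print(fit)  # summarizes accuracy across the folds
```

The uniform train() interface is the point: the formula, data, and resampling scheme stay the same no matter which underlying algorithm you pick.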

4- Tidyverse


You may not have heard of the tidyverse as a whole, but chances are you've used one of its packages. The tidyverse is a set of unified packages meant to make data science… easyr (classic R pun). These packages alleviate many of the problems a data scientist runs into when dealing with data: loading it into your workspace, manipulating it, tidying it, and visualizing it. Undoubtedly, they make working with data in R more efficient.

Getting the tidyverse is incredibly easy: just run install.packages("tidyverse") and you get:

ggplot2: A popular R package for creating graphics
dplyr: A popular R package for efficiently manipulating data
tidyr: An R package for tidying up data sets
readr: An R package for reading in data
purrr: An R package which extends R’s functional programming toolkit
tibble: An R package which introduces the tibble (tbl_df), an enhancement of the data frame
By and large, ggplot2 and dplyr are two of the most widely used packages in the R ecosystem today, and you'll find countless Stack Overflow posts on how to use them.
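As a small sketch of how those two packages compose (the built-in mtcars data and the column choices are just an illustration):

```r
library(dplyr)
library(ggplot2)

# Summarize mean fuel economy per cylinder count with dplyr...
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

print(by_cyl)

# ...then visualize that summary with ggplot2.
ggplot(by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean MPG")
```

The pipeline style (%>%) is what makes the tidyverse packages feel like one toolkit rather than six separate ones.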

(Fine print: install.packages("tidyverse") installs all of these, and library(tidyverse) attaches the core packages listed above in a single call. Non-core tidyverse packages, such as readxl, still have to be loaded individually.)

5- Jupyter Notebooks or R Notebooks


Data science MUST be transparent and reproducible. For that to happen, others have to be able to see your code! The two most common ways to share it are Jupyter notebooks and R Notebooks.

Essentially, a notebook (of either kind) lets you run R code block by block and shows the output block by block: you might summarize the data and check the output, then plot the data and view the plot, all within the notebook. Analyzing code and output becomes a single, simultaneous process. This helps data scientists collaborate and eases the friction of opening up someone else's code and understanding what it does. Notebooks also make data science reproducible, which lends validity to whatever data science work you do!
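As a sketch of that block-by-block flow (using the built-in mtcars data; in an R Notebook each of these would be its own chunk, with its output rendered directly beneath it):

```r
# Chunk 1: summarize the data, then inspect the output below the chunk.
summary(mtcars$mpg)

# Chunk 2: plot the data; in a notebook, the figure renders inline.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

In RStudio, saving an .Rmd file with output: html_notebook gives you exactly this run-a-chunk, see-the-result workflow without leaving the editor.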

Honorable Mention: Git
Last but not least, I want to mention Git, a version control system. Why use it? Well, it's in the name: Git lets you keep versions of the code you are working on. It also allows multiple people to work on the same project, with each change attributed to a specific contributor. You've probably heard of GitHub, undoubtedly the most popular Git hosting service.

You can visit my website at www.peterxeno.com and my Github at www.github.com/peterxeno
