One of the most common questions I get from data science students is “should I learn R or Python?”
In the last year or so, many of the blog posts at the Sharp Sight blog have been Python data science tutorials, and I’ve actually been using Python a lot myself.
But recently, since I began the R data analysis series for covid19 data, I started to remember just how much I love analyzing data with R, and why.
Why I love R, and why you should learn it
The fact is, even though Python is also excellent in particular areas, I still really love R.
R particularly shines in a few key areas. I want to discuss those somewhat, so you understand where R is strong and why you should learn it.
Areas where R is the best
There are really three key areas where R shines:
data wrangling
data visualization
data analysis

Data wrangling
Let’s start with data manipulation.
For data manipulation, R is slightly better than Python. Not by much, but a little.
R’s dplyr package and related packages (like tidyr, lubridate, stringr, forcats, etc.) make data manipulation extremely easy in R.
Everything “just works”.
The functions are well named, so you can remember the names of the functions. All of the functions are also highly modular. They do one thing and one thing only. So when you need to subset your rows, there’s a simple function to use (the dplyr filter function). When you want to subset your columns, there’s another simple function (the select function). Etcetera.
In R and the Tidyverse, everything is highly modular and everything fits together like little building blocks.
Honestly, data wrangling is just so easy with dplyr.
Python’s Pandas package is also good, almost as good as dplyr. What I think it’s missing is helper functions for some very specific data manipulation tasks. So sometimes, you need to figure out a workaround for something very specific.
Another way of saying this is that dplyr and R’s tidyverse have a function for 99% of the data manipulation tasks that you’ll need to do, whereas Pandas has a function for 95%.
Pandas is great, but R, dplyr, and the Tidyverse have a slight edge for data manipulation.
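For comparison, here is a minimal sketch of the two operations I mentioned above, subsetting rows and subsetting columns, written in Pandas. The DataFrame, its column names, and its values are all made up for illustration; the comments note the dplyr function that each step roughly corresponds to.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame({
    "country": ["A", "B", "C", "A"],
    "day": [1, 1, 1, 2],
    "cases": [30, 170, 84, 53],
})

# Subset rows (roughly dplyr's filter(df, cases > 50))
high_counts = df[df["cases"] > 50]

# Subset columns (roughly dplyr's select(df, country, cases))
country_counts = df[["country", "cases"]]

# Chain both steps, similar in spirit to a dplyr pipeline
result = df[df["cases"] > 50][["country", "cases"]]
print(result)
```

Both tools get the job done here. The difference I'm describing shows up on the more specialized tasks, not on basic subsetting like this.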
Data visualization
For data visualization (at least, static data visualizations) R’s ggplot2 is substantially better than anything for Python. Matplotlib is powerful, but the syntax is complicated and hard to use. Seaborn is easier to use, but it lacks the polish of ggplot2.
When I visualize data with ggplot2, everything “just works.”
Let me give you an example.
Recently for our R covid19 series, I created a small multiple chart:
Longtime readers of the Sharp Sight blog know that I really love small multiple charts. If you’re a data scientist, the small multiple chart should be one of the most powerful data visualization techniques in your toolkit.
But the fact is, small multiple charts are hard to create in most languages.
In Python, you can technically create one with Matplotlib or Seaborn.
But having said that, small multiples are a little harder to create with Seaborn. Can you create one? Yes. But formatting small multiple charts is a little cumbersome with Seaborn. Some of the code to format small multiples in Seaborn is a little buggy, so you have to resort to for-loops that manually modify low level chart properties. Frankly, it’s a pain in the ass.
And god … don’t get me started on Matplotlib. Matplotlib is so much more complicated than modern data visualization tools.
One of the mottos for Matplotlib is that “Matplotlib makes easy things easy and hard things possible.”
With due respect to the creators of Matplotlib, that’s kind of BS.
Matplotlib makes hard things possible. True. You can create almost any visualization with Matplotlib. But it’s always hard, compared to ggplot2.
In Matplotlib, the easy things are kind of hard. The hard things are really hard. If you want to spend your afternoon writing for-loops to visualize things that would take you 5 minutes in ggplot2, go right ahead.
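To give a sense of what those for-loops look like, here is a minimal small multiple sketch in Matplotlib. The group names and numbers are made up for illustration, and this is just one way to lay it out, not the only one. In ggplot2, the paneling itself is handled by a single `facet_wrap()` call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: one short series per group, invented for illustration
days = [1, 2, 3, 4]
series = {
    "A": [1, 2, 4, 8],
    "B": [2, 3, 3, 5],
    "C": [5, 4, 2, 1],
    "D": [1, 1, 2, 3],
}

# A 2x2 grid of panels that share their axes
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(6, 4))

# Loop 1: draw one panel per group and format each panel by hand
for ax, (name, values) in zip(axes.flat, series.items()):
    ax.plot(days, values)
    ax.set_title(name)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

# Loop 2: label the x axis on the bottom row only
for ax in axes[-1, :]:
    ax.set_xlabel("day")

# Loop 3: label the y axis on the left column only
for ax in axes[:, 0]:
    ax.set_ylabel("value")

fig.tight_layout()
# fig.savefig("small_multiples.png")  # uncomment to write the chart to disk
```

Three loops for a four-panel chart, and that is before any real polishing.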
The fact is, compared to almost anything in Python, ggplot2 is just. so. easy.
Dare I say, ggplot2 is a joy to use. (Ok. I said it.)
That’s not to say that ggplot2 is never a pain in the ass. It is. Sometimes. But usually, the hard things about ggplot2 are polishing a visualization up. The last 5 to 10 percent where you’re trying to take a chart from “rough draft” to “perfect” is the hard part. With ggplot2, the challenge is more about the process than the programming.
I’ll say it again: without question, my favorite data visualization toolkit is ggplot2, and to use it properly, you need to use R.
Data analysis
Data analysis is important. Very important. Quite possibly, the most important data science skill.
It’s a little more complicated than that, because data analysis ultimately breaks down into data wrangling and data visualization. That is, data analysis is mostly data wrangling and data visualization, applied in particular ways with a particular process and particular objectives. (So if data analysis is really important, you really need to learn data wrangling and data visualization first.)
Setting aside the nuance, I need to emphasize how important data analysis is to data science or “data analytics” more broadly.
No matter what project you’re working on, you need to analyze your data.
Doing machine learning? Great. Before you get started, you need to explore your data with exploratory data analysis.
Do you have a machine learning model that you’ve finished training? Ok then. You need to analyze the performance of the model and use analysis techniques to diagnose possible problems.
The same goes for “finding insights.” When you look at most job descriptions for data-related jobs, you’ll find that they almost always use the term “find insights.” Recruiters and hiring managers want you to be able to “find insights in data.”
What does that mean?
They want you to be able to analyze data and find things that will increase profits, cash flows, and shareholder value. At the end of the day, they’re really just talking about using data analysis to drive profitability and financial metrics.
Remember what I said though: data analysis is mostly just an application of data wrangling and data visualization, with a particular process.
To analyze data (i.e., to “find insights”), you need to subset, aggregate, and compute summary statistics. You need to use data wrangling.
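As a rough illustration of those three steps (subset, aggregate, compute summary statistics), here is a minimal Pandas sketch. The `region` and `sales` columns and all of the values are hypothetical, made up just to show the shape of the workflow.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales": [100, 150, 80, 120, 100],
})

# Subset: keep only the larger transactions
large = df[df["sales"] >= 100]

# Aggregate and summarize: count, mean, and max of sales per region
summary = df.groupby("region")["sales"].agg(["count", "mean", "max"])
print(summary)
```

A small table like `summary` is often the starting point for the "insight": it tells you which groups to look at more closely.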
You also need to “see” the important things. You need to almost literally “see” the valuable things in the data. (And you need to be able to communicate those findings to others.)
To see those insights yourself, you need to use data visualization. To communicate them to others, you commonly use visualizations like bar charts, line charts, scatterplots, and other visualizations.
What I’m emphasizing is that data analysis – whether for a machine learning project or data exploration, or reporting – is important, and it’s really just about using data wrangling and data visualization in particular ways.
But recall what I wrote earlier in this blog post: for data wrangling, R is slightly better than Python. And for data visualization, R’s ggplot2 is quite a bit better than Python.
What I’m driving at is that because R is better at data wrangling and data visualization, it is also a superior tool for data analysis.
In sum: if I personally need to analyze a new dataset, I’d rather do it with R than Python.
Where R is not the best
As I just discussed, R is great at a lot of things. It’s almost certainly the best for producing static, non-interactive data visualizations. It’s marginally better than Python at data manipulation. And because of these things, R is arguably the best programming language for data analysis as well.