Gain clear insights into your data and solve real-world data science problems with R – from data munging to modeling and visualization
Preface
R has become the lingua franca of statistical analysis, and it is already heavily
used in many industries beyond the academic sector, where it originated
more than 20 years ago. Nowadays, more and more businesses are adopting R in
production, and it has become one of the most commonly used tools by data analysts
and scientists, providing easy access to thousands of user-contributed packages.
Mastering Data Analysis with R will help you get familiar with this open source
ecosystem and will also provide some statistical background, although with only a
minor focus on mathematical questions. We will primarily focus on how to get things
done practically with R.
As data scientists spend most of their time fetching, cleaning, and restructuring data,
most of the first hands-on examples given here concentrate on loading data from
files, databases, and online sources.
Then, the book changes its focus to restructuring and cleansing data, still without performing actual data analysis. The later chapters describe special data types, followed by classical statistical models and some machine learning algorithms.
What this book covers
Chapter 1, Hello, Data!, starts with the very first important step in every data-related
task: loading data from text files and databases. This chapter covers some problems
of loading larger amounts of data into R using improved CSV parsers, pre-filtering
data, and comparing support for various database backends.
Chapter 2, Getting Data from the Web, extends your knowledge on importing data with
packages designed to communicate with Web services and APIs, shows how to scrape
and extract data from home pages, and gives a general overview of dealing with XML
and JSON data formats.
Chapter 3, Filtering and Summarizing Data, continues with the basics of data processing
by introducing several ways of filtering and aggregating data, with
a performance and syntax comparison of the deservedly popular data.table and
dplyr packages.
Chapter 4, Restructuring Data, covers more complex data transformations, such as
applying functions on subsets of a dataset, merging data, and transforming to and
from long and wide table formats, to perfectly fit your source data with your desired
data workflow.
Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth), is the first
chapter that deals with real statistical models, and it introduces the concepts of
regression and models in general. This short chapter explains how to test the
assumptions of a model and interpret the results via building a linear multivariate
regression model on a real-life dataset.
Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth),
builds on the previous chapter, but covers the problems of non-linear associations
of predictor variables and provides further examples on generalized linear models,
such as logistic and Poisson regression.
Chapter 7, Unstructured Data, introduces new data types. These might not include
any information in a structured way. Here, you learn how to use statistical methods
to process such unstructured data through some hands-on examples of text mining
algorithms, and how to visualize the results.
Chapter 8, Polishing Data, covers another common issue with raw data sources. Most
of the time, data scientists handle dirty-data problems, such as cleansing data
of errors, outliers, and other anomalies. It is also very important
to impute or minimize the effects of missing values.
Chapter 9, From Big to Smaller Data, assumes that your data is already loaded, clean,
and transformed into the right format. Now you can start analyzing the usually high
number of variables, to which end we cover some statistical methods on dimension
reduction and other data transformations on continuous variables, such as principal
component analysis, factor analysis, and multidimensional scaling.
Chapter 10, Classification and Clustering, discusses several ways of grouping
observations in a sample using supervised and unsupervised statistical and machine
learning methods, such as hierarchical and k-means clustering, latent class models,
discriminant analysis, logistic regression and the k-nearest neighbors algorithm, and
classification and regression trees.
Chapter 11, A Social Network Analysis of the R Ecosystem, concentrates on a special data
structure and introduces the basic concept and visualization techniques of network
analysis, with a special focus on the igraph package.
Chapter 12, Analyzing a Time Series, shows you how to handle time-date objects
and analyze related values by smoothing, seasonal decomposition, and ARIMA,
including some forecasting and outlier detection as well.
Chapter 13, Data around Us, covers another important dimension of data, with a
primary focus on visualizing spatial data with thematic, interactive, contour, and
Voronoi maps.
Chapter 14, Analyzing the R Community, provides a more complete case study that
combines many different methods from the previous chapters to highlight what you
have learned in this book and what kind of questions and problems you might face
in future projects.
Appendix, References, gives references to the R packages used and some further
suggested readings for each of the aforementioned chapters.
What you need for this book
All the code examples provided in this book should be run in the R console,
which needs to be installed on your computer. You can download the software
for free and find the installation instructions for all major operating systems at
http://r-project.org.
Although we will not cover advanced topics, such as how to use R in Integrated
Development Environments (IDE), there are awesome plugins and extensions
for Emacs, Eclipse, vi, and Notepad++, besides other editors. Also, we highly
recommend that you try RStudio, which is a free and open source IDE dedicated
to R, at https://www.rstudio.com/products/RStudio.
Besides a working R installation, we will also use some user-contributed R packages. These can easily be installed from the Comprehensive R Archive Network (CRAN) in most cases. The sources of the required packages and the versions used to produce the output in this book are listed in Appendix, References. To install a package from CRAN, you will need an Internet connection. To download the binary files or sources, use the install.packages command in the R console, like this:
> install.packages('pander')
Some packages mentioned in this book are not (yet) available on CRAN, but may be
installed from Bitbucket or GitHub. These packages can be installed via the
install_bitbucket and the install_github functions from the devtools package. Windows
users should first install Rtools from https://cran.r-project.org/bin/windows/Rtools.
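As a rough illustration of the paragraph above, the following sketch wraps the two devtools functions in a small helper. The helper name install_dev_package and the 'username/repository' value are placeholders made up for this example, not names used in the book, and the format check only accepts the simplest form of a repository name:

```r
# Hypothetical helper (not from the book): install a package from GitHub or
# Bitbucket with devtools, given a repository in 'username/repository' form.
install_dev_package <- function(repo, host = c('github', 'bitbucket')) {
  host <- match.arg(host)
  # Only the plain 'username/repository' form is accepted in this sketch
  if (!grepl('^[^/]+/[^/]+$', repo)) {
    stop('repo should look like "username/repository"')
  }
  # devtools itself is available on CRAN
  if (!requireNamespace('devtools', quietly = TRUE)) {
    install.packages('devtools')
  }
  switch(host,
         github    = devtools::install_github(repo),
         bitbucket = devtools::install_bitbucket(repo))
}

# Example call, left commented out because 'username/repository' is only a
# placeholder and an Internet connection would be needed:
# install_dev_package('username/repository')
```

Calling the helper with a real repository name downloads and builds the package from source, which is why Windows users need Rtools first.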
After installation, the package must be loaded into the current R session before
you can start using it. All the required packages are listed in the appendix, but the
code examples also include the related R command for each package at the first
occurrence in each chapter:
> library(pander)
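The install and load steps can also be combined so that a package is only downloaded when it is missing. This is a minimal sketch, not the approach taken in the book; the function names missing_packages and load_or_install are invented for this illustration:

```r
# Return the names in `pkgs` that cannot be found in the current library paths
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install `pkg` from CRAN only if it is missing, then attach it to the session
load_or_install <- function(pkg) {
  if (length(missing_packages(pkg)) > 0) {
    install.packages(pkg)  # needs an Internet connection
  }
  library(pkg, character.only = TRUE)
}

# 'stats' ships with R, so this call never triggers a download
load_or_install('stats')

# For a CRAN package such as pander, the first call would install it:
# load_or_install('pander')
```

The character.only = TRUE argument is needed because the package name is passed as a string stored in a variable rather than typed literally.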
We highly recommend downloading the code example files of this book (refer to the Downloading the example code section) so that you can easily copy and paste the commands in the R console without the R prompt shown in the printed version of the examples and output in the book.
If you have no experience with R, you should start with some free introductory articles and manuals from the R home page, and a short list of suggested materials is also available in the appendix of this book.
Who this book is for
If you are a data scientist, engineer, analyst, or R developer who wants to explore
and optimize your use of R's advanced features and tools, then this is the book for
you. Basic knowledge of R is required, along with an understanding of database logic,
although the book can get you up and running quickly by providing references to
introductory materials.