R is a statistical and data mining package consisting of a
programming language and a graphics system. It is used throughout this
book to illustrate how to do data mining. In the following sections
of this chapter we introduce the basics of R. Many examples are
provided, and you can readily try them yourself to facilitate
learning. You will also find many examples on the R-help mailing list
at https://stat.ethz.ch/mailman/listinfo/r-help. As an advocate
of learning by example, and motivated by the programming paradigm of
"programming by example" (Cypher, 1993), my
intention is that you will be able to replicate the examples from the
book, and then fine tune them to suit your own needs.
R is a language, and the basic modus operandi is to write sentences
expressed in that language. After a while you will want to do more
than issue single, simple commands (sentences), and instead write
paragraphs and full novels in this language! R script
files (often with the .R filename
extension) are the place to write these longer programs. You can then re-run
your scripts to transform, at will and automatically, your source data
into information and knowledge.
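As a minimal sketch of what such a script might look like, consider the following. The file name summarise.R, the sales dataset, and its column names are all invented for illustration and do not come from the book's own examples.

```r
# summarise.R: a hypothetical script name, used only for illustration.
# Re-running the script repeats the same transformation on the source data.

# Illustrative source data, invented for this sketch.
sales <- data.frame(region = c("north", "south", "north", "south"),
                    amount = c(120, 80, 100, 60))

# Transform the raw records into a per-region summary.
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

print(totals)
```

Saved to a file, a script like this can be re-run from within R using source("summarise.R"), or from a command line using Rscript summarise.R.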
This chapter begins with an overview of some of the key advantages
(and disadvantages) of using R and continues with a guide to
interacting with R. The recommended interface is through the
powerful Emacs editor, augmented with the ESS package, under either
GNU/Linux or MS/Windows. This is a personal preference and you may
prefer some of the alternatives we discuss.
There are graphical user interface (GUI) tools available for accessing
R, but they are not in general as feature rich as those provided by
commercial data mining tools. This is both a disadvantage and an
advantage! For R this results in a steeper learning curve, but once
you are comfortable with R, performing operations over the same or similar datasets
becomes very easy using its programming language interface.
Let's start with some of the advantages of using R:
- R is licensed under the GNU General Public License, with
Copyright held by The R Foundation for Statistical
Computing. Thus, it is Free Open Source Software, freely available,
so that anyone can download and install the software, modify
it, or look at the code behind it
to learn how things can be done. Indeed, anyone is welcome
to provide bug fixes, code enhancements, and new packages, and the
wealth of quality packages available for R is a testament to this
approach to software development.
- R probably has the most complete collection of statistical
functions of any statistical or data mining package. New technology
and ideas often appear first in R.
- The graphic capabilities of R are outstanding, providing a
fully programmable graphics language which surpasses most other
statistical and graphical packages.
- A very active email list, with some of the world's leading
statisticians actively responding, is available for anyone to join.
Questions are quickly answered and the archive provides a wealth of
user solutions and examples. Be sure to read the Posting Guide first.
- Being open source, the R source code is peer reviewed, and
anyone is welcome to review it and suggest improvements. Bugs are
fixed very quickly. Consequently, R is a rock solid product. New
packages provided with R do go through a life cycle, often
beginning as somewhat lower quality tools, but usually quickly
evolving into top quality products.
- R plays well with many other tools, importing data, for
example, from CSV files, SAS, and SPSS, or directly from MS/Excel,
MS/Access, Oracle, and MySQL. It can also produce graphics output in
PDF, JPG, PNG, and SVG formats, and table output for LaTeX and
HTML.
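The import and export points above can be sketched in a few lines using only functions from a standard R installation. The data values and the temporary file names here are made up for illustration; a real analysis would of course read an existing file.

```r
# Write a small illustrative dataset to a temporary CSV file, then read it back.
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = c(2, 4, 6, 8, 10)),
          csv_file, row.names = FALSE)
ds <- read.csv(csv_file)

# Send a scatter plot to a PDF file rather than to the screen.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
plot(ds$x, ds$y, main = "Illustrative plot")
dev.off()   # close the graphics device so the file is written
```

Other graphics devices, such as png(), are used in the same way: open the device, draw the plot, and close the device with dev.off().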
Whilst the advantages might flow from the pen with a great deal of
enthusiasm, it is useful to note some of the disadvantages or
weaknesses of R, even if they are perhaps transitory!
- R is not so easy to use for the novice. There is no simple-to-use
graphical user interface (GUI) that encompasses a nice data
processing view with point and click graphics as in commercial and
free offerings including Clementine (see Chapter 30),
SAS/Enterprise Miner (see Chapter ), and even Weka
(see Chapter 28).
- Documentation is sometimes patchy. Whilst there are extensive
documents online and available in books and throughout the
Internet, the documentation can sometimes be terse and even impenetrable to the
non-statistician. On the other hand, for example, SAS has extensive,
self-contained, and often well explained, documentation, readily
available to the user. Nonetheless, users do comment that the R
documentation is to the point and easy to consult.
- The quality of some packages is less than perfect, although if a
package is useful to many people, it will quickly evolve into a very
robust product through collaborative effort.
- There is no one to complain to if something doesn't work - at
least no one who has a financial interest in keeping you, the user,
as a satisfied customer. Organisations are quite happy to pay major
premiums for that apparent peace of mind! Nonetheless, problems are
usually dealt with quickly on the mailing list, and bugs disappear
with lightning speed.
The remaining sections of this chapter can generally be skipped when
reading through the book, but provide a basic reference guide for
using R, and in particular for loading and manipulating data,
as well as for some of its programming capabilities. While
Chapter 2 deals in detail with creating data in R, we
introduce some of the basics here. The most basic needs include
creating simple datasets, and being familiar with the basic data types
and programming concepts, and how to get help.
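As a small taste of these basics, the following sketch builds a simple dataset from vectors of the basic data types and shows how to ask for help. The variable names and values are invented for illustration.

```r
# Basic data types held in vectors.
ages   <- c(23, 35, 41)           # numeric
people <- c("Ann", "Bob", "Cy")   # character
member <- c(TRUE, FALSE, TRUE)    # logical

class(ages)      # report the type of a vector

# Combine the vectors into a simple dataset (a data frame).
ds <- data.frame(name = people, age = ages, member = member)
str(ds)          # compact description of the dataset's structure
summary(ds$age)  # basic summary statistics for one column

# Getting help (uncomment in an interactive session):
# help(mean)                 # documentation for a specific function
# help.search("regression")  # search the documentation by keyword
```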
Copyright © 2004-2005
Brought to you by Togaware.