Exploring Data Using Graphics And Visualization.
In this assignment you will use the churn data: churn_data.txt
Read the data into a data frame using the function read.csv() with the options shown below.
Assume that you saved the file churn_data.txt in the C:/Datasets folder. Then you can read the file into a data frame as follows:
churnData <- read.csv("C:/Datasets/churn_data.txt", stringsAsFactors = FALSE, header = TRUE)
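As a minimal sketch of what read.csv() does with these options, the snippet below writes a tiny made-up file to a temporary location and reads it back. The file contents and the name demo are hypothetical and only stand in for churn_data.txt.

```r
# Write a tiny, made-up comma-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("State,Night.Charge", "CA,9.5", "NY,11.2"), tmp)

# header = TRUE: the first line holds the column names
# stringsAsFactors = FALSE: text columns stay character vectors
demo <- read.csv(tmp, stringsAsFactors = FALSE, header = TRUE)

str(demo)   # shows column types and the first values
```

Calling str() right after reading is a quick way to confirm that each column was parsed with the type you expect.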
A) Print the name of the columns.
Hint: colnames() function.
B) Print the number of rows and columns
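The calls for parts A and B can be sketched on a small stand-in data frame. The data frame toy and its values are made up for illustration; with the real data you would pass churnData instead.

```r
# Toy stand-in for churnData; the values are hypothetical
toy <- data.frame(
  State = c("CA", "NY", "CA", "TX"),
  CustServ.Calls = c(1, 4, 0, 2),
  Churn = c("False", "True", "False", "False"),
  stringsAsFactors = FALSE
)

colnames(toy)  # names of the columns
nrow(toy)      # number of rows
ncol(toy)      # number of columns
dim(toy)       # rows and columns in one call
```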
C) Count the number of calls per state.
Hint: table() function.
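A short sketch of the table() hint, using made-up state codes. Note that table() counts how many rows carry each value, so on the real data it reports the number of records per state.

```r
# Hypothetical state codes standing in for churnData$State
toy <- data.frame(
  State = c("CA", "NY", "CA", "TX", "NY", "CA"),
  stringsAsFactors = FALSE
)

state_counts <- table(toy$State)       # one count per distinct state
state_counts
sort(state_counts, decreasing = TRUE)  # most frequent state first
```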
D) Find the mean, median, standard deviation, and variance of the nightly charges (the column Night.Charge in the data).
The R functions to be used are mean(), median(), sd(), var().
E) Find the maximum and minimum values of international charges (Intl.Charge), customer service calls (CustServ.Calls), and daily charges (Day.Charge).
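The functions named in parts D and E all take a numeric vector. Here is a sketch on a made-up vector; with the real data you would pass a column such as churnData$Night.Charge.

```r
# Hypothetical nightly charges standing in for churnData$Night.Charge
night_charge <- c(9.5, 11.2, 7.3, 10.0, 8.0)

stats <- list(
  mean     = mean(night_charge),
  median   = median(night_charge),
  sd       = sd(night_charge),    # sample standard deviation
  variance = var(night_charge),   # sample variance, equal to sd squared
  min      = min(night_charge),
  max      = max(night_charge)
)
stats

range(night_charge)  # minimum and maximum in one call
```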
F) Use the summary() function to print information about the distribution of the following features:
"Eve.Charge" "Night.Mins" "Night.Calls" "Night.Charge" "Intl.Mins" "Intl.Calls"
What are the min and max values printed by the summary() function for these features?
Check textbook page 34 for a sample.
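As a rough illustration of part F, summary() can be applied to several columns at once or to a single column. The data frame toy and its values are hypothetical, and only two of the listed features are included to keep the sketch short.

```r
# Toy stand-in with two of the listed features; values are made up
toy <- data.frame(
  Eve.Charge = c(16.8, 20.1, 12.4, 25.0),
  Night.Mins = c(244.7, 254.4, 162.6, 196.9)
)

features <- c("Eve.Charge", "Night.Mins")
summary(toy[, features])     # one block of statistics per column

s <- summary(toy$Eve.Charge) # named vector for a single column
s[["Min."]]                  # minimum reported by summary()
s[["Max."]]                  # maximum reported by summary()
```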
G) Use the unique() function to print the distinct values of the following columns:
State, Area.Code, and Churn.
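A small sketch of unique() on hypothetical values. Checking unique(churnData$Churn) on the real data is also a good way to learn how churn is encoded (for example as logical values or as text) before writing any filter on it.

```r
# Toy stand-in; values are made up
toy <- data.frame(
  State = c("CA", "NY", "CA", "NY"),
  Area.Code = c(415, 408, 415, 510),
  Churn = c("False", "True", "False", "False"),
  stringsAsFactors = FALSE
)

unique(toy$State)      # distinct values of one column

# Distinct values of several columns at once, returned as a list
distinct_values <- lapply(toy[, c("State", "Area.Code", "Churn")], unique)
distinct_values
```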
H) Extract the subset of data for the churned customers (i.e., Churn=True). How many rows are in the subset?
Hint: Use subset() function. Check lecture notes and textbook for samples.
I) Extract the subset of data for customers who made more than 3 customer service calls (CustServ.Calls). How many rows are in the subset?
J) Extract the subset of churned customers with no international plan (Int.l.Plan) and no voice mail plan (VMail.Plan). How many rows are in the subset?
K) Extract the data for customers from California (i.e., State is CA) who did not churn but made more than 2 customer service calls.
L) What is the mean of customer service calls for the customers that did not churn (i.e., Churn=False)?
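Parts H through L all follow the same subset() pattern, sketched below on a toy data frame. The values are made up, and the sketch assumes that Churn is stored as the text "True"/"False" and the plan columns as "yes"/"no"; the real file may encode these differently, so adapt the comparisons accordingly.

```r
# Toy stand-in for churnData; values and encodings are assumptions
toy <- data.frame(
  State = c("CA", "CA", "NY", "TX", "CA"),
  Int.l.Plan = c("no", "yes", "no", "no", "no"),
  VMail.Plan = c("no", "no", "no", "yes", "yes"),
  CustServ.Calls = c(3, 1, 5, 4, 0),
  Churn = c("False", "True", "True", "False", "False"),
  stringsAsFactors = FALSE
)

# H) churned customers
churned <- subset(toy, Churn == "True")
nrow(churned)

# I) more than 3 customer service calls
many_calls <- subset(toy, CustServ.Calls > 3)
nrow(many_calls)

# J) churned, no international plan, no voice mail plan
churned_no_plans <- subset(
  toy, Churn == "True" & Int.l.Plan == "no" & VMail.Plan == "no"
)
nrow(churned_no_plans)

# K) California customers who stayed but made more than 2 calls
ca_stayed <- subset(toy, State == "CA" & Churn == "False" & CustServ.Calls > 2)

# L) mean customer service calls among customers who did not churn
mean_calls_stayed <- mean(subset(toy, Churn == "False")$CustServ.Calls)
mean_calls_stayed
```

Conditions are combined with the element-wise & operator, not &&, because each condition is evaluated for every row.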
Question 2 (related to the above)
In this assignment, we will explore the churn data using graphics and visualization. One of the primary reasons for performing exploratory data analysis (EDA) is to investigate the variables, examine the distributions of the categorical variables, look at the histograms of the numeric variables, and explore the relationships among sets of variables.
Although we are not going to develop any models for this project, in a real-world project our task is to identify patterns in the data that will help to reduce the proportion of churners.
We will use the same data set we had in the Week 2 assignment:
Data file: churn_data.txt
All graphics in this assignment have to be plotted using the ggplot2 library. So, you need to install the ggplot2 library, for example with install.packages("ggplot2").
Before using any functions from a library, you need to load it into your R code, for example with library(ggplot2).
Here is how you can read data into a data frame named churnData:
churnData <- read.csv(filePath, stringsAsFactors = FALSE, header = TRUE)
where filePath is the location of the churn_data.txt file. For example, if you saved the file in C:/tmp, then you should use C:/tmp/churn_data.txt
The variables in the file churn_data.txt are
State: Categorical, for the 50 states and the District of Columbia.
Account length: Integer-valued, how long the account has been active.
Area code: Categorical
Phone number: Essentially a surrogate for customer ID.
International plan: Dichotomous categorical, yes or no.
Voice mail plan: Dichotomous categorical, yes or no.
Number of voice mail messages: Integer-valued.
Total day minutes: Continuous, minutes customer used service during the day.
Total day calls: Integer-valued.
Total day charge: Continuous, perhaps based on above two variables.
Total eve minutes: Continuous, minutes customer used service during the evening.
Total eve calls: Integer-valued.
Total eve charge: Continuous, perhaps based on above two variables.
Total night minutes: Continuous, minutes customer used service during the night.
Total night calls: Integer-valued.
Total night charge: Continuous, perhaps based on above two variables.
Total international minutes: Continuous, minutes customer used service to make international calls.
Total international calls: Integer-valued.
Total international charge: Continuous, perhaps based on above two variables.
Number of calls to customer service: Integer-valued.
Churn: Target. Indicator of whether the customer has left the company (true or false).
Part 1. Bar Charts
A bar chart is a histogram for discrete data: it records the frequency of every value of a categorical variable.
1.) Vertical Bar Charts
Plot the bar charts of State, Area.Code, Int.l.Plan, VMail.Plan, CustServ.Calls, and Churn.
Use the theme() function to change the text size, location, color, etc. (An example is given in the textbook on page 61.)
The following is the bar chart for State. As an example, the x-axis labels are bold and rotated 90 degrees, which can be set in the theme() function using
axis.text.x = element_text(face="bold", angle=90, vjust=0.5, size=11).
Similarly, the parameter colour="#990000" is used for the color of the x-axis title. So, the following options for axis.title.x and axis.text.x in the theme() function display the title and text of the x-axis as shown in the figure below:
axis.title.x = element_text(face="bold", colour="#990000", size=12), axis.text.x = element_text(face="bold", angle=90, vjust=0.5, size=11)
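Put together, a vertical bar chart with these theme() options might look like the sketch below. The data frame toy is a made-up stand-in for churnData, and the plot is only one possible way to assemble the pieces described above.

```r
library(ggplot2)

# Toy stand-in for churnData; the state codes are made up
toy <- data.frame(
  State = c("CA", "NY", "CA", "TX", "NY", "CA", "WA"),
  stringsAsFactors = FALSE
)

# geom_bar() counts the rows for each distinct value of State
p <- ggplot(toy, aes(x = State)) +
  geom_bar() +
  theme(
    axis.title.x = element_text(face = "bold", colour = "#990000", size = 12),
    axis.text.x = element_text(face = "bold", angle = 90, vjust = 0.5, size = 11)
  )

print(p)
```

The same pattern applies to the other columns by changing the variable mapped to x. A numeric column such as CustServ.Calls can be wrapped in factor() so that each value is treated as its own category.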
2.) Horizontal Bar Charts
Create the horizontal bar chart of CustServ.Calls.
Hint: Textbook page 49.
3.) Horizontal Bar Charts with Sorted Categories
Create a horizontal bar chart of CustServ.Calls in which the categories are sorted by their number of occurrences.
Hint: Textbook pages 50-51
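One way to sketch the horizontal and sorted variants is with coord_flip() and reorder(). This is not necessarily the approach the textbook takes, and the data frame toy and its values are hypothetical.

```r
library(ggplot2)

# Toy stand-in for churnData$CustServ.Calls; values are made up
toy <- data.frame(CustServ.Calls = c(1, 1, 1, 0, 2, 2, 3, 1, 0, 4))

# Horizontal bar chart: treat the call counts as categories, then flip the axes
p_horizontal <- ggplot(toy, aes(x = factor(CustServ.Calls))) +
  geom_bar() +
  coord_flip() +
  xlab("CustServ.Calls")

# Sorted version: count each category first, then order the factor
# levels by their frequency before plotting
counts <- as.data.frame(table(Calls = toy$CustServ.Calls))
counts$Calls <- reorder(counts$Calls, counts$Freq)

p_sorted <- ggplot(counts, aes(x = Calls, y = Freq)) +
  geom_col() +
  coord_flip() +
  xlab("CustServ.Calls") +
  ylab("count")

print(p_horizontal)
print(p_sorted)
```

ggplot2 draws categories in factor-level order, so reordering the levels is what sorts the bars.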
Part 2. Histograms and Density Plots
The histogram and the density plot are two visualizations that help you quickly examine the distribution of a numerical variable.
A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket. You can think of a density plot as a "continuous histogram" of a variable, except the area under the density plot is equal to 1.
1.) Plot the histograms of Account.Length, VMail.Message, Day.Mins, and Intl.Calls.
Based on the histograms, comment on whether any of them have outliers, are close to the normal distribution, are multi-modal, or are skewed.
The histogram for Account.Length is shown below:
2.) Plot the density plots of Account.Length, VMail.Message, Day.Mins, and Intl.Calls.
Based on the density plots, comment on whether any of them have outliers, are close to the normal distribution, are multi-modal, or are skewed.
As a sample, the density plot for VMail.Message is shown below:
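The two plot types can be sketched with geom_histogram() and geom_density(). The data here is simulated and only stands in for churnData$Account.Length; the bin width of 10 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Simulated stand-in for churnData$Account.Length (not real data)
set.seed(1)
toy <- data.frame(Account.Length = round(rnorm(200, mean = 100, sd = 40)))

# Histogram: counts of observations in fixed-width bins
p_hist <- ggplot(toy, aes(x = Account.Length)) +
  geom_histogram(binwidth = 10, fill = "gray", colour = "black")

# Density plot: a smoothed estimate of the same distribution
p_dens <- ggplot(toy, aes(x = Account.Length)) +
  geom_density()

print(p_hist)
print(p_dens)
```

Trying a few different binwidth values is worthwhile, since a bin that is too wide can hide a second mode and one that is too narrow makes the plot noisy.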
Part 3. Scatter Plots
In addition to examining variables in isolation, you’ll often want to look at the relationship between two variables.
Plot the scatter plots for the pairs Eve.Mins vs. Day.Mins, Day.Mins vs. Day.Charge, Eve.Mins vs. Eve.Charge, and Day.Mins vs. Day.Calls.
Based on the plots, are there any relationships between the pair of features plotted?
The scatter plot of Eve.Mins vs Day.Mins is given below:
For the scatter plots above, add color to distinguish churn and no-churn data points. Simply add aes(color=Churn) to the geom_point() function as shown below:
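A minimal sketch of a scatter plot colored by churn follows. The data is simulated, the flat rate used to derive Day.Charge is a hypothetical value chosen only so that the two variables are related, and Churn is assumed to be stored as text.

```r
library(ggplot2)

# Simulated stand-in for churnData (not real data)
set.seed(1)
n <- 100
day_mins <- runif(n, min = 0, max = 350)
toy <- data.frame(
  Day.Mins = day_mins,
  Day.Charge = 0.17 * day_mins,  # hypothetical flat rate per minute
  Churn = sample(c("False", "True"), n, replace = TRUE, prob = c(0.85, 0.15)),
  stringsAsFactors = FALSE
)

# Mapping Churn to color inside geom_point() colors each point by group
p <- ggplot(toy, aes(x = Day.Mins, y = Day.Charge)) +
  geom_point(aes(color = Churn))

print(p)
```

When one variable is computed directly from the other, as in this toy data, the points fall on a straight line. That is the kind of relationship the question asks you to look for.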
Part 4. Box Plots
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum.
Plot the box plots of CustServ.Calls, Night.Calls, and Intl.Charge by Churn.
Which of the features have outliers? (Can you spot them in the box plot?)
What is the median of Night.Calls for customers that did not churn? (from the box plot)
The following is the box plot of CustServ.Calls.
Hint: You can find detailed information and samples of box plots at
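A box plot of one variable split by Churn can be sketched with geom_boxplot(). The toy values below are made up, and the value 9 is deliberately extreme so that it is drawn as a separate outlier point. The medians read off the plot can be cross-checked with tapply().

```r
library(ggplot2)

# Toy stand-in for churnData; values are made up
toy <- data.frame(
  Churn = rep(c("False", "True"), each = 6),
  CustServ.Calls = c(0, 1, 1, 2, 2, 9,   # customers who stayed
                     1, 3, 4, 4, 5, 6),  # customers who churned
  stringsAsFactors = FALSE
)

# One box per Churn group
p <- ggplot(toy, aes(x = Churn, y = CustServ.Calls)) +
  geom_boxplot()

print(p)

# Numeric check of the medians shown as the middle line of each box
medians <- tapply(toy$CustServ.Calls, toy$Churn, median)
medians
```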
Part 5. Dodged and Stacked Bar Charts
A) Display a dodged bar chart of Int.l.Plan by Churn.
Hint: Textbook pages 60-61.
B) Display a stacked bar chart of CustServ.Calls and Churn.
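Dodged and stacked bar charts differ only in the position argument of geom_bar(). The sketch below uses a made-up data frame and assumes that Churn and Int.l.Plan are stored as text.

```r
library(ggplot2)

# Toy stand-in for churnData; values are made up
toy <- data.frame(
  Int.l.Plan = c("no", "no", "yes", "no", "yes", "no", "no", "yes"),
  CustServ.Calls = c(1, 2, 1, 4, 5, 0, 1, 2),
  Churn = c("False", "False", "True", "True", "True", "False", "False", "False"),
  stringsAsFactors = FALSE
)

# A) Dodged: bars for each Churn value are placed side by side
p_dodge <- ggplot(toy, aes(x = Int.l.Plan, fill = Churn)) +
  geom_bar(position = "dodge")

# B) Stacked: bars for each Churn value are stacked on top of each other
p_stack <- ggplot(toy, aes(x = factor(CustServ.Calls), fill = Churn)) +
  geom_bar(position = "stack") +
  xlab("CustServ.Calls")

print(p_dodge)
print(p_stack)
```

The dodged form makes it easier to compare group sizes directly. The stacked form shows the total per category together with its split.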