by Scott McCoy

15 chapters, 559 pages, 235 illustrations

Published August 2021

ISBN 978-1-943872-76-3

List price: $59.50

This book teaches your students how to use Python for data analysis. It starts by showing how to use Pandas for data analysis, Seaborn for data visualization, and JupyterLab as your IDE. It gives your students a thorough course in descriptive analysis and an introductory course in predictive analysis. And it ties all of the skills together by presenting 4 real-world case studies…political, environmental, social, and sports analytics. In short, no other book does anything like this.

The Canvas course file contains all the objectives, quizzes, assignments, and slides that you need to run an effective course. It only takes a few clicks to import it into the Canvas LMS. Then, you can customize it for your course.

“The text was perfect for my class. It provided a solid foundation for my students in using the Pandas and Seaborn libraries. I really appreciated the four case studies. They were a big help for my students as they illustrated all phases of data analysis and visualization.”

To present the essential Python and data analysis skills in a manageable progression and at the right pace, this book is divided into 4 sections.

Section 1 consists of 4 chapters that help you get your students started with data analysis as quickly and effectively as possible.

Here, they’ll learn how to use JupyterLab and Jupyter Notebooks to organize and develop their analyses. They’ll learn how to use a subset of the Pandas module for data analysis and visualization. And they’ll learn how to use the Seaborn module to create professional data visualizations that can be used for presentations.

By the end of chapter 4, they’ll be able to start doing analyses of their own.

The 5 chapters in section 2 present the critical skills needed for data analysis. That includes:

- How to read data into a Pandas DataFrame (depending on your course needs, you can cover getting data from CSV or Excel files, zip files, databases, Stata files, and JSON)
- How to clean the data by dropping unneeded rows and columns and fixing missing values, data types, and outliers
- How to prepare the data by adding columns, modifying the data in columns, and combining DataFrames
- How to analyze the data by grouping and aggregating the data, using pivot tables, and more
- How to analyze time-series data by reindexing, downsampling, and working with rolling windows and running totals
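
The first four steps in this list can be sketched in a few lines of Pandas. This is only an illustration under stated assumptions: the CSV text, the column names, and the values are hypothetical stand-ins for a real data file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """state,year,acres
CA,2019,100
CA,2020,
OR,2019,40
OR,2020,60
OR,2020,60
"""

df = pd.read_csv(io.StringIO(csv_text))      # read the data into a DataFrame
df = df.drop_duplicates()                    # clean: drop duplicate rows
df["acres"] = df["acres"].fillna(0)          # clean: fill missing values
df["acres_k"] = df["acres"] / 1000           # prepare: add a derived column
totals = df.groupby("state")["acres"].sum()  # analyze: group and aggregate

# analyze: pivot table with one row per state and one column per year
wide = df.pivot_table(index="state", columns="year",
                      values="acres", aggfunc="sum")
```

Filling a missing value with 0 is just one possible choice here; depending on the data, dropping the row or using a different fill value may be more appropriate.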

By the end of this section, your students will have a solid set of the descriptive analysis skills that are needed in a wide variety of fields.
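
For the time-series skills in particular, a minimal sketch might look like the following. The series is made-up data with a daily datetime index, and the weekly frequency and window size are arbitrary choices for this example.

```python
import pandas as pd

# Made-up daily values with a datetime index
days = pd.date_range("2021-01-01", periods=10, freq="D")
daily = pd.Series(range(1, 11), index=days)

weekly = daily.resample("W").sum()        # downsample from days to weeks
rolling = daily.rolling(window=3).mean()  # 3-day rolling window average
running = daily.cumsum()                  # running total
```

The first two values of `rolling` are missing (NaN) because a 3-day window needs three observations before it can produce an average.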

Although a full treatment of predictive analysis is beyond the scope of any first course in data analysis, we believe that all students should at least understand the basic concepts. So that’s the goal of section 3, which consists of two chapters.

First, chapter 10 shows how to find the correlations between variables, how to use Scikit-learn to work with simple linear regression models, and how to use Seaborn to create and plot various types of linear regression models.
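
To give a sense of what this involves, here is a minimal sketch of finding a correlation and fitting a simple linear regression with Scikit-learn. The data is made up (price is defined as an exact linear function of square footage), so it only shows the mechanics, not a realistic result.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up housing-style data: price is exactly 100 * sqft + 5000
df = pd.DataFrame({"sqft": [1000, 1500, 2000, 2500, 3000]})
df["price"] = 100 * df["sqft"] + 5000

# Pearson r-value between the two columns
r = df["sqft"].corr(df["price"])

# Fit the model; X must be two-dimensional, y one-dimensional
model = LinearRegression()
model.fit(df[["sqft"]], df["price"])

# Predict the price for a new, hypothetical house
predicted = model.predict(pd.DataFrame({"sqft": [1800]}))
```

With real data, you would typically also split the rows into training and test sets and check the model's score before using it for predictions.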

Then, chapter 11 shows how to create and use multiple regression models, how to create and rescale dummy variables, and how to use Scikit-learn to select not only the right variables, but also the right *number* of variables for multiple regressions. These, of course, are the critical concepts and skills for doing an effective job of predictive analysis.
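
As an illustration of the dummy-variable idea, the sketch below turns a categorical column into a 0/1 column and uses it in a multiple regression. The car data, column names, and values are invented for this example, and the sketch does not cover rescaling or variable selection.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up car data with one numeric and one categorical variable
cars = pd.DataFrame({
    "enginesize": [100, 120, 150, 180, 200, 220],
    "fuel": ["gas", "diesel", "gas", "diesel", "gas", "diesel"],
    "price": [10000, 13000, 15000, 19000, 20000, 23000],
})

# Create dummy columns for the categorical variable; drop_first=True
# drops one category so the remaining columns aren't redundant
X = pd.get_dummies(cars[["enginesize", "fuel"]], columns=["fuel"],
                   drop_first=True, dtype=int)

# Fit a multiple regression and get its R-squared on the same data
model = LinearRegression().fit(X, cars["price"])
r2 = model.score(X, cars["price"])
```

Here, `X` ends up with two columns, `enginesize` and `fuel_gas`, where `fuel_gas` is 1 for gas cars and 0 for diesel cars.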

Section 4 presents 4 case studies that show how the skills in this book can be applied to real-world datasets:

- the polling data for the 2016 presidential election
- the US Forest Service data for forest fires
- the US social survey data for hundreds of polls
- the basketball shot location data for NBA player Stephen Curry

Frankly, you can’t master data analysis by working with toy datasets, and these case studies help ensure that your students will master data analysis at a professional level.

The book assumes that the students have some programming experience, the kind they would get from any introduction to programming course. Then, chapter 1 presents the minimal set of Python skills that are required for this book: how to import modules; how to call and chain methods; how to code lists, slices, tuples, and dictionaries; and how to continue statements over two lines. For the times when your students need to know more than that, they can use Murach’s Python Programming as a reference.
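
For a sense of scale, the Python skills listed above fit in a short snippet like this one. It is only a sketch with made-up names and values, not an excerpt from chapter 1.

```python
import pandas as pd  # import a module under a short alias

# A list, a slice of it, a tuple, and a dictionary
years = [2017, 2018, 2019, 2020]
recent = years[-2:]                      # slice: the last two items
point = (3, 4)                           # tuple: an immutable sequence
fires = {"CA": 120, "OR": 45, "WA": 30}  # dictionary (made-up values)

# Call and chain methods, continuing the statement over several lines
# by wrapping it in parentheses
top = (
    pd.Series(fires)
    .sort_values(ascending=False)
    .head(2)
)
```

After this runs, `top` holds the two largest values from the dictionary, indexed by their keys.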

The only software that’s needed for this book is the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more.

Appendixes A and B show how to download and install this distribution on both Windows and macOS systems. Then, chapter 1 shows how to get started with JupyterLab.

As we see it, this is the best primary text for any course in which the focus is on the use of Python for data analysis. But it is also the ideal supplementary text for a general course on data analysis because it shows how to use Python to apply the concepts and statistical methods to real-world datasets.

Like all of our books, this one has features that you won’t find in any competing book. Here are just three of them:

- All the material is presented in our unique paired-pages format. That means that each topic is presented in a two-page spread: the examples and reference material are on the righthand page, and the explanation is on the left. This is the ideal format for today’s fast-paced world…and students love it!
- This paired-pages format also helps your students do their assignments and prepare for tests by making it easy for them to (1) review what they’ve learned and (2) look up the details on how to apply their new skills.
- This book presents 4 full-fledged case studies in section 4 and uses them for illustration throughout the book. It also uses 4 other analyses to provide other examples whenever they’re needed. Frankly, you can’t learn real-world skills by working with toy applications, and no other book has complete analyses like ours.

“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”

—Bob L., Michigan

“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”

—Roxanne T., Student, Washington

“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”

—Charles Ferguson, Software Developer, Australia

“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”

—Hien Luu, SD Forum/Java SIG

“Your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”

—Alan Vogt, ETL Consultant, Massachusetts

“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”

—Posted at an online bookseller

On *Murach’s Python Programming*: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”

—Posted at an online bookseller

“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”

—Nolan Tamashiro, Developer

View the table of contents for this book in a PDF: Table of Contents (PDF)

What data analysis is

The five phases of data analysis and visualization

The IDEs for Python data analysis

How to install and import the Python modules for data analysis

How to call and chain methods

The coding basics for Python data analysis

How to start JupyterLab and work with a Notebook

How to edit and run the cells in a Notebook

How to use the Tab completion and tooltip features

How syntax and runtime errors work

How to use Markdown language

How to get reference information

How to split the screen between two Notebooks

How to use Magic Commands

The Polling case study

The Forest Fires case study

The Social Survey case study

The Sports Analytics case study

The DataFrame structure

Two ways to get data into a DataFrame

How to save and restore a DataFrame

How to display the data in a DataFrame

How to use the attributes of a DataFrame

How to use the info(), nunique(), and describe() methods

How to access columns

How to access rows

How to access a subset of rows and columns

Another way to access a subset of rows and columns

How to sort the data

How to use the statistical methods

How to use Python for column arithmetic

How to modify the string data in columns

How to use indexes

How to pivot the data

How to melt the data

How to group the data

How to aggregate the data

How to plot the data

The Python libraries for data visualization

Long vs. wide data for data visualization

How the Pandas plot() method works by default

The three basic parameters for the Pandas plot() method

How to create a line plot or an area plot

How to create a scatter plot

How to create a bar plot

How to create a histogram or a density plot

How to create a box plot or a pie plot

How to improve the appearance of a plot

How to work with subplots

How to use chaining to get the plots you want

The Seaborn methods for plotting

The general methods vs. the specific methods

How to use the basic Seaborn parameters

How to use the Seaborn parameters for working with subplots

How to set the title, x label, and y label

How to set the ticks, x limits, and y limits

How to set the background style

How to work with subplots

How to save a plot

How to create a line plot

How to create a scatter plot

How to create a bar plot

How to create a box plot

How to create a histogram

How to create a KDE or ECDF plot

How to enhance a distribution plot

How to use other Axes methods to enhance a plot

How to annotate a plot

How to set the color palette

How to enhance a plot that has subplots

How to customize the titles for subplots

How to set the size of a specific plot

Common data sources

How to find and select the data that you want

How to import data directly into a DataFrame

How to download a file to disk before importing it

How to work with a zip file on disk

How to run queries against a database

How to use a SQL query to import data into a DataFrame

How to get and explore the metadata of a Stata file

How to build DataFrames for the metadata and the data

How to download a JSON file to disk

How to open a JSON file in JupyterLab

How to drill down into the data

How to build a DataFrame for the data

A general plan for cleaning the data

What the info() method can tell you

What the unique values can tell you

What the value counts can tell you

How to drop rows based on conditions

How to drop duplicate rows

How to drop columns

How to rename columns

How to find missing values

How to drop rows with missing values

How to fill missing values

How to find dates and numbers that are imported as objects

How to convert date and time strings to the datetime data type

How to convert object columns to numeric data types

How to work with the category data type

How to replace invalid values and convert a column’s data type

How to fix data problems when you import the data

How to find outliers

How to fix outliers

How to work with datetime columns

How to work with string columns

How to work with numeric columns

How to add a summary column to a DataFrame

How to apply functions to rows or columns

How to apply user-defined functions

How lambda expressions work with DataFrames

How to apply lambda expressions

How to set and remove an index

How to unstack indexed data

How to join DataFrames with an inner join

How to join DataFrames with a left or outer join

How to merge DataFrames

How to concatenate DataFrames

What the warning is telling you

What to do when the warning is displayed

What to watch for when the warning isn’t displayed

How to melt columns to create long data

How to plot melted columns

How to group and apply a single aggregate method

How to work with a DataFrameGroupBy object

How to apply multiple aggregate methods

How to use the pivot() method

How to use the pivot_table() method

How to create bins of equal size

How to create bins with equal numbers of values

How to plot binned data

How to select the rows with the largest values

How to calculate the percent change

How to rank rows

How to find other methods for analysis

How to generate time periods

How to reindex with datetime indexes

How to reindex with a semi-month index

How a user-defined function can improve a datetime index

How reindexing with an improved index can improve plots

How to use the resample() method

How to use the label and closed parameters when you downsample

How downsampling can improve plots

The concept of rolling windows

How to create rolling windows

How to plot rolling window data

How to create running totals

How to plot running totals

Types of predictive models

Introduction to regression analysis

The Housing dataset

How to identify correlations with a scatter plot

How to identify correlations with a grid of scatter plots

How to identify correlations with r-values

How to identify correlations with a heatmap

A procedure for creating and using a regression model

The function and methods for linear regression models

How to create, validate, and use a linear regression model

How to plot the predicted data

How to plot the residuals

The lmplot() method and some of its parameters

How to plot a simple linear regression

How to plot a logistic regression

How to plot a polynomial regression

How to plot a lowess regression

How to use the residplot() method to plot the residuals

The Cars dataset

How to create a simple regression model

How to plot the residuals of a simple regression

How to create a multiple regression model

How to plot the residuals of a multiple regression

How to identify categorical variables

How to review categorical variables

How to create dummy variables

How to rescale the data and check the correlations

How to create a multiple regression that includes dummy variables

How to select the independent variables

How to test different combinations of variables

How to use Scikit-learn to select the variables

How to select the right number of variables

Import the modules that you will need

Get the data

Display the data

Examine the data

Drop columns and rows

Rename columns

Fix object types

Fix data

Take an early plot with Pandas

Save the DataFrame

Add columns for grouping and filtering

Create a new DataFrame in long form

Take an early plot of the long data with Seaborn

Add monthly bins to the DataFrame

Add an average percent column for each month

Save the wide and long DataFrames

Plot the national and swing state polls

Plot the voter types

Plot the last two months of polling

Plot the gap changes in selected states

Prepare the gap data for the last week of polling

Plot the gap data for the last week of polling

Prepare the weekly gap data for the swing states

Plot the weekly gap data for the swing states

Download and unzip the SQLite database

Connect and query the database

Import the data into a DataFrame

Examine the data

Improve the readability of the data

Drop unnecessary rows

Drop duplicate rows

Convert dates to datetime objects

Check for missing contain dates

Add fire_month and days_burning columns

Examine the contain_date and days_burning columns

Analyze the data for California

Two more plots for California fires

Rank the states by total acres burned

Prepare a DataFrame for total acres burned by year within state

Prepare a DataFrame for the top 4 states

Plot the acres burned total by year for the top 4 states

Review the 20 largest fires in California

Use GeoPandas to plot the California map

Use GeoPandas or Seaborn to plot the California fires on a map

Plot the fires in the continental United States

Download and unzip the zip file for the data

Build a DataFrame for the metadata

Use the codebook and read the data that you want

Prepare the data

Plot the data and reduce the number of categories

Plot the total counts of the responses

Convert the counts to percents and plot them

Search the codebook for small question sets

Read and review the work-life data

Plot the responses for the first question

Plot the responses for the second and third questions

Use the codebook to find related columns

Use the codebook to find follow-up questions

Select the columns for an expanded DataFrame

Bin the data for a column

Develop and test a first hypothesis

Develop and test a second hypothesis

Develop and test a third hypothesis

Get the data

Build the DataFrame

Locate and drop unneeded rows

Locate and drop unneeded columns

Convert the game_date column to datetime data

Add a column for the season

Add a column for the shot result

Add a column for points made for each shot

Add three summary columns

Plot the points per game by season

Plot the averages of shots, shots made, and points per game by season

Plot the shot locations for two games

Plot the shot locations for two seasons

Plot the shot density for one season

Plot the shot density for two seasons

How to install Anaconda

How to use the Anaconda Prompt

How to use the Anaconda Navigator

How to install the files for this book

How to make sure Anaconda is installed correctly

How to download the large data files for this book

How to install Anaconda

How to run conda commands

How to use the Anaconda Navigator

How to install the files for this book

How to make sure Anaconda is installed correctly

How to download the large data files for this book

In contrast to other college publishers, we don’t fill dozens of pages in our books with end-of-chapter activities that may never be used. Instead, we provide everything you need for an effective course in a download from our instructor’s website. Then, you decide which of these materials are right for your course.

Here's a summary of the instructor's materials for this book. For a detailed description in PDF, please read the Instructor's Summary.

- In the EOC activities for all of our books, you’ll find carefully designed exercises that (1) let your students practice what they’ve just learned and (2) help them apply what they’ve learned in new ways.
- Since our exercises start from Jupyter Notebooks, your students can focus on new skills and not waste time on the repetitive code that’s the same in all analyses.
- Students can download the solutions to the EOC exercises (as well as the exercise starts) for free from our retail website. We started providing the solutions for the professionals who use our books for self-training. But we’ve found that they keep students from giving up when they get stuck on a problem at midnight, and that the model solutions also help them refine their future work. But don’t worry! We provide additional projects and case studies that you can use for testing, and those solutions are available only to instructors (see below).

- Taken together, this unique system allows students to practice more…and learn more!…in much less time.

- Today, most textbooks include objectives, but they are often so poorly conceived that they are ignored by both students and instructors. In contrast, we provide objectives that describe the skills that the students should master, and mastery can be measured by the test banks, projects, and case studies that we provide. As a result, our objectives actually do facilitate learning.

- To test comprehension, we provide test banks in multiple formats, including Blackboard (which can be imported into Canvas and D2L Brightspace) and RTF (Word).
- Each test bank provides questions that are designed to test the skills described by the objectives for that chapter, and each test question is designed to test the skill described by one objective. This keeps the promise to the students that they will only be expected to have the skills that are described by the objectives.
- In the test banks, we use only multiple-choice test questions because they’re easy to score and they have the highest validity. That means the students with the best knowledge and skills will get the best scores. In contrast, matching and true/false questions have low validity, so we don’t use them.

- For each chapter, we also provide one or more projects that your students can do to practice the skills of that chapter. You can also use some of these projects as tests in a computer lab because your students should be able to finish them in an hour or two. That, of course, is the best way to test whether your students have the skills described by the objectives of the chapter.

- We also provide case studies for this book that require the skills of several chapters. Assigning these case studies is another way to make sure that your students have mastered the skills of the chapters. You’ll also find that you can easily modify these case studies if you want to make them more or less difficult.

- In our books, the figures on the righthand pages present all of the critical information, including screen shots, diagrams, tables, examples, and code. Then, we build our PowerPoint slides from the figures, which means that our slides let you review everything that’s presented in the book. This makes it easy to answer any questions that your students raise or review any skills that your students are having trouble with.
- The slides for each chapter start with the chapter objectives. That helps your students stay focused on what they’ll learn and be tested on.

- Our instructor’s materials also include the starting Notebooks and solutions for the projects and case studies. So what you end up with is a complete package for a powerful data analysis course.

On this page, we’ll be posting answers to the questions that come up most often about this book. So if you have any questions that you haven’t found answered here at our site, please email us. Thanks!

To view the corrections for this book in a PDF, just click on this link: View the corrections

Then, if you find any other errors, please email us so we can correct them in the next printing of the book. Thank you!