Customer Service 1-800-221-5528

Murach’s Python for Data Science (2nd Edition)

by Scott McCoy
15 chapters, 588 pages, 240 illustrations
Published April 2024
ISBN 978-1-943873-17-3
List price: $59.50

Murach’s Python for Data Science starts by covering everything your students need to hit the ground running when using Python for data science. First, it presents a crash course in using the Pandas and Seaborn libraries for data analysis and visualization. Then, it presents a thorough course in data analysis, including how to use the Scikit-learn library to create statistical models that make predictions. Finally, it presents four real-world case studies that tie all the coursework together.

Now available as a Canvas course!

The Canvas course file contains all the objectives, quizzes, assignments, and slides that you need to run an effective course. It only takes a few clicks to import it into the Canvas LMS. After that, you can customize it for your course.

“The text was perfect for my class. It provided a solid foundation for my students in using the Pandas and Seaborn libraries. I really appreciated the four case studies. They were a big help for my students as they illustrated all phases of data analysis and visualization.”

J. Jasperson – Texas A&M University

Book description

To present the essential Python skills for data science in a manageable progression and at the right pace, this book is divided into 4 sections.

Section 1: Get off to a fast start

This section gets your students started fast. First, they’ll learn how to use JupyterLab and Notebooks to organize and work with Python code for data science. Then, they’ll learn how to use the Pandas and Seaborn libraries for data analysis and visualization. By the end of this section, they’ll be able to start doing analyses of their own.

Section 2: The critical skills for success on the job

This section presents the descriptive analysis skills that are critical for success on the job. That includes how to:

  • Get data from CSV files, Excel files, JSON files, Stata files, and databases
  • Clean data by dropping unneeded rows and columns and fixing missing values, data types, and outliers
  • Prepare data by adding columns, modifying the data in columns, and combining data frames
  • Analyze data by grouping and aggregating the data, using pivot tables, and more
  • Analyze time-series data by reindexing, downsampling, and working with rolling windows and running totals
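
As a rough illustration of the time-series skills in the last bullet, here is a minimal sketch of downsampling, rolling windows, and running totals with Pandas. The daily sales data and the column names are made up for this example and aren't taken from the book.

```python
import pandas as pd

# Hypothetical daily sales data (the column name and values are assumptions).
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.DataFrame({"units": [10, 12, 9, 15, 11, 14]}, index=dates)

# Downsample from daily data to 2-day periods by summing each period.
two_day = sales.resample("2D").sum()

# Rolling window: the mean of the current row and the two rows before it.
sales["rolling_mean"] = sales["units"].rolling(window=3).mean()

# Running total: the cumulative sum of the units column.
sales["running_total"] = sales["units"].cumsum()

print(two_day)
print(sales)
```

The first two rolling_mean values are missing because a 3-row window isn't full until the third row.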

Section 3: An introduction to predictive analysis

This section presents the predictive analysis skills that your students need to create statistical models that make predictions. Although predictive analysis is a large topic that could be an entire course of its own, this section presents the concepts your students need to get started with it. More specifically, it shows your students how to use the Scikit-learn library to create linear regression models to predict numeric values.
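
To give a sense of what this looks like in practice, here is a minimal sketch of creating, validating, and using a linear regression model with Scikit-learn. The data is synthetic (an exact line, y = 3x + 5, chosen only for illustration), so it isn't one of the datasets used in the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 5 exactly.
x = np.arange(20, dtype=float).reshape(-1, 1)   # one independent variable
y = 3.0 * x.ravel() + 5.0                       # the dependent variable

# Hold out some rows so the model can be validated on data it hasn't seen.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Create the model and fit it to the training data.
model = LinearRegression()
model.fit(x_train, y_train)

# score() returns the R-squared value for the test data.
r2 = model.score(x_test, y_test)

# Use the model to predict a numeric value for a new x value.
predicted = model.predict(np.array([[25.0]]))

print(model.coef_[0], model.intercept_, r2, predicted[0])
```

Because the synthetic data is perfectly linear, the fitted slope and intercept should come out very close to 3 and 5. With real-world data, the R-squared value shows how much of the variation the model explains.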

Section 4: The case studies

This section presents four complete analyses that show how the skills in this book can be applied to real-world datasets:

  • Polling data for the 2016 presidential election
  • Wildfire data from the US Forest Service
  • US social survey data
  • Basketball shot data from the NBA (National Basketball Association)

These in-depth analyses make sure that your students master the professional skills they’re going to need.

The course prerequisites

The book assumes that the students have some programming experience, the kind they would get from any introduction to programming course. Then, chapter 1 presents the Python skills that are required for this book. If your students need to know more than that, they can refer to Murach’s Python Programming (2nd Edition).

What software your students need

The only software that’s needed for this book is the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more.

Appendixes A and B show how to download and install this distribution on both Windows and macOS systems. Then, chapter 1 shows how to get started with JupyterLab.

What courses this book can be used for

This book is an optimal primary text for any course on the use of Python for data science. But it also works well as a supplementary text for any course where students need to use Python to analyze large datasets.

Why your students will learn faster and better with our book

Like all our books, this book is designed to make it as easy as possible for your students to learn new skills faster and retain them better. Here are a few of those features:

  • All of the information is presented in paired pages, with the essential syntax, guidelines, and examples on the right page and clear explanations on the left page.
  • The paired-pages format is ideal for reference when your students need to refresh their memories about how to do something.
  • The four analyses presented in section 4 use real-world datasets.
  • The hundreds of short examples present usable code for tasks that your students are likely to need for their own analyses.
  • The exercises at the end of each chapter provide a way for your students to gain valuable hands-on experience without any extra busywork.

What's new in this edition

  • We changed the title from Data Analysis to Data Science because we think the new title better reflects the content of the book.
  • We updated this book to the latest data science libraries, including Pandas 2.2, Seaborn 0.13.2, and Scikit-learn 1.4.1. This resulted in many minor code changes to fix errors and silence deprecation warnings.
  • We updated appendixes A (Windows) and B (macOS) to show how to use Anaconda to create an environment that uses the exact versions of the libraries that we used in this book. If your students create and use this environment, you can be sure that all of the code in this book will work as described, even if new versions of the libraries become available.
  • We now provide all of the data files for this book from murach.com. As a result, you can be sure that these files will be available to your students.

A summary of some minor changes

  • We dropped coverage of the inplace parameter and consistently use assignment instead.
  • Seaborn plots no longer accept the ci parameter, so we now show how to use the errorbar parameter instead.
  • In figure 6-10 of chapter 6, we now show how to use the ffill() method instead of the fillna() method.
  • Pandas no longer automatically drops categorical columns when you call numeric aggregate functions. As a result, we now select the numeric data before calling aggregate functions.
  • Due to changes in the way Pandas groups data, we no longer drop unused categories when grouping by a categorical column, and we explain how the observed parameter works.
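
The Pandas-related changes in this list can be sketched roughly as follows. This is only an illustration under stated assumptions: the DataFrame, its column names, and its values are hypothetical and aren't taken from the book.

```python
import pandas as pd

# Hypothetical data (the column names and values are assumptions).
# The "north" category is defined but never used in the region column.
df = pd.DataFrame({
    "region": pd.Categorical(["east", "east", "west"],
                             categories=["east", "west", "north"]),
    "label": ["a", "b", "c"],
    "sales": [100.0, None, 300.0],
})

# Use assignment instead of the inplace parameter.
df = df.rename(columns={"sales": "amount"})

# Use ffill() to fill a missing value with the previous value,
# instead of passing a method argument to fillna().
df["amount"] = df["amount"].ffill()

# Select the numeric data before calling an aggregate method.
means = df.select_dtypes("number").mean()

# observed=False keeps the unused "north" category in the result;
# observed=True includes only the categories that appear in the data.
all_cats = df.groupby("region", observed=False)["amount"].sum()
used_cats = df.groupby("region", observed=True)["amount"].sum()

print(means)
print(all_cats)
print(used_cats)
```

Setting the observed parameter explicitly makes the intent clear and avoids depending on the default, which has been changing across Pandas versions.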

What people said about the first edition

“I really appreciated the four case studies. They were a big help for my students as they illustrated all phases of data analysis and visualization.”
— J. Jasperson – Texas A&M University

“In his first at-bat, Scott McCoy smashes this one out of the park! This book is not just informative, it is exciting.”
— Scott Spurlock, Software Engineer, Georgia

“Unlike some other books on data analysis with Python, the explanations of how to perform data analysis are thorough rather than terse or with no explanations.”
— Posted at an online bookseller

“This is a fun book for beginners and experienced data scientists.”
— Posted at an online bookseller

What people say about Murach books

“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”
—Bob L., Michigan

“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”
—Roxanne T., Student, Washington

“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”
—Charles Ferguson, Software Developer, Australia

“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”
—Hien Luu, SD Forum/Java SIG

“Your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”
—Alan Vogt, ETL Consultant, Massachusetts

“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”
—Posted at an online bookseller

On Murach’s Python Programming: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”
—Posted at an online bookseller

“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”
—Nolan Tamashiro, Developer

Table of contents

Section 1 Get off to a fast start

Chapter 1 Introduction to Python for data science

Introduction to data science

What data science is

The five phases of data analysis and visualization

The IDEs for Python data science

The Python skills that you need for data science

How to install and import the Python modules for data science

How to call and chain methods

The coding basics for Python data science

How to use JupyterLab as your IDE

How to start JupyterLab and work with a Notebook

How to edit and run the cells in a Notebook

How to use the Tab completion and tooltip features

How syntax and runtime errors work

How to use Markdown language

How to get reference information

Two more skills for working with JupyterLab

How to split the screen between two Notebooks

How to use Magic Commands

Introduction to the case studies

The Polling case study

The Forest Fires case study

The Social Survey case study

The Sports Analytics case study

Chapter 2 The Pandas essentials for data analysis

Introduction to the Pandas DataFrame

The DataFrame structure

Two ways to get data into a DataFrame

How to save and restore a DataFrame

How to examine the data

How to display the data in a DataFrame

How to use the attributes of a DataFrame

How to use the info(), nunique(), and describe() methods

How to access the columns and rows

How to access columns

How to access rows

How to access a subset of rows and columns

Another way to access a subset of rows and columns

How to work with the data

How to sort the data

How to use the statistical methods

How to use Python for column arithmetic

How to modify the string data in columns

How to shape the data

How to use indexes

How to pivot the data

How to melt the data

How to analyze the data

How to group the data

How to aggregate the data

How to plot the data

Chapter 3 The Pandas essentials for data visualization

Introduction to data visualization

The Python libraries for data visualization

Long vs. wide data for data visualization

How the Pandas plot() method works by default

The three basic parameters for the Pandas plot() method

How to create 8 types of plots

How to create a line plot or an area plot

How to create a scatter plot

How to create a bar plot

How to create a histogram or a density plot

How to create a box plot or a pie plot

How to enhance a plot

How to improve the appearance of a plot

How to work with subplots

How to use chaining to get the plots you want

Chapter 4 The Seaborn essentials for data visualization

Introduction to Seaborn

The Seaborn methods for plotting

The general methods vs. the specific methods

How to use the basic Seaborn parameters

How to use the Seaborn parameters for working with subplots

How to enhance and save plots

How to set the title, x label, and y label

How to set the ticks, x limits, and y limits

How to set the background style

How to work with subplots

How to save a plot

How to create relational plots

How to create a line plot

How to create a scatter plot

How to create categorical plots

How to create a bar plot

How to create a box plot

How to create distribution plots

How to create a histogram

How to create a KDE or ECDF plot

How to enhance a distribution plot

Other techniques for enhancing a plot

How to use other Axes methods to enhance a plot

How to annotate a plot

How to set the color palette

How to enhance a plot that has subplots

How to customize the titles for subplots

How to set the size of a specific plot

Section 2 The critical skills for success on the job

Chapter 5 How to get the data

How to find the data that you want to analyze

Common data sources

How to find and select the data that you want

How to import data into a DataFrame

How to import data directly into a DataFrame

How to download a file to disk before importing it

How to work with a zip file on disk

How to get database data into a DataFrame

How to run queries against a database

How to use a SQL query to import data into a DataFrame

How to work with a Stata file

How to get and explore the metadata of a Stata file

How to build DataFrames for the metadata and the data

How to work with a JSON file

How to download a JSON file to disk

How to open a JSON file in JupyterLab

How to drill down into the data

How to build a DataFrame for the data

Chapter 6 How to clean the data

Introduction to data cleaning

A general plan for cleaning the data

What the info() method can tell you

What the unique values can tell you

What the value counts can tell you

How to simplify the data

How to drop rows based on conditions

How to drop duplicate rows

How to drop columns

How to rename columns

How to find and fix missing values

How to find missing values

How to drop rows with missing values

How to fill missing values

How to fix data type problems

How to find dates and numbers that are imported as objects

How to convert date and time strings to the datetime data type

How to convert object columns to numeric data types

How to work with the category data type

How to replace invalid values and convert a column’s data type

How to fix data problems when you import the data

How to find and fix outliers

How to find outliers

How to fix outliers

Chapter 7 How to prepare the data

How to add and modify columns

How to work with datetime columns

How to work with string columns

How to work with numeric columns

How to add a summary column to a DataFrame

How to apply functions and lambda expressions

How to apply functions to rows or columns

How to apply user-defined functions

How lambda expressions work with DataFrames

How to apply lambda expressions

How to work with indexes

How to set and remove an index

How to unstack indexed data

How to combine DataFrames

How to join DataFrames with an inner join

How to join DataFrames with a left or outer join

How to merge DataFrames

How to concatenate DataFrames

The SettingWithCopyWarning

What the warning is telling you

How to handle the warning

Chapter 8 How to analyze the data

How to create and plot long data

How to melt columns to create long data

How to plot melted columns

How to group and aggregate the data

How to group and apply a single aggregate method

How to work with a DataFrameGroupBy object

How to apply multiple aggregate methods

How to create and use pivot tables

How to use the pivot() method

How to use the pivot_table() method

How to work with bins

How to create bins of equal size

How to create bins with equal numbers of unique values

How to plot binned data

More skills for data analysis

How to select the rows with the largest values

How to calculate the percent change

How to rank rows

How to find other methods for analysis

Chapter 9 How to analyze time-series data

How to reindex time-series data

How to generate time periods

How to reindex with datetime indexes

How to reindex with a semi-month index

How a user-defined function can improve a datetime index

How reindexing with an improved index can improve plots

How to resample time-series data

How to use the resample() method

How to use the label and closed parameters when you downsample

How downsampling can improve plots

How to work with rolling windows

The concept of rolling windows

How to create rolling windows

How to plot rolling window data

How to work with running totals

How to create running totals

How to plot running totals

Section 3 An introduction to predictive analysis

Chapter 10 How to make predictions with a linear regression model

Introduction to predictive analysis

Types of predictive models

Introduction to regression analysis

How to find correlations between variables

The Housing dataset

How to identify correlations with a scatter plot

How to identify correlations with a grid of scatter plots

How to identify correlations with r-values

How to identify correlations with a heatmap

How to use Scikit-learn to work with a linear regression

A procedure for creating and using a regression model

The function and methods for linear regression models

How to create, validate, and use a linear regression model

How to plot the predicted data

How to plot the residuals

How to plot regression models with Seaborn

The lmplot() method and some of its parameters

How to plot a simple linear regression

How to plot a logistic regression

How to plot a polynomial regression

How to plot a lowess regression

How to use the residplot() method to plot the residuals

Chapter 11 How to make predictions with a multiple regression model

A simple regression model for a Cars dataset

The Cars dataset

How to create a simple regression model

How to plot the residuals of a simple regression

How to work with a multiple regression model

How to create a multiple regression model

How to plot the residuals of a multiple regression

How to work with categorical variables

How to identify categorical variables

How to review categorical variables

How to create dummy variables

How to rescale the data and check the correlations

How to create a multiple regression that includes dummy variables

How to improve a multiple regression model

How to select the independent variables

How to test different combinations of variables

How to use Scikit-learn to select the variables

How to select the right number of variables

Section 4 The case studies

Chapter 12 The Polling case study

Get and display the data

Import the modules that you will need

Get the data

Display the data

Clean the data

Examine the data

Drop columns and rows

Rename columns

Fix object columns

Fix data

Take an early plot with Pandas

Save the DataFrame

Prepare the data

Add columns for grouping and filtering

Create a new DataFrame in long form

Take an early plot of the long data with Seaborn

Add monthly bins to the DataFrame

Add an average percent column for each month

Save the wide and long DataFrames

Analyze the data

Plot the national and swing state polls

Plot the voter types

Plot the last two months of polling

Plot the gap changes in selected states

More preparation and analysis

Prepare the gap data for the last week of polling

Plot the gap data for the last week of polling

Prepare the weekly gap data for the swing states

Plot the weekly gap data for the swing states

Chapter 13 The Forest Fires case study

Get the data

Connect and query the database

Import the data into a DataFrame

Clean the data

Examine the data

Improve the readability of the data

Drop unnecessary rows

Drop duplicate rows

Convert dates to datetime objects

Check for missing contain dates

Prepare the data

Add fire_month and days_burning columns

Examine the contain_date and days_burning columns

Analyze the data

Analyze the data for California

Two more plots for California fires

Rank the states by total acres burned

Prepare a DataFrame for total acres burned by year within state

Prepare a DataFrame for the top 4 states

Plot the acres burned total by year for the top 4 states

Review the 20 largest fires in California

Use GeoPandas to plot the fires on a map

Use GeoPandas to plot the California map

Use GeoPandas or Seaborn to plot the California fires on a map

Plot the fires in the continental United States

Chapter 14 The Social Survey case study

Introduction to the Social Survey

Build a DataFrame for the metadata

The employment data

Use the codebook and read the data that you want

Prepare the data

Plot the data and reduce the number of categories

Plot the total counts of the responses

Convert the counts to percents and plot them

The work-life balance data

Search the codebook for small question sets

Read and review the work-life data

Plot the responses for the first question

Plot the responses for the second and third questions

How to expand the scope of the analysis

Use the codebook to find related columns

Use the codebook to find follow-up questions

Select the columns for an expanded DataFrame

Bin the data for a column

How to use a hypothesis to guide your analysis

Develop and test a first hypothesis

Develop and test a second hypothesis

Develop and test a third hypothesis

Chapter 15 The Sports Analytics case study

Get the data and build the DataFrame

Get the data

Build the DataFrame

Clean the data

Locate and drop unneeded rows

Locate and drop unneeded columns

Convert the game_date column to datetime data

Prepare the data

Add a column for the season

Add a column for the shot result

Add a column for points made for each shot

Add three summary columns

Plot the summary data

Plot the points per game by season

Plot the averages of shots, shots made, and points per game by season

Plot the shot locations

Plot the shot locations for two games

Plot the shot locations for two seasons

Plot the shot density for one season

Plot the shot density for two seasons

Appendixes

Appendix A How to set up Windows for this book

How to download the files for this book

How to install Anaconda

How to use the Anaconda Navigator

How to create the murach environment

How to unzip some data and test your setup

How to use the Anaconda Prompt

Appendix B How to set up macOS for this book

How to download the files for this book

How to install Anaconda

How to use the Anaconda Navigator

How to create the murach environment

How to unzip some data and test your setup

How to use Terminal with an environment

If you aren’t familiar with the supporting materials that we provide for our books, please visit the About our Courseware page to learn what we provide and how each component can work for you and your students.

If you’re already familiar with our supporting materials from other books, here’s a quick summary of the courseware available for this book.

End-of-chapter activities in the book

  • Terms list
  • Summary bullets
  • Exercises

Student download

  • All examples presented in the book
  • The starting points for the exercises at the end of each chapter
  • The solutions to those exercises
  • The case studies presented in the book

Appendixes A (Windows) and B (macOS) give your students instructions for downloading these files.

Instructor’s materials

  • Instructional objectives
  • Test banks
  • Projects for student practice and evaluation
  • Case studies for more extensive practice and evaluation
  • PowerPoint slides

We provide the files for all of these materials in a zip file. After unzipping this file, you can import these materials into any LMS.

We also provide a Canvas course file that contains most of these files. If you’re using the Canvas LMS, you can use this file to import these materials with just a few clicks.

On this page, we’ll be posting answers to the questions that come up most often about this book. So if you have any questions that you haven’t found answered here at our site, please email us. Thanks!

There are no book corrections that we know of at this time. But if you find any, please email us, and we’ll post any corrections that affect the technical accuracy of the book here. Thank you!

Murach college books and courseware since 1974