Prepared for Participants of JALT 2023
Chris Pirotto
Senior Lecturer, Fukui University of Technology
chris.pirotto@gmail.com
Daniel Cook
Assistant Professor of Foreign Language, Fukui University of Technology
cook.r.dan@gmail.com
This notebook is publicly hosted for viewing here.
The notebook and other assets are located here.
This Jupyter Notebook has been written and designed to introduce researchers in the field of TESOL to powerful tools that can be used for data analysis.
In order to introduce these tools and what they are capable of, we present a walkthrough of a research project that should be familiar to the audience. We hope this familiarity reduces the friction involved in understanding the tools introduced.
The Notebook is meant to be read from top to bottom. We will first introduce you to common tools used in data analysis, then walk through a typical workflow of importing data, inspecting it, cleaning it, and finally analyzing it. This is not short; it may take at least an hour to read through.
We understand that you may not have experience with programming - that is okay! Python is quite an intuitive programming language. We have tried to provide examples that are fairly simple and also show some intermediate uses here and there. In some cases complex parts are broken down into a series of small steps.
From Daniel: I have to give a huge amount of credit to the many authors of the resources found at the end. The majority of this has been written from knowledge gained over years of working with Python and my gleanings from the documentation for Pandas and matplotlib.
Enjoy and happy coding!
Watch an overview of this Notebook:
The presentation for this notebook was given at JALT2023, and a portion was done live. You can watch the pre-recorded portion, a very brief overview of this notebook, below on youtube.com
The tools that we will use are Python, pandas, and Jupyter notebooks. The following is a very brief introduction to each of these.
Python is a high-level, versatile, and easy-to-read programming language. It's known for its simplicity and readability, making it a popular choice for beginners and experienced developers alike.
There are many programming languages and some have very special use cases. Keep in mind that a programming language is only an interface between you (the programmer) and the computer.
Essentially, with any programming language, we are telling the computer what to do.
The following is a simple code snippet. It has a comment that is essentially ignored by the computer, and a statement that would be executed. This tells the computer to print "Hello, World!" to the screen.
# Comments start with a hash
print(f"Hello, World!")
Hello, World!
There are a great number of software tools for data analysis, with a number of similarities between them. We're not advocating one tool over another, but simply want to say that it's best to use the tool that suits your needs. That being said, wouldn't it be interesting if you could make your own tools? Utilizing a programming language gives you the opportunity to do just that.
Here are some other reasons we think using Python is great:
Pandas is a popular Python library for data manipulation and analysis. It provides easy-to-use data structures and functions to work with structured data, such as tables and spreadsheets. Pandas simplifies tasks like data cleaning, filtering, and statistical analysis, making it a cornerstone of data science and analysis in Python.
If you have experience using a spreadsheet, you will know that it is a table that holds data. Pandas uses a similar structure to hold data, called a DataFrame. Dataframes have a number of attributes and methods that are at the core of using the library.
You can do all of the things you would expect to do with such a structure: sorting, filtering, grouping, selecting, etc.
Another important data structure that Pandas uses is called a Series. It might be helpful to think of this as a single column from a DataFrame, but it also has an index.
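If it helps to see these two structures in action, here is a minimal sketch (the column names and values below are made up purely for illustration) that builds a small DataFrame from a dictionary and pulls one column out as a Series:
import pandas as pd
# Build a tiny DataFrame from a dictionary (illustrative data only)
example_df = pd.DataFrame({
    "student": ["Aiko", "Ben", "Chie"],
    "score": [72, 85, 64],
})
# Selecting a single column returns a Series (values plus an index)
example_df["score"]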
Picture Credit: jupyter.org
Python and Pandas can be used in a number of contexts, such as standalone scripts, but for the purpose of data analysis there is a wonderful tool called Jupyter Notebooks.
What is a notebook? From the Jupyter documentation:
A notebook is a shareable document that combines computer code, plain language descriptions, data, rich visualizations like 3D models, charts, graphs and figures, and interactive controls. A notebook, along with an editor (like JupyterLab), provides a fast interactive environment for prototyping and explaining code, exploring and visualizing data, and sharing ideas with others.
With notebooks you break your work into cells. A cell can contain code, Markdown, or raw text. Users can run code cells to see immediate results, making it an ideal tool for experimenting, documenting, and presenting data analysis and research in a collaborative and reproducible way.
Each cell takes input, and then depending on the contents, produces output. Since you can see immediate results, it is an ideal tool for experimenting, documenting, and presenting data analysis and research.
It also helps the process of your analysis or research to be reproducible from start to finish.
print("Hello JALT")
Hello JALT
Even though cells appear separate, they exist in the same namespace. If you define a variable a in one cell, it can be used or overwritten in another. When the notebook evaluates a cell, if the last line is a variable, function, or object, its value will be used as the output:
a = "I eat pizza"
a
'I eat pizza'
We can reuse/change variables from other cells that have been executed. Let's add something else to the variable that we made above:
a = a + " every. single. day. 🍕😜"
a
'I eat pizza every. single. day. 🍕😜'
Now that you've had a brief introduction to some of these tools, let's put them to use.
In order to demonstrate how these tools can be used in a common research workflow, we created a research project. We made something that should be familiar to anyone doing research in the field of TESOL.
The Research Question:
Does incorporating pizza-related content into language learning materials positively impact English as a Foreign Language (EFL) learners' language acquisition, motivation, and confidence?
To explore this question, we can look at how the number of hours of exposure to the material correlates with motivation and confidence.
Some details about the data:
The data for this project is stored in two Excel files:
To analyze the data that we collected, we first need to import pandas. It's common to give pandas an alias of pd:
import pandas as pd
You can load data from various sources, such as CSV files, Excel spreadsheets, databases, or web APIs. Pandas reads the data and makes a DataFrame.
Dataframes are kind of like spreadsheets - they have rows and columns. They also contain other important information about the data.
This is how we can load an Excel file into a DataFrame that will be assigned to a variable pre_df:
pre_df = pd.read_excel("pre_pizza_material.xlsx")
We can see the top 5 rows by using a method called head():
pre_df.head()
ID Number | Name | Last Name | Age | Sex | yrs_eflexp | favorite_topping | pineapple | exp_pzmat | score | motivation | confidence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 91600124 | John | Smith | 18 | Male | 13 | Pepperoni | y | n | 71 | 2 | 2.0 |
1 | 91600185 | Olivia | Johnson | 19 | Female | 9 | Sausage | y | n | 45 | 2 | 3.0 |
2 | 91600140 | James | Brown | 19 | Male | 14 | Onions | y | n | 33 | 2 | 2.0 |
3 | 91600145 | Emma | Williams | 19 | Female | 10 | Olives | n | n | 62 | 1 | 2.0 |
4 | 91600173 | William | Jones | 18 | Male | 11 | Green Peppers | n | n | 56 | 1 | 1.0 |
We can do the same process for the post-material data, loading it into a variable post_df:
post_df = pd.read_excel("post_pizza_material.xlsx")
post_df.head()
ID Number | hrs_pzmat | score | motivation | confidence | favorite_topping | |
---|---|---|---|---|---|---|
0 | 91600124 | 25 | 82 | 3 | 2 | Pineapple |
1 | 91600185 | 14 | 76 | 2 | 5 | Pineapple |
2 | 91600140 | 20 | 68 | 2 | 3 | Pineapple |
3 | 91600145 | 11 | 90 | 1 | 4 | Pineapple |
4 | 91600173 | 18 | 78 | 4 | 4 | Pineapple |
We now have two dataframes that we need to merge together. Conceptually, this is what happens:
We will use Pandas' merge() function to merge the two dataframes on the ID Number column. Some of the other column names are the same, and when that happens pandas will automatically add suffixes. We want specific suffixes, so we will instruct it to use _pre and _post.
We will assign the merged dataframe to a variable df, which will be used throughout the rest of the notebook. The variable name could be anything, but it's common to use df.
df = pd.merge(
pre_df, post_df,
on="ID Number",
suffixes=("_pre", "_post"),
validate="one_to_one"
)
# Let's see it!
df.head()
ID Number | Name | Last Name | Age | Sex | yrs_eflexp | favorite_topping_pre | pineapple | exp_pzmat | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | favorite_topping_post | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 91600124 | John | Smith | 18 | Male | 13 | Pepperoni | y | n | 71 | 2 | 2.0 | 25 | 82 | 3 | 2 | Pineapple |
1 | 91600185 | Olivia | Johnson | 19 | Female | 9 | Sausage | y | n | 45 | 2 | 3.0 | 14 | 76 | 2 | 5 | Pineapple |
2 | 91600140 | James | Brown | 19 | Male | 14 | Onions | y | n | 33 | 2 | 2.0 | 20 | 68 | 2 | 3 | Pineapple |
3 | 91600145 | Emma | Williams | 19 | Female | 10 | Olives | n | n | 62 | 1 | 2.0 | 11 | 90 | 1 | 4 | Pineapple |
4 | 91600173 | William | Jones | 18 | Male | 11 | Green Peppers | n | n | 56 | 1 | 1.0 | 18 | 78 | 4 | 4 | Pineapple |
Now that we have a dataframe, it is crucial to learn how to view and select different parts of the data. We may also want to change columns and save the result. This section gives an overview of some of the methods for doing so.
Apart from head(), Pandas has some convenient ways to get an overview of the data.
A dataframe has a shape attribute that gives the number of rows and columns:
df.shape
(72, 17)
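Because shape is just a (rows, columns) tuple, you can also unpack it into variables if that reads better in your report; a small sketch:
# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df.shape
print(f"The dataset has {n_rows} participants and {n_cols} columns.")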
A dataframe's describe() method will generate descriptive statistics for all numeric columns.
Usually it prints long float values, but we can round them to two decimal places using round().
round(df.describe(), 2)
ID Number | Age | yrs_eflexp | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | |
---|---|---|---|---|---|---|---|---|---|---|
count | 72.00 | 72.00 | 72.00 | 72.00 | 72.00 | 71.00 | 72.00 | 72.00 | 72.00 | 72.00 |
mean | 91600158.50 | 19.54 | 9.83 | 48.11 | 1.94 | 2.27 | 25.79 | 75.86 | 3.85 | 3.26 |
std | 20.93 | 1.32 | 2.69 | 16.67 | 0.98 | 0.72 | 69.36 | 12.89 | 1.03 | 1.29 |
min | 91600123.00 | 18.00 | 5.00 | 11.00 | 1.00 | 1.00 | 2.00 | 34.00 | 1.00 | 1.00 |
25% | 91600140.75 | 19.00 | 7.75 | 34.00 | 1.00 | 2.00 | 13.75 | 66.75 | 3.00 | 2.00 |
50% | 91600158.50 | 19.00 | 10.00 | 48.50 | 2.00 | 2.00 | 16.00 | 76.00 | 4.00 | 3.50 |
75% | 91600176.25 | 20.25 | 12.00 | 61.25 | 3.00 | 3.00 | 24.00 | 85.25 | 5.00 | 4.00 |
max | 91600194.00 | 22.00 | 14.00 | 79.00 | 5.00 | 4.00 | 603.00 | 99.00 | 5.00 | 5.00 |
If you spend a few moments looking at the result of describe(), you may have some questions:
Another method, info(), is used to list the columns. It shows the column index (#), Column name, the Non-Null Count, and Dtype.
Dtype is the data type of the column. With our data, some columns are int64 (integers) or float64 (floats), but most are object because they contain strings.
At the bottom of the printed table, we can see counts of each data type and how much memory the dataframe is consuming.
We can inspect the Non-Null Count column to see if we have any missing values.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 72 entries, 0 to 71
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID Number              72 non-null     int64
 1   Name                   72 non-null     object
 2   Last Name              72 non-null     object
 3   Age                    72 non-null     int64
 4   Sex                    72 non-null     object
 5   yrs_eflexp             72 non-null     int64
 6   favorite_topping_pre   72 non-null     object
 7   pineapple              72 non-null     object
 8   exp_pzmat              72 non-null     object
 9   score_pre              72 non-null     int64
 10  motivation_pre         72 non-null     int64
 11  confidence_pre         71 non-null     float64
 12  hrs_pzmat              72 non-null     int64
 13  score_post             72 non-null     int64
 14  motivation_post        72 non-null     int64
 15  confidence_post        72 non-null     int64
 16  favorite_topping_post  72 non-null     object
dtypes: float64(1), int64(9), object(7)
memory usage: 10.1+ KB
A dataframe is certainly similar to a spreadsheet. We can see parts of it with methods like head(), but how can we select and view different parts of it? The following will illustrate some methods for doing this.
These and other selection and viewing skills are essential building blocks for cleaning and analyzing the data.
The syntax is like this: dataframe[column_name]. Note that when you select a column, what is returned is a Series.
# Let's select the Age column in our dataframe
df["Age"]
0     18
1     19
2     19
3     19
4     18
      ..
67    22
68    19
69    22
70    18
71    20
Name: Age, Length: 72, dtype: int64
The above used one column name. If we want more than one column, we can provide the names as a list. Now, instead of a Series, what is returned is a DataFrame.
df[["Age", "Sex"]] # The list is ["Age", "Sex"]
Age | Sex | |
---|---|---|
0 | 18 | Male |
1 | 19 | Female |
2 | 19 | Male |
3 | 19 | Female |
4 | 18 | Male |
... | ... | ... |
67 | 22 | Female |
68 | 19 | Male |
69 | 22 | Female |
70 | 18 | Male |
71 | 20 | Female |
72 rows × 2 columns
We can select rows (one or many). To select rows by the index label we can use loc.
At the risk of repeating the last sentence, loc is used for selecting by index label. That label could be a string (like "blue", "red", "green") or even a date. In our dataframe, the index was generated automatically and happens to be integers.
Let's try it. We select rows from the index label 4 up to and including the label 10:
df.loc[4:10]
ID Number | Name | Last Name | Age | Sex | yrs_eflexp | favorite_topping_pre | pineapple | exp_pzmat | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | favorite_topping_post | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 91600173 | William | Jones | 18 | Male | 11 | Green Peppers | n | n | 56 | 1 | 1.0 | 18 | 78 | 4 | 4 | Pineapple |
5 | 91600156 | Ava | Davis | 19 | Female | 13 | Chicken | y | n | 42 | 2 | 2.0 | 16 | 66 | 5 | 5 | Pineapple |
6 | 91600128 | Benjamin | Miller | 19 | Male | 14 | Bacon | y | n | 79 | 2 | 3.0 | 22 | 89 | 3 | 2 | Pineapple |
7 | 91600125 | Sophia | Moore | 19 | Female | 9 | Sausage | n | n | 38 | 2 | 3.0 | 10 | 60 | 5 | 3 | Pineapple |
8 | 91600151 | Mason | Taylor | 19 | Male | 12 | Onions | n | n | 50 | 3 | 2.0 | 27 | 80 | 4 | 1 | Pineapple |
9 | 91600176 | Mia | Anderson | 19 | Female | 9 | Olives | n | n | 53 | 2 | 1.0 | 16 | 72 | 4 | 4 | Pineapple |
10 | 91600149 | Noah | Wilson | 18 | Male | 9 | Pepperoni | y | n | 69 | 2 | 2.0 | 12 | 86 | 4 | 2 | Pineapple |
We can select a slice of the table by indicating the index labels (like above) and a column name.
df.loc[4:10, "Name"]
# note that you could provide more columns as a list.
4      William
5          Ava
6     Benjamin
7       Sophia
8        Mason
9          Mia
10        Noah
Name: Name, dtype: object
What if we want to select rows from the dataframe that match a certain criteria? This is a filtering process, or what might be referred to as applying a mask. In the pandas documentation this is referred to as boolean indexing.
A mask is actually a Series that contains only boolean (True/False) values. Here's a diagram showing how a mask is made:
It's possible to combine multiple criteria together in order to make a mask, but let's start with something simple.
In the following cell we have the basic code for creating a mask, however, we will not assign the mask to a variable just so that we can see what it does.
# "Show us whether or not the values in the 'favorite_topping_pre' column are equal to 'Pineapple'"
df["favorite_topping_pre"] == "Pineapple"
0     False
1     False
2     False
3     False
4     False
      ...
67    False
68    False
69    False
70    False
71    False
Name: favorite_topping_pre, Length: 72, dtype: bool
We get a Series of True/False values. If the value is True, that means the corresponding row had a participant who indicated Pineapple as their favorite topping in the pre-material survey.
We use these boolean Series a lot in pandas. When we use them as a mask, pandas checks whether each row meets the condition (True is a match). Let's see how that works in a diagram:
Let's try it out. In the example below, we first make a mask and then apply it to the dataframe.
# Make a mask
mask = df["favorite_topping_pre"] == "Pineapple"
# Apply the mask (only shows where the mask is 'True')
df[mask]
ID Number | Name | Last Name | Age | Sex | yrs_eflexp | favorite_topping_pre | pineapple | exp_pzmat | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | favorite_topping_post | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 91600163 | Mia | White | 19 | Female | 12 | Pineapple | n | n | 36 | 2 | 4.0 | 13 | 59 | 3 | 5 | Pineapple |
31 | 91600141 | Sophia | Hall | 21 | Female | 7 | Pineapple | y | n | 63 | 3 | 2.0 | 25 | 79 | 5 | 3 | Pineapple |
51 | 91600139 | Lily | Garcia | 18 | Female | 14 | Pineapple | n | n | 34 | 1 | 2.0 | 28 | 76 | 5 | 1 | Pineapple |
A special note here:
We applied a mask to the dataframe above, but that does not mean that we changed the dataframe. The result is simply a view of the dataframe.
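If you want to keep the filtered rows for later use, you can assign the result to its own variable (the name pineapple_df below is just an example); calling copy() makes it an independent dataframe rather than a view of the original:
# Keep the filtered rows as a separate dataframe
# copy() ensures later changes don't affect (or warn about) the original df
pineapple_df = df[mask].copy()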
Both making a column and overwriting a column use the same approach. If the column name provided doesn't exist a new one will be made.
Let's try making a new column called start_efl_year. This will be the difference of Age and yrs_eflexp. If you're familiar with a spreadsheet program, the process to do this might be to type equals, select a cell from one column, type minus, select a cell from another column, press enter, then do a fill down.
In pandas, this kind of process is greatly simplified:
# Make a new column 'start_efl_year'
# In some sense, you can imagine that this is going through row by row and doing the subtraction
# and the new value is being placed in the new column.
df["start_efl_year"] = df["Age"] - df["yrs_eflexp"]
There are a few columns that we do not need for our analysis. We use the drop() method to delete columns:
columns_to_drop = ["ID Number", "Name", "Last Name"] # a list of column names
df = df.drop(columns_to_drop, axis=1) # axis 1 means "column"
If you need to save your dataframe to a file, you can do that anytime. There are a number of conversion formats (18, actually), including CSV, TSV, and Excel. Here's how to save it as a CSV file called pzmat_data_merged.csv:
df.to_csv("pzmat_data_merged.csv", index=False) # we don't want the index
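If you prefer to stay in Excel format, to_excel() works in much the same way (it may require an Excel writer package such as openpyxl to be installed); a quick sketch:
# Save to an Excel file instead of CSV (requires an Excel writer such as openpyxl)
df.to_excel("pzmat_data_merged.xlsx", index=False)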
Now that we have imported the data, we can move on to cleaning it.
Note that in the following we aren't recommending specific strategies for dealing with certain issues; this section exists only to demonstrate different ways that you could deal with common data issues. You should decide how to best clean the dataset that you have given the context of your research.
The following issues and tasks have been prepared with the data beforehand in order to demonstrate some common data cleaning tasks:
When looking at the output from df.info(), we saw a null value in the 'confidence_pre' column.
We can see which row this is by applying a mask. Basically we're telling the computer, "show me the rows where the value in the column 'confidence_pre' is null."
Let's break down how that would be done into several steps:
STEP 1: We want the confidence_pre column.
target_column = df["confidence_pre"]
# target_column will be a Series
STEP 2: We want to know which of the values is null.
With pandas, empty cells are referred to as NaN (Not a Number). This comes from NumPy, the numerical library that pandas uses under the hood.
We will make a mask checking for empty values using the pandas isna() function:
confidence_is_null_mask = pd.isna(target_column)
# result is a boolean Series
STEP 3: We apply that mask to the dataframe.
Any row of the dataframe where the mask is True will be displayed.
df[confidence_is_null_mask]
Age | Sex | yrs_eflexp | favorite_topping_pre | pineapple | exp_pzmat | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | favorite_topping_post | start_efl_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
71 | 20 | Female | 7 | Chicken | n | n | 21 | 3 | NaN | 14 | 34 | 5 | 4 | Pineapple | 13 |
STEP 4: We decide what to do about this Missing Data.
Pandas has a number of built in methods for filling in missing data.
We will fill this with the most frequent value which requires a bit more work.
So, we're going to tell the computer to "fill any empty cells with the most frequent value from the column 'confidence_pre'."
STEP 5: Get the most frequent value from confidence_pre
Dataframes and Series have a value_counts() method that counts the frequency of each value. By default the result is sorted in descending order. Using value_counts() gives us a Series, something like this:
index | count |
---|---|
2.0 | 45 |
3.0 | 15 |
1.0 | 6 |
4.0 | 5 |
We want the index value (not the count result) of whatever ends up in the first row.
confidence_value_counts = df["confidence_pre"].value_counts()
# result is a Series
# Get the index value of the most frequent
# index 0 is the first one
most_frequent_value = confidence_value_counts.index[0]
# Let's see what it is
most_frequent_value
2.0
STEP 6: Fill in the missing value.
After we get the value, we'll use the fillna() method on the confidence_pre column to fill in that value. Note: this would fill in any missing values in that column.
# Replace the missing value using fillna()
df["confidence_pre"] = df["confidence_pre"].fillna(most_frequent_value)
There is a spelling mistake in favorite_topping_pre.
We can see all of the values in the column with the unique() method. There aren't many different values, so we should be able to spot the spelling mistake.
df["favorite_topping_pre"].unique()
array(['Pepperoni', 'Sausage', 'Onions', 'Olives', 'Green Peppers', 'Chicken', 'Bacon', 'Mushrooms', 'Pineapple', 'Pinapple'], dtype=object)
This doesn't tell us how many times Pinapple occurs, but we can use the replace() method to fix it. You'll notice that we're feeding a Python dictionary into the method. The dictionary key is the mistaken spelling and the value is the correct spelling. This will replace any of those mistakes for the entire column.
# Let's change "Pinapple" to "Pineapple"
df["favorite_topping_pre"] = df["favorite_topping_pre"].replace({"Pinapple": "Pineapple"})
Students should have a maximum of 50 hours with the material. We can expect that any value less than 0 or greater than 50 will be wrong.
# We will make a mask that is combined with two conditions.
# Notice the two conditions are separated by a pipe character which means OR
# So we are looking for anything less than 0 OR greater than 50
out_of_range_mask = (df["hrs_pzmat"] < 0) | (df["hrs_pzmat"] > 50)
# Let's see who is out of range
df[out_of_range_mask]
Age | Sex | yrs_eflexp | favorite_topping_pre | pineapple | exp_pzmat | score_pre | motivation_pre | confidence_pre | hrs_pzmat | score_post | motivation_post | confidence_post | favorite_topping_post | start_efl_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28 | 19 | Male | 12 | Chicken | y | n | 37 | 3 | 2.0 | 603 | 63 | 4 | 3 | Pineapple | 7 |
We could fix this by telling the computer, "Give this one student the average hours of male students his age."
After getting the value, we will apply it using loc, which is what we use to access a group of rows. loc is also one of the few ways that we can assign a new value.
# Now we need a mask for those participants that are in range
# This happens to be the inverse of the out_of_range_mask we made above
in_range_mask = ~out_of_range_mask # invert a mask with a prefixed tilde ~
# We need another mask for Male participants that are 19
# These are separated by an ampersand (&) meaning a logical AND
male_19_mask = (df["Age"] == 19) & (df["Sex"] == "Male")
# We can now apply the mask to the dataframe
# Notice that we are combining with AND:
# in_range_mask & male_19_mask
# Then we get the mean for the column, and finally round it
value = df[in_range_mask & male_19_mask]["hrs_pzmat"].mean()
average_hours_19_male = round(value)
# Let's see what it is
average_hours_19_male
20
# Now we will apply the value
# We already have the masks that we need in order to select the row we want
# We also need to specify the column we are assigning the new value to
df.loc[out_of_range_mask & male_19_mask, "hrs_pzmat"] = average_hours_19_male
Admittedly, this approach is slightly overkill for this one row, but it is meant to show how flexible and powerful masks can be. We had one 19-year-old male with a problem value. What if we had another participant who was female and we wanted to assign a different value? We could easily make a different mask so that we aren't applying the wrong values to the wrong group. Now think about how helpful these masks would be if you were dealing with many thousands of rows and had hundreds of such errors.
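As a sketch of that idea (purely hypothetical here, since our data only had the one problem row), a mask for 19-year-old female participants would look almost identical and could be combined with the out-of-range mask to assign a different replacement value:
# Hypothetical: a mask for 19-year-old female participants
female_19_mask = (df["Age"] == 19) & (df["Sex"] == "Female")
# It could then be combined with other masks to assign a value, e.g.:
# df.loc[out_of_range_mask & female_19_mask, "hrs_pzmat"] = some_other_value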
Students also had to fill in y or n for some parts of the form. There is a wrong value in the pineapple column.
# Make a mask that would show anything that is neither "y" nor "n"
yes_no_mask = (df["pineapple"] != "y") & (df["pineapple"] != "n")
# Apply the mask and look at the pineapple column
df[yes_no_mask]["pineapple"]
50    NO!
Name: pineapple, dtype: object
It looks like there is an emphatic "NO!" for one student. The value should be "n", so it's an easy fix with the replace() method.
df["pineapple"] = df["pineapple"].replace({"NO!": "n"})
Examine these strings: "Pineapple", "pineapple", and "PINEAPPLE".
Are they the same? They certainly are on some level; however, in Python code these are not equal since they have a mixture of upper and lower case. If we tried to group pizza toppings, we would end up with three pineapple groups.
One simple approach to solve this is to make strings have the same case.
In our dataframe, we have several categorical columns that have such strings. They would have been text inputs on the forms that participants filled out.
We want to change all of these columns to the same case. This could be a lot of code to write, depending on the number of columns. But we're using a programming language! We can make a list of columns and then iterate through them in a for loop. Within the loop, we use code that changes the column to uppercase.
# Make a list of columns we will convert to uppercase
text_input_columns = ["favorite_topping_pre", "favorite_topping_post", "Sex"]
# Make a 'for' loop
for column in text_input_columns:
# for every column in the list of columns, change the string to uppercase
df[column] = df[column].str.upper()
# Let's see the result by selecting those columns
df[text_input_columns]
favorite_topping_pre | favorite_topping_post | Sex | |
---|---|---|---|
0 | PEPPERONI | PINEAPPLE | MALE |
1 | SAUSAGE | PINEAPPLE | FEMALE |
2 | ONIONS | PINEAPPLE | MALE |
3 | OLIVES | PINEAPPLE | FEMALE |
4 | GREEN PEPPERS | PINEAPPLE | MALE |
... | ... | ... | ... |
67 | OLIVES | PINEAPPLE | FEMALE |
68 | ONIONS | PINEAPPLE | MALE |
69 | GREEN PEPPERS | PINEAPPLE | FEMALE |
70 | PEPPERONI | PINEAPPLE | MALE |
71 | CHICKEN | PINEAPPLE | FEMALE |
72 rows × 3 columns
In Python we have a bool type, which is important for comparisons. We have a few columns that use binary values (yes or no), and leaving them as-is would create more work later. Some built-in operations in Python work with bools but wouldn't work well with strings of "y" or "n". The same is true for Pandas.
To fix this, we will loop through the columns that have yes/no values and replace each value with True or False.
A simpler approach would be to use replace() as we did earlier. However, this is a good opportunity to show another very powerful method: apply(). The apply() method allows you to use functions to manipulate values. In the example below we will define a Python function, and then tell pandas to use that function with the apply() method.
Note that after we change all of the values in these columns to bools, pandas will automatically convert the dtype of those columns to bool. This can be checked with df.info().
# A function that will return True for a "y" value or False for a "n" value
def convert_yes_no(value):
if value == "y":
return True
elif value == "n":
return False
return value # if there wasn't a y / n, return the original value
# Specify the columns in a list
yes_no_columns = ["pineapple", "exp_pzmat"]
# Iterate over the list of columns and use apply to convert the values
for column in yes_no_columns:
# each value in the column will be passed to the convert_yes_no function
# the original value will be replaced by the returned value
df[column] = df[column].apply(convert_yes_no)
# Let's see the result by selecting those columns
df[yes_no_columns]
pineapple | exp_pzmat | |
---|---|---|
0 | True | False |
1 | True | False |
2 | True | False |
3 | False | False |
4 | False | False |
... | ... | ... |
67 | False | False |
68 | False | True |
69 | False | False |
70 | True | False |
71 | False | False |
72 rows × 2 columns
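For completeness, the replace() route mentioned above would look something like the sketch below; it produces the same True/False values without defining a function. (At this point in the notebook the columns are already converted, so running it again would simply have no further effect.)
# A sketch of the replace() alternative to apply()
for column in yes_no_columns:
    df[column] = df[column].replace({"y": True, "n": False})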
The test scores were recorded as-is, but we will convert them to a percentage (as a float). The tests had a total score of 100. To calculate the percentage we use the formula score / total.
We can easily apply such a formula to an entire column:
total_score = 100
df["score_pre"] = df["score_pre"] / total_score
df["score_post"] = df["score_post"] / total_score
# Let's see the effect
df[["score_pre", "score_post"]]
score_pre | score_post | |
---|---|---|
0 | 0.71 | 0.82 |
1 | 0.45 | 0.76 |
2 | 0.33 | 0.68 |
3 | 0.62 | 0.90 |
4 | 0.56 | 0.78 |
... | ... | ... |
67 | 0.11 | 0.63 |
68 | 0.54 | 0.87 |
69 | 0.34 | 0.67 |
70 | 0.52 | 0.67 |
71 | 0.21 | 0.34 |
72 rows × 2 columns
There are a number of methods that you can use to replace or fill in data. We've demonstrated several approaches above, but keep in mind that the approach must be suitable for your particular dataset.
Now that the data is cleaned, we can start analyzing it.
Pandas has a number of methods that give quick access to common statistics (standard deviation, variance, etc.). Near the beginning, we demonstrated the describe() method, which is a quick way to see such statistics for numeric columns.
For this section we're going to focus on some visual representations to understand our dataset.
First we want to understand the ratio of male and female participants in the form of a pie chart. This can be done in one line of code, but we'll break it down into some smaller steps.
STEP 1: Make a group.
The groupby() method is used to make a group. It can take the name of one column or a list of columns.
male_female_df = df.groupby("Sex")
# The above gives us the entire dataframe grouped by Sex
# But we only need the "Sex" column
male_female_series = male_female_df["Sex"]
STEP 2: Count each one in each group.
Usually when using groupby() in pandas, we do some kind of aggregation on the group. What we want to do is count how many are in each group:
male_female_counts = male_female_series.count() # will be a groupby series
male_female_counts
Sex
FEMALE    36
MALE      36
Name: Sex, dtype: int64
STEP 3: Make a Pie Chart
The plot() method is fairly straightforward to use. Here we want to plot a pie chart:
male_female_counts.plot.pie(title="Breakdown of Male/Female Participants", autopct='%1.1f%%')
<AxesSubplot: title={'center': 'Breakdown of Male/Female Participants'}, ylabel='Sex'>
# The above steps were broken down but could be done in one line:
# df.groupby("Sex")["Sex"].count().plot.pie(title="Breakdown of Male/Female Participants", autopct='%1.1f%%')
We can see the maximum, minimum, and mean age for participants:
min_age = df["Age"].min()
max_age = df["Age"].max()
mean_age = round(df["Age"].mean(), 2) # rounded to 2 decimal places
# Print out a sentence summarizing this:
print(f"""The minimum age of participants was {min_age}, \
the maximum was {max_age} and the mean was {mean_age}.""")
The minimum age of participants was 18, the maximum was 22 and the mean was 19.54.
Let's plot the Age distribution. Pandas does have a hist() method that would plot a histogram, but getting the expected visual would be slightly complicated in this case. Instead, we can use a simple bar graph. We can get the value counts (not sorted by frequency) and then plot them as a bar graph. We'll also use the mean from above in the title.
df["Age"].value_counts(sort=False).plot.bar(
title=f"Distribution of Age (Mean = {mean_age})",
xlabel="Age", # label the x-axis
ylabel="Frequency", # label the y-axis
rot=0, # rotate the x-axis labels
figsize=(5,3),
);
Next, let's break down the age of participants by Sex.
# Make a group using two columns
# select the Age column
# then do a count (aggregate)
sex_age_series = df.groupby(["Sex", "Age"])["Age"].count()
sex_age_series
Sex     Age
FEMALE  18     2
        19    15
        20     6
        21     4
        22     9
MALE    18    13
        19    16
        20     2
        21     4
        22     1
Name: Age, dtype: int64
# The result from above is a Series with a special index called a MultiIndex
# We need to unstack the series from above
# The unstack() method reshapes the Series
# The result is something like doing a pivot and will be a DataFrame
sex_age_df = sex_age_series.unstack()
sex_age_df
Age | 18 | 19 | 20 | 21 | 22 |
---|---|---|---|---|---|
Sex | |||||
FEMALE | 2 | 15 | 6 | 4 | 9 |
MALE | 13 | 16 | 2 | 4 | 1 |
# Plot the bar graph, indicating that it should be stacked
sex_age_df.plot.bar(
stacked=True,
title="Stacked Bar Chart of Age by Sex",
xlabel="Sex",
legend="reverse",
rot=0,
colormap="tab20",
figsize=(5,3),
);
We will do one special kind of visualization (a wordcloud) with another library called wordcloud.
If you don't have the library installed, you can install it with the command pip install wordcloud.
from wordcloud import WordCloud
We also need to use the matplotlib library to show the image. Pandas actually uses this library under the hood for all of the plots.
import matplotlib.pyplot as plt
# We need a python dictionary of the value_counts
words = df["favorite_topping_pre"].value_counts().to_dict()
words
{'ONIONS': 10, 'OLIVES': 9, 'GREEN PEPPERS': 9, 'CHICKEN': 9, 'BACON': 9, 'PEPPERONI': 8, 'SAUSAGE': 8, 'MUSHROOMS': 6, 'PINEAPPLE': 4}
# Make the wordcloud
wordcloud = WordCloud(
background_color="white"
).generate_from_frequencies(words)
# Use the matplotlib library to show the image
plt.figure(figsize=(5,3))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Let's see the difference between the pre and post variables.
Let's see how the participants fared on the test by using a box plot to visually compare the test scores.
# First we'll select and multiply the score columns
# (They are in float / percent format)
scores = df[["score_pre", "score_post"]] * 100
# We'll also rename the columns
scores.columns = ["Pre-Test Scores", "Post-Test Scores"]
# Make a box plot
scores.plot.box(
title="Pre/Post Score Box Plot Comparison",
ylabel="Score (Percent)",
figsize=(5,3)
);
Next, let's see if there was an effect on the motivation and confidence of participants after exposure to the pizza material. The process is the same as above.
# Select the columns we want to use (this will make a new dataframe we can use)
mot_conf_df = df[[
"motivation_pre",
"motivation_post",
"confidence_pre",
"confidence_post"
]]
# Rename the columns
mot_conf_df.columns = [
"Motivation (Pre)",
"Motivation (Post)",
"Confidence (Pre)",
"Confidence (Post)"
]
# Plot it with a title and specify the y ticks
mot_conf_df.plot.box(
title="Motivation and Confidence Pre/Post Comparison",
yticks=[1,2,3,4,5],
figsize=(7,3)
);
We can see that the pizza material had some effect on the participants. Let's visualize the correlation between the hours spent on the pizza material and the change in motivation.
First we'll make a new column indicating the motivation change.
Then we'll make a scatter plot where x and y are the dataframe columns to use.
df["motivation_change"] = df["motivation_post"] - df["motivation_pre"]
df.plot.scatter(
x="hrs_pzmat",
y="motivation_change",
title="Hours Spent on Material Correlated to Change in Motivation",
xlabel="Hours Spent on Material",
ylabel="Change in Motivation",
figsize=(5,3),
);
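If you also want the correlation as a number rather than a picture, pandas Series have a corr() method (Pearson's r by default); a small sketch:
# Pearson correlation between hours on material and the change in motivation
r = df["hrs_pzmat"].corr(df["motivation_change"])
round(r, 2)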
Obviously the best indication that the pizza material was successful is found by analyzing the participants' favorite pizza topping after they had exposure to the material. We can generate a wordcloud again. This is the best way, very scientific.
# We need a python dictionary of the value_counts
words = df["favorite_topping_post"].value_counts().to_dict()
# Make the wordcloud
wordcloud = WordCloud(
background_color="white"
).generate_from_frequencies(words)
# Use the matplotlib library to show the image
plt.figure(figsize=(5,3))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
100% of participants indicated that Pineapple was their favorite topping post-material exposure. Therefore treatment of pizza material was 100% effective.
There are a large number of resources here and a massive number of resources online. Most of the resources below are free to access.
Note that you do not need to be a master at Python in order to get started using it for data analysis. It may be enough to learn some Python basics and then piece together the things you need to do data analysis.
Remember this: one of the greatest resources is the documentation.
If you would like to try out Jupyter in the browser without the need to install anything on your computer, you can get started with these websites:
It's not possible to write down the installation steps for your specific machine here. Please see the documentation and/or check online for installation steps if you get stuck. Here are some general links:
There are a lot of amazing resources to help you learn python and pandas and some related tools. Here are some things to get you started, in no particular order:
Some people work hard at curating awesome resources - here are some of those collections:
If you want to take your data analysis game to the next level, be sure to check out some of these:
One of the benefits of using a programming language to wrangle and analyze your dataset is that you can create tests in order to ensure that your data has the expected values.
Imagine some of the following scenarios:
Python has an assert keyword that checks whether something is True or False. If it is True, the check passes. If it is False, the program raises an error, which normally means the process would exit with an error. Within a Jupyter notebook, you will see error output.
We can use assert to ensure that our datasets have values that fit within certain parameters.
Let's see an example of how to apply a test to a DataFrame column.
First we will make a mask. The hours spent on pizza material should be between 0 and 50, so the mask will be the inversion of that. We want to see anything in the DataFrame that has less than 0 OR more than 50 hours. Note that the pipe character | means OR.
mask = (df["hrs_pzmat"] < 0) | (df["hrs_pzmat"] > 50)
If we apply this mask to our DataFrame, we can check the length using the len() function. The length of the DataFrame is how many rows it has.
out_of_range = len(df[mask])
Now we assert that there are no rows out of range. Notice that we can provide a custom message that would be printed out if the assertion doesn't pass. If the assertion is True, nothing will actually happen, and that is what we want.
assert out_of_range == 0, "There are values in 'hrs_pzmat' that are out of range"
#This could be done with one line of code:
assert len(df[(df["hrs_pzmat"] < 0) | (df["hrs_pzmat"] > 50)]) == 0, "There are values in 'hrs_pzmat' that are out of range"
An interesting practical application of this kind of testing would be to write a function that tests for certain parameters. If the function was defined at the beginning of your notebook you could then use it throughout the notebook to test your progress as you wrangle data.
Here's an example:
def run_tests(df):
assert len(df[(df["hrs_pzmat"] < 0) | (df["hrs_pzmat"] > 50)]) == 0, "There are values in 'hrs_pzmat' that are out of range"
# other tests
run_tests(df)
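You could keep adding assertions for whatever your dataset should guarantee. As a hedged sketch (the exact ranges below are assumptions based on how this example dataset was described: 1-5 Likert-style ratings and scores converted to fractions of 1):
def run_tests(df):
    # Hours on material should stay within 0-50
    assert len(df[(df["hrs_pzmat"] < 0) | (df["hrs_pzmat"] > 50)]) == 0, "There are values in 'hrs_pzmat' that are out of range"
    # Assumption: Likert-style columns should only hold values 1-5
    assert df["motivation_post"].between(1, 5).all(), "'motivation_post' has values outside 1-5"
    # Assumption: scores were converted to fractions between 0 and 1
    assert df["score_post"].between(0, 1).all(), "'score_post' has values outside 0-1"
run_tests(df)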
To save a figure to a file, there are a few steps needed. One thing that is required is matplotlib's pyplot module. Usually people give it an alias of plt:
import matplotlib.pyplot as plt
Next, we will need to make a figure and axes and assign them to variables. This will result in an empty figure.
The behavior of plt is slightly different than what you might expect. After we make the figure and axes and then call plt.show(), we can't use this same figure in another cell, so it's best to do everything in one cell.
When we use the plot() method(s) in pandas, we can specify the axes to plot to.
At the end we will save the figure with the savefig() function, which requires a file name:
# Make a figure and axes
fig, ax = plt.subplots()
# plot the scores
df["score_pre"].plot.hist(
ax=ax,
title="Pre-Material Score Distribution (as Percent)",
)
# Save the figure
# This will also show it in the notebook
# The filename we will use is figure01.png
# This will be saved in the current directory
plt.savefig("figure01.png")
By default, savefig() infers the format from the file extension (such as .png, .pdf, or .svg); if no recognized extension is given, it falls back to PNG. See the savefig documentation for more options.
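For example, if you wanted a higher-resolution image or a vector file for publication, you might pass a dpi value or use a .pdf filename (a sketch; these lines would go in the same cell as the plot, in place of the savefig line above):
# A higher-resolution PNG with tighter margins:
# plt.savefig("figure01_hires.png", dpi=300, bbox_inches="tight")
# Or a vector format, inferred from the file extension:
# plt.savefig("figure01.pdf")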
You may want to customize a figure in some ways. Let's first make a simple bar graph that compares the mean hours spent on pizza material between males and females. It's debatable how useful such a bar graph could be, but for demonstration purposes it works.
df.groupby("Sex")["hrs_pzmat"].mean().plot.bar()
<AxesSubplot: xlabel='Sex'>
Great! But, it's somewhat difficult to make out what the actual numbers are. It would be great to have some labels. It may also be useful to have a horizontal line indicating the mean hours spent on material for all participants. If we want to be really picky, the labels on the x-axis should be horizontal, and we really should use some different colors for the bars.
Some of the customizations we want to apply can be done using only pandas, but going further requires using matplotlib, specifically its pyplot module. As mentioned previously, this is the library that pandas uses under the hood to generate figures.
It may be helpful to imagine the above figure as a 2D canvas with a horizontal (x) and vertical (y) axis. To add something to the canvas, we would need to specify where it should go. A canvas with a size of 2x2 would have a center spot of x=1 and y=1.
Now also consider that a figure has objects on it such as x and y axis labels, lines, and in the case of the above example, two bars.
Why does that matter? With code we are able to access and manipulate the properties of all of the figure's objects programmatically. For example, we can get the dimensions of the MALE bar in the figure above. We would need these if we want to add a label in the middle of that bar.
The code below utilizes these kinds of properties to customize the figure:
# We need to use matplotlib.pyplot for this
import matplotlib.pyplot as plt
# Make a figure and axes
fig, ax = plt.subplots()
# Use some different colors for male/female
colors = ["#a855f7", "#3b82f6"]
# plot the bar graph and assign the axes to use
title = "Average Hours Spent on Pizza Material, Split by Sex"
df.groupby("Sex")["hrs_pzmat"].mean().plot.bar(
ax=ax,
title=title,
ylabel="Average Hours",
color=colors,
rot=0,
)
# Iterate through each bar (patch) and annotate it
for patch in ax.patches:
# Get the actual value for the bar
value = round(patch.get_height(), 2)
# Get the x and y coordinates for the annotation
# based on the geometry of the bar
# we'll make them vertically aligned at 10
x_coord = patch.get_x() + patch.get_width() / 2
y_coord = 10
# Apply an annotation
ax.annotate(
str(value),
(x_coord, y_coord),
ha="center",
va="center",
color="white"
)
# Add a horizontal line to the axes
hrs_mean = df["hrs_pzmat"].mean()
ax.axhline(y=hrs_mean, color="#94a3b8", linestyle="-")
# Add some text just above the hrs_mean line
ax.text(
-0.45, hrs_mean + 0.3,
f"Overall Average ({round(hrs_mean, 2)})",
color="#334155"
)
# Show the figure
plt.show()