A chance recommendation from Wang Shuyi pointed me to this pandas video tutorial on YouTube. I spent some time working through it, and these notes are the result. The course is in English with no Chinese subtitles, so I kept my notes in English as well.
Using pandas to analyze traffic-stop data released by US state police
Prerequisites
The tutorial will not cover pandas basics. If you are new to pandas or just need a refresher, I recommend watching some videos from my free pandas course. Alternatively, you can review all of the code from my pandas course in this Jupyter notebook.
References
Getting started
Download the data
The original data
The data used in the tutorial
About the data
- For each state in the dataset, they provide data in three formats (we use the CSV format).
- If a column cannot be computed using the data a state has provided, it is set to NA.
- For several fields (e.g., driver_race) we include a "raw" column which records the original data values from which we infer standardized values
We include the raw columns because our data processing pipeline is extensive, requiring judgment calls and subjective decisions.
Column name | Column meaning | Example value |
--- | --- | --- |
id | The unique ID we assign to each stop. Contains the state and year. | VT-2011-00012 |
state(州) | The two-letter code for the state in which the stop occurred. | VT |
stop_date(被搜查的日期) | The date of the stop, in YYYY-MM-DD format. Some states do not provide the exact stop date: for example, they only provide the year or quarter in which the stop occurred. For these states, stop_date is set to the date at the beginning of the period: for example, January 1 if only year is provided. | 2011-11-27 |
stop_time(被警察搜查时的时间) | The 24-hour time of the stop, in HH:MM format. | 20:15 |
location_raw | The original data value from which we compute the county (or comparably granular location) in which the stop occurred. Not in a standardized format across states. | Winooski |
county_name | The standardized name of the county in which the stop occurred. | Chittenden County |
county_fips | The standardized 5-digit FIPS code in which the stop occurred. | 50007 |
district | In several states (e.g., Illinois) the stop county cannot be inferred, but a comparably granular location can. This comparably granular location is stored in the district column. Most states do not have this column. | ILLINOIS STATE POLICE 01 |
fine_grained_location | Any higher-resolution data about where the stop occurred: e.g., milepost or address. Not standardized across states. | 90400 I 89 N; EXIT 15 MM90/40 |
police_department | The police department or agency that made the stop. Not in a standard format across states. | WILLISTON VSP |
driver_gender | The driver’s gender, as recorded by the trooper. M, F, or NA. | M |
driver_age_raw | The original data value from which we compute the driver’s age when they were stopped. May be age, birth year, or birth date. Not in a standard format across states. | 1988 |
driver_age | The driver’s age when they were stopped. Set to NA if less than 15 or greater than or equal to 100. | 23 |
driver_race_raw | The original data value from which the driver’s standardized race is computed. Not in a standard format across states. | African American |
driver_race | The standardized driver race. Possible values are White, Black, Hispanic, Asian, Other, and NA, with NA denoting values which are unknown. Asian refers to Asian, Pacific Islander, and Indian. Native Americans/American Indians are included in the "other" category. Anyone with Hispanic ethnicity is classified as Hispanic, regardless of their recorded race. | Black |
violation_raw | The violation committed by the driver, in the language of the original data. Not in a standard format across states. Some stops have multiple violations. | Speeding (10–19 MPH Over Prima Facie Limit *) |
violation | The violation committed by the driver, standardized into categories which are consistent across states. | Speeding |
search_conducted(是否被搜查) | A TRUE/FALSE value indicating whether a search was performed. | TRUE |
search_type_raw | The justification for the search, in the language of the original data. NA if no search was performed. Not in a standard format across states. Some states have multiple justifications for a search. | CONSENT SEARCH CONDUCTED |
search_type | The normalized justification for the search. Where possible, this is standardized into categories which are consistent across states. For example, if something is clearly a consent search, search_type is referred to as “Consent”. | Consent |
contraband_found | A TRUE/FALSE value indicating whether a search was performed and contraband was found. FALSE if no search was performed. | TRUE |
stop_outcome | The outcome of the stop. Many states have idiosyncratic outcomes — for example, “CHP 215” in California — so this column is not standardized across states. “Citation” and “Warning” are the values which occur most commonly across states. If the stop has multiple outcomes, the most severe outcome is used. For example, if a stop resulted in a citation and a warning, stop_outcome would be “Citation”. | Citation |
is_arrested | A TRUE/FALSE value indicating whether an arrest was made. | TRUE |
- FIPS: Federal Information Processing Standards
- VT: the two-letter code for the US state of Vermont
- M, F: the recorded gender; M for male, F for female
- race: not "speed" or "racing" here, but the driver's race; the reference categories are listed below and may differ from those in the data (see the link for more)
- American Indian or Alaska Native
- Asian
- Black or African American
- Native Hawaiian or Other Pacific Islander
- White
- violation: the traffic violation committed
- NaN means the value is missing or irrelevant for that data row.
Know the data
```python
pd.read_csv()  # read a CSV file
```
Import the data and check the dtypes:
```python
import pandas as pd

ri = pd.read_csv("/Users/wittyfans/Desktop/pycon-2018-tutorial/police.csv")
ri.dtypes
```
Output:
```
stop_date    object
...
```
This shows every column and its dtype, giving you a basic feel for the data.
Age distribution
Race distribution
Hourly distribution of drug-related stops
Top ten violation types for men
Top ten violation types for women
Violation types compared by gender
Violations caught per year
Violations by hour and year
Handle the data
1. Remove the column that only contains missing values
Drop column
```python
ri.drop('column_name', axis='columns', inplace=True)
```
There is another way to drop a column:
```python
del ri['column_name']
```
You can do it that way and it works, but it's not the pandas way.
Tips:
- Pay attention to default arguments
- Check your work
- There is more than one way to do everything in pandas
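Step 1 above (removing columns that contain only missing values) can be sketched with `dropna`. The tiny frame below is made-up data standing in for `ri`, not the real police.csv:

```python
import pandas as pd
import numpy as np

# Toy frame; county_name is entirely missing (hypothetical values)
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "county_name": [np.nan, np.nan],
    "driver_gender": ["M", "F"],
})

# Drop every column whose values are all missing
df = df.dropna(axis="columns", how="all")
```

Note the `how="all"` default-argument trap: with the default `how="any"`, a single NaN is enough to drop the whole column.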
2. Do men or women speed more often?
Relevant columns: driver_gender, violation
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts()
```
output:
```
M    32979
...
```
normalize=True gives the percentages, i.e. if a stop is for speeding, how likely is the driver to be male or female:
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts(normalize=True)
```
output:
```
M    0.680527
...
```
Looking at it the other way around, for women:
```python
ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True)
```
output:
```
Speeding    0.658500
...
```
For men:
```python
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
```
output:
```
Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64
```
Combined into a single command:
```python
ri.groupby('driver_gender').violation.value_counts(normalize=True)
```
output:
```
driver_gender  violation
...
```
When you write a groupby, just think "for each value" (here, for each gender).
- count(): returns the number of non-missing values
- value_counts(): counts the occurrences of each unique value in the Series
- normalize=True: when set, value_counts() returns the relative frequencies of the unique values instead of raw counts
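The three behaviors above can be compared side by side on a toy Series (made-up values, not the real data):

```python
import pandas as pd

# Toy violation data for illustration only
violations = pd.Series(
    ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    name="violation",
)

n_rows = violations.count()                       # number of non-missing values
counts = violations.value_counts()                # occurrences of each unique value
shares = violations.value_counts(normalize=True)  # relative frequencies, summing to 1
```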
3. Does gender affect who gets searched during a stop?
Relevant columns: driver_gender, search_conducted
```python
ri.search_conducted.value_counts(normalize=True)
```
output:
```
False    0.965163
...
```
You can get the same result another way. How does that work? If the Series is boolean, you can always take a sum to know the number of True values, or a mean to get the fraction of True values; that simplifies your code.
```python
ri.search_conducted.mean()
```
output:
```
0.03483720473942948
```
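The sum/mean trick on booleans is easy to verify on a throwaway Series (made-up values):

```python
import pandas as pd

# Toy boolean Series standing in for ri.search_conducted
searched = pd.Series([True, False, False, True, False])

n_searches = searched.sum()    # True counts as 1, so sum == number of searches
search_rate = searched.mean()  # mean == fraction of True values
```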
Group by gender:
```python
ri.groupby('driver_gender').search_conducted.mean()
```
output:
```
driver_gender
...
```
Does that prove anything? No.
- Causation is difficult to conclude, so focus on relationships.
- Include all relevant factors when studying a relationship.
Let's look at the details:
```python
ri.groupby(['driver_gender', 'violation']).search_conducted.mean()
```
output:
```
driver_gender  violation
...
```
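Grouping by a list of columns yields one statistic per combination of keys. A minimal sketch with a made-up four-row frame:

```python
import pandas as pd

# Tiny invented frame with the three columns used above
ri = pd.DataFrame({
    "driver_gender": ["M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment"],
    "search_conducted": [True, False, False, True],
})

# One search rate per (gender, violation) pair, as a MultiIndex Series
rates = ri.groupby(["driver_gender", "violation"]).search_conducted.mean()
```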
4. Why is search_type missing so often?
```python
ri.search_conducted.value_counts()
```
output:
```
False    88545
...
```
search_conducted indicates that a search was performed, i.e. the driver was searched by the police.
```python
ri.search_type.value_counts()
```
output:
```
Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Probable Cause,Protective Frisk                               33
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Incident to Arrest,Inventory,Protective Frisk                 11
Inventory,Protective Frisk                                    11
Protective Frisk,Reasonable Suspicion                         11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 4
Inventory,Probable Cause,Reasonable Suspicion                  2
Inventory,Probable Cause,Protective Frisk                      2
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Name: search_type, dtype: int64
```
search_type is the reason the search was performed, i.e. why the police searched the car.
```python
ri[ri.search_conducted == False].search_type.value_counts()
```
output:
```
Series([], Name: search_type, dtype: int64)
```
It returns an empty Series. Why? Because there is no value to count: NaN is not a value. If that's not what you want, add dropna=False to the value_counts() call, like this:
```python
ri[ri.search_conducted == False].search_type.value_counts(dropna=False)
```
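The dropna behavior is easy to see on a toy Series where most values are missing (invented values, mimicking stops with no search):

```python
import pandas as pd
import numpy as np

# Toy search_type column: NaN marks stops where no search happened
search_type = pd.Series(["Consent", np.nan, "Probable Cause", np.nan, np.nan])

visible = search_type.value_counts()                   # NaN rows silently skipped
with_missing = search_type.value_counts(dropna=False)  # NaN gets its own row
```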
5. During a search, how often is the driver frisked?
How do you check whether a word appears in a string in Python? You might use `'a' in b`; that's the Python way. The pandas way looks like this: for each row, str.contains checks whether the string contains the pattern, returning True if it does and False if it doesn't (missing values stay NaN).
```python
ri['frisk'] = ri.search_type.str.contains('Probable Cause')
```
output:
```
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
...
```
```python
ri.frisk.value_counts(dropna=False)
```
output:
```
NaN    88545
...
```
```python
ri.frisk.sum()  # 1182
```
The mean is calculated as frisked / (frisked + unfrisked), and it excludes the missing values.
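How str.contains interacts with missing values can be sketched on a toy Series (invented values):

```python
import pandas as pd
import numpy as np

# Toy search_type values; NaN rows are stops with no search
search_type = pd.Series([
    "Protective Frisk", np.nan,
    "Probable Cause,Protective Frisk", "Consent",
])

# True/False per row; missing values stay NaN
frisk = search_type.str.contains("Protective Frisk")

hits = frisk.sum()    # NaN is skipped, so this counts only the True rows
rate = frisk.mean()   # likewise computed over the non-missing rows only
```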
6. Which year had the least number of stops?
Method 1: use str.slice
```python
ri.stop_date.str.slice(0, 4).value_counts()
```
output:
```
2012    10970
...
```
The stop_date format is 2005-01-02, so slice(0, 4) returns 2005; a value_counts then gives the number of stops in each year.
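The slicing step can be checked on a few invented date strings:

```python
import pandas as pd

# Toy stop_date strings in YYYY-MM-DD format
stop_date = pd.Series(["2005-01-02", "2005-03-14", "2012-11-27"])

years = stop_date.str.slice(0, 4)   # characters 0..3: the year part
per_year = years.value_counts()
```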
Method 2: Use the datetime
```python
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)
```
cat is a string method that concatenates two strings; the sep argument is the separator inserted between them. pd.to_datetime converts the concatenated string to a datetime value, which we add as a new column on ri.
So, after that, how to get the year value?
```python
ri.stop_datetime.dt.year.value_counts()
```
Datetime Series have a .dt accessor, which exposes .year, .day, .hour, and so on.
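The whole pipeline, str.cat, then pd.to_datetime, then the .dt accessor, fits in a few lines on toy data (invented dates and times):

```python
import pandas as pd

# Toy date and time strings
dates = pd.Series(["2005-01-02", "2005-10-01", "2012-11-27"])
times = pd.Series(["01:55", "14:10", "20:15"])

combined = dates.str.cat(times, sep=" ")   # "2005-01-02 01:55", ...
stop_datetime = pd.to_datetime(combined)   # parse into a datetime Series

years = stop_datetime.dt.year   # .dt exposes year, hour, day, ...
hours = stop_datetime.dt.hour
```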
7. How does drug activity change by time of day?
Relevant columns: stop_datetime, drugs_related_stop
Let's be specific: for every hour of the day, what is the rate of drug-related stops?
```python
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean()
```
output:
```
stop_datetime
...
```
Show it!
```python
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()
```
output:
If you want the notebook to display the plot automatically, make sure you have imported the package:
```python
import matplotlib.pyplot as plt
# in a Jupyter notebook, also enable inline display:
%matplotlib inline
```
Then the plot will appear automatically when you call .plot().
Lessons:
- Consider removing chunks of data that may be biased
- Use the datetime data type for dates and times
8. Do most stops occur at night?
Let's do this:
```python
ri.stop_datetime.dt.hour.value_counts().plot()
```
You will get a misleading plot, because the points are drawn in value_counts order (descending by count) rather than by the index (here, the hour), so sort by index first:
```python
ri.stop_datetime.dt.hour.value_counts().sort_index().plot()
```
output:
And here is a more detailed version:
But how should we define night? In my view 22:00 to 4:00 is night, so I want to filter on those hours. If you have categories you want to apply to the hours, you can define a function and map it over them.
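One compact way to express the "22:00 to 4:00 is night" filter (a sketch on invented timestamps, using hour.isin rather than a mapped function):

```python
import pandas as pd

# Toy stop timestamps; night defined as hours 22-23 and 0-4, as in the text
stop_datetime = pd.to_datetime(pd.Series([
    "2005-01-02 01:55", "2005-01-02 11:30", "2005-01-02 23:10",
]))

night_hours = [22, 23, 0, 1, 2, 3, 4]
is_night = stop_datetime.dt.hour.isin(night_hours)

night_share = is_night.mean()   # fraction of stops that happened at night
```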
9. Find the bad data in the stop_duration column and fix it
Check the value counts:
```python
ri.stop_duration.value_counts()
```
output:
```
0-15 Min    69543
...
```
We want to change the stop_duration values equal to '1' or '2' to 'NAN'. How about this?
```python
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NAN'
```
That looks like it makes sense, but check the value counts: it doesn't work.
```
0-15 Min    69543
...
```
Why? Actually, you may have gotten a warning:
> A value is trying to be set on a copy of a slice from a DataFrame.
You should use loc:
```python
ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'), 'stop_duration'] = 'NAN'
```
check again:
```
0-15 Min    69543
...
```
Note the difference between these two usages. If you want to learn more about loc, see the author's separate video on the topic (YouTube link).
But now you have two string-valued 'NAN's, and you want a real NaN. Okay, let's fix it:
```python
import numpy as np

ri.loc[ri.stop_duration == 'NAN', 'stop_duration'] = np.nan
```
There you go!
```
0-15 Min     69543
16-30 Min    13635
30+ Min       3228
Name: stop_duration, dtype: int64
```
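The loc-versus-chained-indexing point can be checked on a throwaway frame (made-up values, not the real data):

```python
import pandas as pd
import numpy as np

# Toy frame containing the bad '1'/'2' codes from the text
ri = pd.DataFrame({"stop_duration": ["0-15 Min", "1", "2", "30+ Min"]})

# .loc writes into the frame itself; chained indexing like
# ri[mask].stop_duration = ... would write to a temporary copy instead
mask = (ri.stop_duration == "1") | (ri.stop_duration == "2")
ri.loc[mask, "stop_duration"] = np.nan
```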
Plotting
Import the packages
```python
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
```
Load the data
```python
ri = pd.read_csv("C:/Users/Fan.ZhangFZ/Desktop/police.csv")
```
Age distribution
```python
ageinfo = ri.driver_age.value_counts()
```
Stop datetime
```python
ri['stoped_datetime'] = pd.to_datetime(ri.stop_date.str.cat(ri.stop_time, sep=' '))
```
Race distribution
```python
driver_race = ri.driver_race.value_counts()
```
Share of speeding drivers by gender
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts().plot.pie(
    figsize=(8, 8), labels=['Male', 'Female'], autopct='%.2f')
```
Violation types for men
```python
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8), autopct='%.2f')
```
Violation types for women
```python
ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8), autopct='%.2f')
```
Combined view
```python
data = ri.groupby(['driver_gender']).violation.value_counts(normalize=True) * 100
```
Search rate for male and female drivers
```python
data = ri.groupby('driver_gender').search_conducted.mean()
```
Search type breakdown
```python
search_type = ri.search_type.value_counts()
```
Which year had the most stops?
```python
ri.stop_date.str.slice(0, 4).value_counts().sort_index().plot.barh(figsize=(12, 8), color='y')
```
At which hour of the day is the drug-related stop rate highest?
```python
drug_hour_rate = ri.groupby(ri.stoped_datetime.dt.hour).drugs_related_stop.mean() * 100
```
Vocabulary
- search: here it means a police search of the driver or vehicle
- hypothetical: supposed, assumed for the sake of argument
- slick way: a clever little trick
- registration searched: a registration/license-plate violation
- conducted: carried out, performed