Scott's Blog

Learning keeps the mind from rigidity; knowledge keeps it from confusion.


Analyzing U.S. Police Stop-and-Search Open Data with Pandas

I stumbled across this pandas video tutorial on YouTube through a recommendation by Wang Shuyi, spent some time working through it, and these notes are the result. Since the course is in English with no Chinese subtitles, I didn't bother writing my notes in Chinese either.


Prerequisites

The tutorial will not cover pandas basics. If you are new to pandas or just need a refresher, I recommend watching some videos from my free pandas course. Alternatively, you can review all of the code from my pandas course in this Jupyter notebook.

References

Getting started

Download the data

Original data

Data used in the tutorial

About the data

  • For each state in the dataset, they provide data in three formats (we use the CSV format).
  • If a column cannot be computed using the data a state has provided, it is set to NA.
  • For several fields (e.g., driver_race) they include a "raw" column which records the original data values from which the standardized values are inferred.

We include the raw columns because our data processing pipeline is extensive, requiring judgment calls and subjective decisions.

Each column, its meaning, and an example value:

  • id: The unique ID we assign to each stop. Contains the state and year. Example: VT-2011-00012
  • state: The two-letter code for the state in which the stop occurred. Example: VT
  • stop_date: The date of the stop, in YYYY-MM-DD format. Some states do not provide the exact stop date: for example, they only provide the year or quarter in which the stop occurred. For these states, stop_date is set to the date at the beginning of the period: for example, January 1 if only the year is provided. Example: 2011-11-27
  • stop_time: The 24-hour time of the stop, in HH:MM format. Example: 20:15
  • location_raw: The original data value from which we compute the county (or comparably granular location) in which the stop occurred. Not in a standardized format across states. Example: Winooski
  • county_name: The standardized name of the county in which the stop occurred. Example: Chittenden County
  • county_fips: The standardized 5-digit FIPS code in which the stop occurred. Example: 50007
  • district: In several states (e.g., Illinois) the stop county cannot be inferred, but a comparably granular location can. This comparably granular location is stored in the district column. Most states do not have this column. Example: ILLINOIS STATE POLICE 01
  • fine_grained_location: Any higher-resolution data about where the stop occurred: e.g., milepost or address. Not standardized across states. Example: 90400 I 89 N; EXIT 15 MM90/40
  • police_department: The police department or agency that made the stop. Not in a standard format across states. Example: WILLISTON VSP
  • driver_gender: The driver's gender, as recorded by the trooper. M, F, or NA. Example: M
  • driver_age_raw: The original data value from which we compute the driver's age when they were stopped. May be age, birth year, or birth date. Not in a standard format across states. Example: 1988
  • driver_age: The driver's age when they were stopped. Set to NA if less than 15 or greater than or equal to 100. Example: 23
  • driver_race_raw: The original data value from which the driver's standardized race is computed. Not in a standard format across states. Example: African American
  • driver_race: The standardized driver race. Possible values are White, Black, Hispanic, Asian, Other, and NA, with NA denoting values which are unknown. Asian refers to Asian, Pacific Islander, and Indian. Native Americans/American Indians are included in the "other" category. Anyone with Hispanic ethnicity is classified as Hispanic, regardless of their recorded race. Example: Black
  • violation_raw: The violation committed by the driver, in the language of the original data. Not in a standard format across states. Some stops have multiple violations. Example: Speeding (10–19 MPH Over Prima Facie Limit *)
  • violation: The violation committed by the driver, standardized into categories which are consistent across states. Example: Speeding
  • search_conducted: A TRUE/FALSE value indicating whether a search was performed. Example: TRUE
  • search_type_raw: The justification for the search, in the language of the original data. NA if no search was performed. Not in a standard format across states. Some states have multiple justifications for a search. Example: CONSENT SEARCH CONDUCTED
  • search_type: The normalized justification for the search. Where possible, this is standardized into categories which are consistent across states. For example, if something is clearly a consent search, search_type is referred to as "Consent". Example: Consent
  • contraband_found: A TRUE/FALSE value indicating whether a search was performed and contraband was found. FALSE if no search was performed. Example: TRUE
  • stop_outcome: The outcome of the stop. Many states have idiosyncratic outcomes, for example "CHP 215" in California, so this column is not standardized across states. "Citation" and "Warning" are the values which occur most commonly across states. If the stop has multiple outcomes, the most severe outcome is used. For example, if a stop resulted in a citation and a warning, stop_outcome would be "Citation". Example: Citation
  • is_arrested: A TRUE/FALSE value indicating whether an arrest was made. Example: TRUE

Know the data

pd.read_csv()      # read a CSV file
df.head()          # view the first rows
df.tail()          # view the last rows
df.dtypes          # view the column data types
df.isnull().sum()  # count null values per column

Import the data and check the dtypes:

ri = pd.read_csv("/Users/wittyfans/Desktop/pycon-2018-tutorial/police.csv")
ri.dtypes

Output:

stop_date                     object
stop_time                     object
driver_gender                 object
driver_age_raw               float64
driver_age                   float64
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                   object
stop_duration                 object
drugs_related_stop              bool
frisk                         object
stop_datetime         datetime64[ns]
dtype: object

This shows how many columns there are and each column's data type, which gives a basic picture of the data.

(Figures shown later in this post: age distribution, race breakdown, drug-related stops by time of day, top ten violation types for men and for women, violation types compared by gender, violations found per year, and violations by hour of day and year.)

Handle the data

1. Remove the column that only contains missing values

Drop column

ri.drop('county_name', axis='columns', inplace=True)

# axis='columns' means a column operation; axis=1 is equivalent.
# inplace=True modifies the data in place, equivalent to:

ri = ri.drop('county_name', axis=1)

There is another way to drop a column:

del ri['columns_name']

You can do it that way and that's fine, but it's not the pandas way.

Tips:

  • Pay attention to default arguments
  • Check your work
  • There is more than one way to do everything in pandas

2. Do men or women speed more often?

Relevant columns: driver_gender, violation

ri[ri.violation == "Speeding"].driver_gender.value_counts() 

output:

M    32979
F    15482
Name: driver_gender, dtype: int64
ri[ri.violation == "Speeding"].driver_gender.value_counts(normalize=True)  # given a speeding stop, how likely is the driver male vs. female?

output:

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

Looking at it the other way around, for women:

ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True)

output:

Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64

For men:

ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)

output:

Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64

Combined into a single command:

ri.groupby('driver_gender').violation.value_counts(normalize=True)

output:

driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64

When you write a groupby, just think "for each value" (here: for each gender).

  • count(): returns the number of non-null values
  • value_counts(): counts the occurrences of each unique value
  • normalize=True: returns the relative frequencies of the unique values instead of counts
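A quick sketch of the difference on a made-up Series (toy data, not from the dataset):

```python
import pandas as pd

s = pd.Series(['M', 'F', 'M', None, 'M'])

n_valid = s.count()                     # 4: the None is excluded
counts = s.value_counts()               # M appears 3 times, F once
freqs = s.value_counts(normalize=True)  # M 0.75, F 0.25
```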

3. Does gender affect who gets searched during a stop?

Relevant columns: driver_gender, search_conducted

ri.search_conducted.value_counts(normalize=True)

output:

False    0.965163
True     0.034837
Name: search_conducted, dtype: float64

How does that relate to the value_counts result above? If the Series is boolean, you can always take a sum to count the True values, or a mean to get the fraction of True values; that simplifies your code.
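A minimal sketch of the sum/mean trick on a made-up boolean Series:

```python
import pandas as pd

searched = pd.Series([True, False, False, True, False])

n_searches = searched.sum()    # 2: counts the True values
search_rate = searched.mean()  # 0.4: fraction of True values
```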

ri.search_conducted.mean()

output:

0.03483720473942948

Group by gender:

ri.groupby('driver_gender').search_conducted.mean()

output:

driver_gender
F    0.020033
M    0.043326
Name: search_conducted, dtype: float64

Does that prove anything? No:

  • Causation is difficult to conclude, so focus on relationships.
  • Include all relevant factors when studying a relationship.

Let's see details:

ri.groupby(['driver_gender','violation']).search_conducted.mean()

output:

driver_gender  violation          
F              Equipment              0.042622
               Moving violation       0.036205
               Other                  0.056522
               Registration/plates    0.066140
               Seat belt              0.012598
               Speeding               0.008720
M              Equipment              0.070081
               Moving violation       0.059831
               Other                  0.047146
               Registration/plates    0.110376
               Seat belt              0.037980
               Speeding               0.024925
Name: search_conducted, dtype: float64

4. Why is search_type missing so often?

ri.search_conducted.value_counts()

output:

False    88545
True      3196
Name: search_conducted, dtype: int64

search_conducted means a search was performed, i.e. the driver was searched by the police.

ri.search_type.value_counts()

output:

Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Probable Cause,Protective Frisk                               33
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Incident to Arrest,Inventory,Protective Frisk                 11
Inventory,Protective Frisk                                    11
Protective Frisk,Reasonable Suspicion                         11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 4
Inventory,Probable Cause,Reasonable Suspicion                  2
Inventory,Probable Cause,Protective Frisk                      2
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Name: search_type, dtype: int64

search_type is the reason for the search, i.e. why the police searched the car.

ri[ri.search_conducted == False].search_type.value_counts()

output:

Series([], Name: search_type, dtype: int64)

It returns an empty Series. Why? Because there are no values to count: NaN is not a value. If that's not what you want, add dropna=False to the value_counts() method, like this:

ri[ri.search_conducted == False].search_type.value_counts(dropna=False)
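The effect of dropna=False, sketched on a made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['Consent', np.nan, np.nan])

without_na = s.value_counts()           # only 'Consent' is counted
with_na = s.value_counts(dropna=False)  # NaN gets its own row, with count 2
```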

5. During a search, how often is the driver frisked?

How do you check whether a string contains a substring in Python? You might write 'a' in b; that's the Python way. The pandas way looks like this:

ri.search_type.str.contains('Probable Cause')

For each row, this returns True if the substring appears in the value and False if it does not; missing values stay NaN.
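The NaN-propagation behavior on a made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['Protective Frisk', 'Incident to Arrest', np.nan])
hits = s.str.contains('Frisk')  # True, False, NaN: the missing value stays missing
```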

ri['frisk'] = ri.search_type.str.contains('Probable Cause')
ri.frisk

output:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
...

ri.frisk.value_counts(dropna=False)

output:

NaN      88545
False     2014
True      1182
Name: frisk, dtype: int64
ri.frisk.sum()        # -> 1182
ri.frisk.mean()       # -> 0.36983729662077597
1182 / (1182 + 2014)  # -> 0.36983729662077597

The mean is calculated as frisked / (frisked + unfrisked), and it excludes the missing values.
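The NaN exclusion can be checked on a made-up Series:

```python
import pandas as pd
import numpy as np

frisk = pd.Series([True, False, np.nan, True])
rate = frisk.mean()  # 2 / 3: the NaN row is excluded from both numerator and denominator
```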

6. Which year had the least number of stops?

Method 1: Use the str.slice

ri.stop_date.str.slice(0,4).value_counts()

output:

2012    10970
2006    10639
2007     9476
2014     9228
2008     8752
2015     8599
2011     8126
2013     7924
2009     7908
2010     7561
2005     2558

The stop_date format is 2005-01-02, so slice(0, 4) returns '2005'; a value_counts then gives the number of stops in each year.
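The same idea on a few made-up dates:

```python
import pandas as pd

dates = pd.Series(['2005-01-02', '2005-03-15', '2012-07-04'])
years = dates.str.slice(0, 4)   # '2005', '2005', '2012'
per_year = years.value_counts() # stops per year
```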

Method 2: Use the datetime

combined = ri.stop_date.str.cat(ri.stop_time,sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)


Use the pd.to_datetime method to convert the concatenated string to a datetime value, then add it as a new column in ri. str.cat is a string method that concatenates two strings; the sep argument is the separator inserted between them.

So, after that, how to get the year value?

ri.stop_datetime.dt.year.value_counts()

A datetime Series has a .dt accessor, from which you can get .year, .day, etc.
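The whole round trip on two made-up timestamps:

```python
import pandas as pd

combined = pd.Series(['2005-01-02 01:55', '2012-07-04 20:15'])
stamps = pd.to_datetime(combined)  # datetime64[ns] Series

years = stamps.dt.year  # 2005, 2012
hours = stamps.dt.hour  # 1, 20
```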

7. How does drug activity change by time of day?

Let's be specific: for every hour of the day, what is the rate of drugs_related_stop?

ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean()

output:

stop_datetime
0     0.019728
1     0.013507
2     0.015462
3     0.017065
4     0.011811
5     0.004762
6     0.003040
7     0.003281
8     0.002687
9     0.006288
10    0.005714
11    0.006976
12    0.004467
13    0.010326
14    0.007810
15    0.006416
16    0.005723
17    0.005517
18    0.010148
19    0.011596
20    0.008084
21    0.013342
22    0.013533
23    0.016344
Name: drugs_related_stop, dtype: float64

show it!

ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()

output:

If you want the notebook to display the image automatically, make sure you have imported the package:

import matplotlib.pyplot as plt
%matplotlib inline

Then the image will show automatically whenever you call .plot().

Lessons:

  • Consider removing chunks of data that may be biased
  • Use the datetime data type for dates and times

8. Do most stops occur at night?

Let's do this:

ri.stop_datetime.dt.hour.value_counts().plot()

You will get a strange-looking plot, because the line follows the value_counts order (by count), not the index (here, the hour), so you need to sort by the index first:
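The difference in ordering can be seen on a made-up Series of hours (toy data, not from the dataset):

```python
import pandas as pd

hours = pd.Series([23, 23, 23, 0, 0, 1])
by_count = hours.value_counts()              # index ordered by count: 23, 0, 1
by_hour = hours.value_counts().sort_index()  # index ordered by hour: 0, 1, 23
```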

ri.stop_datetime.dt.hour.value_counts().sort_index().plot()

output:

And here is a more detailed version.

But how do we define night? In my view 22:00 to 4:00 is night; let's filter on the hours:


ri[(ri.stop_datetime.dt.hour > 4) & (ri.stop_datetime.dt.hour < 22)].shape

If you have categories you want to apply to the hours, you can define a function and map it over the hours.
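A minimal sketch of that mapping, with a hypothetical helper (day_or_night is my own name, not from the tutorial) and 22:00 through 4:59 counted as night:

```python
import pandas as pd

def day_or_night(hour):
    # label an hour as 'night' (22:00 through 04:59) or 'day'
    return 'night' if hour >= 22 or hour < 5 else 'day'

hours = pd.Series([23, 3, 10, 16, 22])
period = hours.map(day_or_night)  # night, night, day, day, night
```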

9. Find the bad data in the stop_duration column and fix it

Check this value counts:

ri.stop_duration.value_counts()

output:

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
2                1
1                1
Name: stop_duration, dtype: int64

We want to turn the stop_duration values equal to '1' or '2' into NaN. How?

ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NAN'

That looks like it makes sense, but check the value counts: it didn't work.

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
2                1
1                1
Name: stop_duration, dtype: int64

Why? Actually, you may have gotten a warning:

A value is trying to be set on a copy of a slice from a DataFrame.

You should use loc instead.

ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'),'stop_duration'] = 'NAN'

check again:

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
NAN              2
Name: stop_duration, dtype: int64
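The same loc pattern, sketched on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'stop_duration': ['0-15 Min', '1', '2', '30+ Min']})

# loc takes a boolean row mask plus a column label, and assigns in place
df.loc[(df.stop_duration == '1') | (df.stop_duration == '2'), 'stop_duration'] = 'NAN'
```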

Note the difference between these two usages. If you want to learn more about loc, the same author has a separate video on it on YouTube.

But now you have two string 'NAN' values, and you want real NaN values. Okay, let's fix that:

import numpy as np
ri.loc[ri.stop_duration == 'NAN','stop_duration'] = np.nan

there you go!

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
Name: stop_duration, dtype: int64
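As an aside, a single Series.replace call can do both steps at once; a sketch on a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'stop_duration': ['0-15 Min', '1', '2']})
# replace both bad strings with a real NaN in one pass
df['stop_duration'] = df.stop_duration.replace(['1', '2'], np.nan)
```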

Plotting

Import the packages

from datetime import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Load the data

ri = pd.read_csv("C:/Users/Fan.ZhangFZ/Desktop/police.csv")
ri.drop('county_name',axis='columns',inplace=True)
ri.dtypes

Age distribution

ageinfo = ri.driver_age.value_counts()

fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(1,1,1)
ax.bar(ageinfo.index,ageinfo.values,color='g')
ax.set_xlabel('Driver age')
ax.set_ylabel('Number of driver')
xticks = ax.set_xticks(np.arange(15,100,5))
plt.show()

Stop times

stoped_datetime = ri.stop_date.str.cat(ri.stop_time,sep=' ')
ri['stoped_datetime'] = pd.to_datetime(stoped_datetime)
year_counts = ri.stoped_datetime.dt.year.value_counts().sort_index()
hour_counts = ri.stoped_datetime.dt.hour.value_counts().sort_index()

fig = plt.figure(figsize=(12,14))
ax1 = fig.add_subplot(2,1,1)
ax1.plot(year_counts.index,year_counts.values,marker='o',color='y',linestyle='--')
ax1.set_xlabel('Years')
ax1.set_ylabel('Search Count')
ax1.set_title("Search count in different years")

ax2 = fig.add_subplot(2,1,2)
ax2.set_xlabel('Hours')
ax2.set_ylabel('Search Count')
ax2.plot(hour_counts.index,hour_counts.values,marker='o',color='b',linestyle='--')
ax2.set_title("Search count in different hours")
plt.show()

Race breakdown

driver_race = ri.driver_race.value_counts()
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(1,1,1)
ax.bar(driver_race.index,driver_race.values)

plt.show()

Speeding stops by gender

ri[ri.violation == "Speeding"].driver_gender.value_counts().plot.pie(figsize=(8, 8),labels=['Male', 'Female'],autopct='%.2f')
plt.show()

Violation types for men


ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8),autopct='%.2f')
plt.show()

Violation types for women

ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8),autopct='%.2f')
plt.show()

Combined view

data = ri.groupby(['driver_gender']).violation.value_counts(normalize=True)*100
fig, ax = plt.subplots(figsize=(15,7))
# use unstack()
data.unstack().plot.bar(ax=ax,stacked=True)
plt.show()

Search rate for male and female drivers

data = ri.groupby('driver_gender').search_conducted.mean()
fig,ax = plt.subplots(figsize=(5,7))
data.plot.pie(ax=ax,autopct='%.2f')
plt.show()

Search type breakdown

search_type = ri.search_type.value_counts()
fig,ax = plt.subplots(figsize=(6,10))
search_type.plot.barh(ax=ax,color='r')
plt.show()

Which year had the most violations?

ri.stop_date.str.slice(0,4).value_counts().sort_index().plot.barh(figsize=(12,8),color='y')
plt.show()

What time of day has the highest rate of drug-related stops?

drug_hour_rate = ri.groupby(ri.stoped_datetime.dt.hour).drugs_related_stop.mean() * 100
fig,ax = plt.subplots(figsize=(15,7))
xticks = ax.set_xticks(np.arange(0,24,1))
ax.set_title('Rate for drugs related stop in a day')
ax.set_ylabel('Rate,%')
ax.set_xlabel('Hours')
drug_hour_rate.plot(ax=ax)
plt.show()

Vocabulary

  • search: here, a search conducted by the police
  • hypothetical: an assumption
  • slick way: a clever little trick
  • registration searched: a license-plate violation
  • conducted: carried out, performed