A chance recommendation from Wang Shuyi pointed me to this pandas video tutorial on YouTube. I spent some time working through it, and these notes are the result. The course is in English with no Chinese subtitles, so I kept my notes in English as well.
Using pandas to analyze traffic-stop data released by US state police
Prerequisites
The tutorial will not cover pandas basics. If you are new to pandas or just need a refresher, I recommend watching some videos from my free pandas course. Alternatively, you can review all of the code from my pandas course in this Jupyter notebook.
References
Getting started
Download the data
The original data
The data used in the tutorial
About the data
- For each state in the dataset, they provide data in three formats (we use the CSV format).
- If a column cannot be computed using the data a state has provided, it is set to NA.
- For several fields (e.g., driver_race) we include a "raw" column which records the original data values from which we infer standardized values
We include the raw columns because our data processing pipeline is extensive, requiring judgment calls and subjective decisions.
Column name | Column meaning | Example value |
--- | --- | --- |
id | The unique ID we assign to each stop. Contains the state and year. | VT-2011-00012 |
state(州) | The two-letter code for the state in which the stop occurred. | VT |
stop_date(被搜查的日期) | The date of the stop, in YYYY-MM-DD format. Some states do not provide the exact stop date: for example, they only provide the year or quarter in which the stop occurred. For these states, stop_date is set to the date at the beginning of the period: for example, January 1 if only year is provided. | 2011-11-27 |
stop_time(被警察搜查时的时间) | The 24-hour time of the stop, in HH:MM format. | 20:15 |
location_raw | The original data value from which we compute the county (or comparably granular location) in which the stop occurred. Not in a standardized format across states. | Winooski |
county_name | The standardized name of the county in which the stop occurred. | Chittenden County |
county_fips | The standardized 5-digit FIPS code in which the stop occurred. | 50007 |
district | In several states (e.g., Illinois) the stop county cannot be inferred, but a comparably granular location can. This comparably granular location is stored in the district column. Most states do not have this column. | ILLINOIS STATE POLICE 01 |
fine_grained_location | Any higher-resolution data about where the stop occurred: e.g., milepost or address. Not standardized across states. | 90400 I 89 N; EXIT 15 MM90/40 |
police_department | The police department or agency that made the stop. Not in a standard format across states. | WILLISTON VSP |
driver_gender | The driver’s gender, as recorded by the trooper. M, F, or NA. | M |
driver_age_raw | The original data value from which we compute the driver’s age when they were stopped. May be age, birth year, or birth date. Not in a standard format across states. | 1988 |
driver_age | The driver’s age when they were stopped. Set to NA if less than 15 or greater than or equal to 100. | 23 |
driver_race_raw | The original data value from which the driver’s standardized race is computed. Not in a standard format across states. | African American |
driver_race | The standardized driver race. Possible values are White, Black, Hispanic, Asian, Other, and NA, with NA denoting values which are unknown. Asian refers to Asian, Pacific Islander, and Indian. Native Americans/American Indians are included in the "other" category. Anyone with Hispanic ethnicity is classified as Hispanic, regardless of their recorded race. | Black |
violation_raw | The violation committed by the driver, in the language of the original data. Not in a standard format across states. Some stops have multiple violations. | Speeding (10–19 MPH Over Prima Facie Limit *) |
violation | The violation committed by the driver, standardized into categories which are consistent across states. | Speeding |
search_conducted(是否被搜查) | A TRUE/FALSE value indicating whether a search was performed. | TRUE |
search_type_raw | The justification for the search, in the language of the original data. NA if no search was performed. Not in a standard format across states. Some states have multiple justifications for a search. | CONSENT SEARCH CONDUCTED |
search_type | The normalized justification for the search. Where possible, this is standardized into categories which are consistent across states. For example, if something is clearly a consent search, search_type is referred to as “Consent”. | Consent |
contraband_found | A TRUE/FALSE value indicating whether a search was performed and contraband was found. FALSE if no search was performed. | TRUE |
stop_outcome | The outcome of the stop. Many states have idiosyncratic outcomes — for example, “CHP 215” in California — so this column is not standardized across states. “Citation” and “Warning” are the values which occur most commonly across states. If the stop has multiple outcomes, the most severe outcome is used. For example, if a stop resulted in a citation and a warning, stop_outcome would be “Citation”. | Citation |
is_arrested | A TRUE/FALSE value indicating whether an arrest was made. | TRUE |
- FIPS: Federal Information Processing Standards
- VT: the two-letter code for the US state of Vermont
- M, F: the recorded gender; M for male, F for female
- race: not "speed" or "racing" here, but the driver's race; the reference categories are listed below and may differ from those in the data (see the link for more)
- American Indian or Alaska Native
- Asian
- Black or African American
- Native Hawaiian or Other Pacific Islander
- White
- violation: the traffic violation committed
- NaN means the value is missing or irrelevant for that data row.
Know the data
```python
pd.read_csv()  # read a CSV file
```
Import the data and check the dtypes:
```python
import pandas as pd

ri = pd.read_csv("/Users/wittyfans/Desktop/pycon-2018-tutorial/police.csv")
ri.dtypes
```
Output:
```
stop_date    object
...
```
This shows every column and its dtype, giving you a basic feel for the data.
Age distribution
Race distribution
Hourly distribution of drug-related stops
Top ten violation types for men
Top ten violation types for women
Violation types compared by gender
Violations caught per year
Violations by hour and year
Handle the data
1. Remove the column that only contains missing values
Drop column
```python
ri.drop('column_name', axis='columns', inplace=True)
```
There is another way to drop a column:
```python
del ri['column_name']
```
You can do it that way and it works, but it's not the pandas way.
Tips:
- Pay attention to default arguments
- Check your work
- There is more than one way to do everything in pandas
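Step 1 above (removing columns that contain only missing values) can be sketched with `dropna`. The tiny frame below is made-up data standing in for `ri`, not the real police.csv:

```python
import pandas as pd
import numpy as np

# Toy frame; county_name is entirely missing (hypothetical values)
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "county_name": [np.nan, np.nan],
    "driver_gender": ["M", "F"],
})

# Drop every column whose values are all missing
df = df.dropna(axis="columns", how="all")
```

Note the `how="all"` default-argument trap: with the default `how="any"`, a single NaN is enough to drop the whole column.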
2. Do men or women speed more often?
Relevant columns: driver_gender, violation
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts()
```
output:
```
M    32979
...
```
normalize=True gives the percentages, i.e. if a stop is for speeding, how likely is the driver to be male or female:
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts(normalize=True)
```
output:
```
M    0.680527
...
```
Looking at it the other way around, for women:
```python
ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True)
```
output:
```
Speeding    0.658500
...
```
For men:
```python
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)
```
output:
```
Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64
```
Combined into a single command:
```python
ri.groupby('driver_gender').violation.value_counts(normalize=True)
```
output:
```
driver_gender  violation
...
```
When you write a groupby, just think "for each value" (here, for each gender).
- count(): returns the number of non-missing values
- value_counts(): counts the occurrences of each unique value in the Series
- normalize=True: when set, value_counts() returns the relative frequencies of the unique values instead of raw counts
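The three behaviors above can be compared side by side on a toy Series (made-up values, not the real data):

```python
import pandas as pd

# Toy violation data for illustration only
violations = pd.Series(
    ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    name="violation",
)

n_rows = violations.count()                       # number of non-missing values
counts = violations.value_counts()                # occurrences of each unique value
shares = violations.value_counts(normalize=True)  # relative frequencies, summing to 1
```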
3. Does gender affect who gets searched during a stop?
Relevant columns: driver_gender, search_conducted
```python
ri.search_conducted.value_counts(normalize=True)
```
output:
```
False    0.965163
...
```
You can get the same result another way. How does that work? If the Series is boolean, you can always take a sum to know the number of True values, or a mean to get the fraction of True values; that simplifies your code.
```python
ri.search_conducted.mean()
```
output:
```
0.03483720473942948
```
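The sum/mean trick on booleans is easy to verify on a throwaway Series (made-up values):

```python
import pandas as pd

# Toy boolean Series standing in for ri.search_conducted
searched = pd.Series([True, False, False, True, False])

n_searches = searched.sum()    # True counts as 1, so sum == number of searches
search_rate = searched.mean()  # mean == fraction of True values
```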
Group by gender:
```python
ri.groupby('driver_gender').search_conducted.mean()
```
output:
```
driver_gender
...
```
Does that prove anything? No.
- Causation is difficult to conclude, so focus on relationships.
- Include all relevant factors when studying a relationship.
Let's look at the details:
```python
ri.groupby(['driver_gender', 'violation']).search_conducted.mean()
```
output:
```
driver_gender  violation
...
```
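Grouping by a list of columns yields one statistic per combination of keys. A minimal sketch with a made-up four-row frame:

```python
import pandas as pd

# Tiny invented frame with the three columns used above
ri = pd.DataFrame({
    "driver_gender": ["M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment"],
    "search_conducted": [True, False, False, True],
})

# One search rate per (gender, violation) pair, as a MultiIndex Series
rates = ri.groupby(["driver_gender", "violation"]).search_conducted.mean()
```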
4. Why is search_type missing so often?
```python
ri.search_conducted.value_counts()
```
output:
```
False    88545
...
```
search_conducted indicates that a search was performed, i.e. the driver was searched by the police.
```python
ri.search_type.value_counts()
```
output:
```
Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Probable Cause,Protective Frisk                               33
Incident to Arrest,Protective Frisk                           33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Incident to Arrest,Inventory,Protective Frisk                 11
Inventory,Protective Frisk                                    11
Protective Frisk,Reasonable Suspicion                         11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 4
Inventory,Probable Cause,Reasonable Suspicion                  2
Inventory,Probable Cause,Protective Frisk                      2
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Name: search_type, dtype: int64
```
search_type is the reason the search was performed, i.e. why the police searched the car.
```python
ri[ri.search_conducted == False].search_type.value_counts()
```
output:
```
Series([], Name: search_type, dtype: int64)
```
It returns an empty Series. Why? Because there is no value to count: NaN is not a value. If that's not what you want, add dropna=False to the value_counts() call, like this:
```python
ri[ri.search_conducted == False].search_type.value_counts(dropna=False)
```
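The dropna behavior is easy to see on a toy Series where most values are missing (invented values, mimicking stops with no search):

```python
import pandas as pd
import numpy as np

# Toy search_type column: NaN marks stops where no search happened
search_type = pd.Series(["Consent", np.nan, "Probable Cause", np.nan, np.nan])

visible = search_type.value_counts()                   # NaN rows silently skipped
with_missing = search_type.value_counts(dropna=False)  # NaN gets its own row
```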
5. During a search, how often is the driver frisked?
How do you check whether a word appears in a string in Python? You might use `'a' in b`; that's the Python way. The pandas way looks like this: for each row, str.contains checks whether the string contains the pattern, returning True if it does and False if it doesn't (missing values stay NaN).
```python
ri['frisk'] = ri.search_type.str.contains('Probable Cause')
```
output:
```
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
...
```
```python
ri.frisk.value_counts(dropna=False)
```
output:
```
NaN    88545
...
```
```python
ri.frisk.sum()  # 1182
```
The mean is calculated as frisked / (frisked + unfrisked), and it excludes the missing values.
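How str.contains interacts with missing values can be sketched on a toy Series (invented values):

```python
import pandas as pd
import numpy as np

# Toy search_type values; NaN rows are stops with no search
search_type = pd.Series([
    "Protective Frisk", np.nan,
    "Probable Cause,Protective Frisk", "Consent",
])

# True/False per row; missing values stay NaN
frisk = search_type.str.contains("Protective Frisk")

hits = frisk.sum()    # NaN is skipped, so this counts only the True rows
rate = frisk.mean()   # likewise computed over the non-missing rows only
```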
6. Which year had the least number of stops?
Method 1: use str.slice
```python
ri.stop_date.str.slice(0, 4).value_counts()
```
output:
```
2012    10970
...
```
The stop_date format is 2005-01-02, so slice(0, 4) returns 2005; a value_counts then gives the number of stops in each year.
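The slicing step can be checked on a few invented date strings:

```python
import pandas as pd

# Toy stop_date strings in YYYY-MM-DD format
stop_date = pd.Series(["2005-01-02", "2005-03-14", "2012-11-27"])

years = stop_date.str.slice(0, 4)   # characters 0..3: the year part
per_year = years.value_counts()
```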
Method 2: Use the datetime
```python
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)
```
cat is a string method that concatenates two strings; the sep argument is the separator inserted between them. pd.to_datetime converts the concatenated string to a datetime value, which we add as a new column on ri.
So, after that, how to get the year value?
```python
ri.stop_datetime.dt.year.value_counts()
```
Datetime Series have a .dt accessor, which exposes .year, .day, .hour, and so on.
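The whole pipeline, str.cat, then pd.to_datetime, then the .dt accessor, fits in a few lines on toy data (invented dates and times):

```python
import pandas as pd

# Toy date and time strings
dates = pd.Series(["2005-01-02", "2005-10-01", "2012-11-27"])
times = pd.Series(["01:55", "14:10", "20:15"])

combined = dates.str.cat(times, sep=" ")   # "2005-01-02 01:55", ...
stop_datetime = pd.to_datetime(combined)   # parse into a datetime Series

years = stop_datetime.dt.year   # .dt exposes year, hour, day, ...
hours = stop_datetime.dt.hour
```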
7. How does drug activity change by time of day?
Relevant columns: stop_datetime, drugs_related_stop
Let's be specific: for every hour of the day, what is the rate of drug-related stops?
```python
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean()
```
output:
```
stop_datetime
...
```
Show it!
```python
ri.groupby(ri.stop_datetime.dt.hour).drugs_related_stop.mean().plot()
```
output:
If you want the notebook to display the plot automatically, make sure you have imported the package:
```python
import matplotlib.pyplot as plt
# in a Jupyter notebook, also enable inline display:
%matplotlib inline
```
Then the plot will appear automatically when you call .plot().
Lessons:
- Consider removing chunks of data that may be biased
- Use the datetime data type for dates and times
8. Do most stops occur at night?
Let's do this:
```python
ri.stop_datetime.dt.hour.value_counts().plot()
```
You will get a misleading plot, because the points are drawn in value_counts order (descending by count) rather than by the index (here, the hour), so sort by index first:
```python
ri.stop_datetime.dt.hour.value_counts().sort_index().plot()
```
output:
And here is a more detailed version:
But how should we define night? In my view 22:00 to 4:00 is night, so I want to filter on those hours. If you have categories you want to apply to the hours, you can define a function and map it over them.
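One compact way to express the "22:00 to 4:00 is night" filter (a sketch on invented timestamps, using hour.isin rather than a mapped function):

```python
import pandas as pd

# Toy stop timestamps; night defined as hours 22-23 and 0-4, as in the text
stop_datetime = pd.to_datetime(pd.Series([
    "2005-01-02 01:55", "2005-01-02 11:30", "2005-01-02 23:10",
]))

night_hours = [22, 23, 0, 1, 2, 3, 4]
is_night = stop_datetime.dt.hour.isin(night_hours)

night_share = is_night.mean()   # fraction of stops that happened at night
```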
9. Find the bad data in the stop_duration column and fix it
Check the value counts:
```python
ri.stop_duration.value_counts()
```
output:
```
0-15 Min    69543
...
```
We want to change the stop_duration values equal to '1' or '2' to 'NAN'. How about this?
```python
ri[(ri.stop_duration == '1') | (ri.stop_duration == '2')].stop_duration = 'NAN'
```
That looks like it makes sense, but check the value counts: it doesn't work.
```
0-15 Min    69543
...
```
Why? Actually, you may have gotten a warning:
> A value is trying to be set on a copy of a slice from a DataFrame.
You should use loc:
```python
ri.loc[(ri.stop_duration == '1') | (ri.stop_duration == '2'), 'stop_duration'] = 'NAN'
```
check again:
```
0-15 Min    69543
...
```
Note the difference between these two usages. If you want to learn more about loc, see the author's separate video on the topic (YouTube link).
But now you have two string-valued 'NAN's, and you want a real NaN. Okay, let's fix it:
```python
import numpy as np

ri.loc[ri.stop_duration == 'NAN', 'stop_duration'] = np.nan
```
There you go!
```
0-15 Min     69543
16-30 Min    13635
30+ Min       3228
Name: stop_duration, dtype: int64
```
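The loc-versus-chained-indexing point can be checked on a throwaway frame (made-up values, not the real data):

```python
import pandas as pd
import numpy as np

# Toy frame containing the bad '1'/'2' codes from the text
ri = pd.DataFrame({"stop_duration": ["0-15 Min", "1", "2", "30+ Min"]})

# .loc writes into the frame itself; chained indexing like
# ri[mask].stop_duration = ... would write to a temporary copy instead
mask = (ri.stop_duration == "1") | (ri.stop_duration == "2")
ri.loc[mask, "stop_duration"] = np.nan
```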
Plotting
Import the packages
```python
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
```
Load the data
```python
ri = pd.read_csv("C:/Users/Fan.ZhangFZ/Desktop/police.csv")
```
Age distribution
```python
ageinfo = ri.driver_age.value_counts()
```
Stop datetime
```python
ri['stoped_datetime'] = pd.to_datetime(ri.stop_date.str.cat(ri.stop_time, sep=' '))
```
Race distribution
```python
driver_race = ri.driver_race.value_counts()
```
Share of speeding drivers by gender
```python
ri[ri.violation == "Speeding"].driver_gender.value_counts().plot.pie(
    figsize=(8, 8), labels=['Male', 'Female'], autopct='%.2f')
```
Violation types for men
```python
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8), autopct='%.2f')
```
Violation types for women
```python
ri[ri.driver_gender == 'F'].violation.value_counts(normalize=True).plot.pie(figsize=(8, 8), autopct='%.2f')
```
Combined view
```python
data = ri.groupby(['driver_gender']).violation.value_counts(normalize=True) * 100
```
Search rate for male and female drivers
```python
data = ri.groupby('driver_gender').search_conducted.mean()
```
Search type breakdown
```python
search_type = ri.search_type.value_counts()
```
Which year had the most stops?
```python
ri.stop_date.str.slice(0, 4).value_counts().sort_index().plot.barh(figsize=(12, 8), color='y')
```
At which hour of the day is the drug-related stop rate highest?
```python
drug_hour_rate = ri.groupby(ri.stoped_datetime.dt.hour).drugs_related_stop.mean() * 100
```
Vocabulary
- search: here it means a police search of the driver or vehicle
- hypothetical: supposed, assumed for the sake of argument
- slick way: a clever little trick
- registration searched: a registration/license-plate violation
- conducted: carried out, performed