> Why use NumPy? Run the same computation on two sets of data, compare the performance, and the answer is clear.

Why Numpy?

import numpy as np

my_arr = np.arange(1000000)

my_list = list(range(1000000))

Now multiply both sequences by 2:

%time for _ in range(10): my_arr2 = my_arr * 2

CPU times: user 20 ms, sys: 50 ms, total: 70 ms
Wall time: 72.4 ms

%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

CPU times: user 760 ms, sys: 290 ms, total: 1.05 s
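Outside IPython, where %time is not available, the same comparison can be sketched with the standard timeit module. The array size and repeat count mirror the snippet above; absolute timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach, as in the %time runs.
t_arr = timeit.timeit(lambda: my_arr * 2, number=10)
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"ndarray: {t_arr:.3f} s, list: {t_list:.3f} s")

# Both approaches compute the same values.
same = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print(same)
```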

Numpy

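Generating random numbers

import numpy as np
data = np.random.randn(2, 3)

The data looks like this:

array([[-0.05094946, -1.54555805, -1.19695135],
       [-1.06169454,  1.13763682,  0.57538678]])

Multiply it by 10 (data * 10). The sample outputs in this section come from separate random draws, so their values do not line up with the array above:

array([[ 9.41882893,  3.20674452, 18.05866858],
       [-7.97835594, -9.56449228, -0.83342424]])

Add the array to itself (data + data):

array([[-2.37617968,  3.45388874, -0.64218591],
       [-2.99399147, -1.11118452, -1.11992404]])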
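To make the arithmetic reproducible, here is a minimal sketch with a seeded generator. np.random.default_rng and the seed 0 are choices made for this illustration, not part of the original example; the point is only that both operations act on every element.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen only for repeatability
data = rng.standard_normal((2, 3))   # same shape as the example above

scaled = data * 10     # every element multiplied by 10
doubled = data + data  # element-by-element sum

# Adding an array to itself is the same as multiplying it by 2.
print(np.allclose(doubled, data * 2))
print(scaled.shape, doubled.shape)
```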

About the data in a NumPy array:

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

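All elements must have the same type.
Every array has a shape attribute, a tuple giving the size of each dimension.
Every array has a dtype attribute describing the type of its elements.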

For the data array above:

data.shape --> (2, 3)
data.dtype --> dtype('float64')
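One way to see the same-type rule in action is to mix types in the input. In this sketch (the variable names are illustrative), NumPy picks a single dtype able to hold every element:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float promotes the whole array
matrix = np.zeros((2, 3))

print(ints.dtype.kind)   # i (signed integer; the exact width is platform dependent)
print(mixed.dtype)       # float64
print(matrix.shape)      # (2, 3)
print(matrix.dtype)      # float64
```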

While it’s not necessary to have a deep understanding of NumPy for many data analytical applications, becoming proficient in array-oriented programming and thinking is a key step along the way to becoming a scientific Python guru.

Creating ndarrays

Create one directly from a list:

data1 = [6,7.5,8,0,1]
arr1 = np.array(data1)
arr1

Nested lists produce a multidimensional array:

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

arr2.shape --> (2, 4)
arr2.ndim --> 2

The ndim attribute gives the number of dimensions of the array.
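Putting the two examples together, a short check of the attributes might look like this. It is only a sketch; arr2.dtype is left out because the default integer width varies by platform.

```python
import numpy as np

arr1 = np.array([6, 7.5, 8, 0, 1])
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr1.ndim, arr1.shape, arr1.dtype)  # 1 (5,) float64
print(arr2.ndim, arr2.shape)              # 2 (2, 4)
print(arr2.size)                          # 8
```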

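NumPy also has convenient functions that create arrays of zeros or ones directly, or an uninitialized array:

np.zeros(10) --> an array of 10 zeros
np.zeros((3,6)) --> a 3 x 6 array of zeros
np.empty((2,3,2)) --> two blocks of 3 rows by 2 columns; the contents are uninitialized and not guaranteed to be zero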
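A small sketch of these constructors follows. np.ones is included for the "ones" case; the name scratch is illustrative. Because np.empty only allocates memory, the example assigns values before reading them.

```python
import numpy as np

zeros = np.zeros((3, 6))
ones = np.ones(4)
scratch = np.empty((2, 3, 2))  # contents are arbitrary until assigned

scratch[:] = 0.0               # initialize before reading
print(zeros.shape, ones.sum(), scratch.shape)
```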

Some other functions for creating ndarrays:

array: Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default
asarray: Convert input to ndarray, but do not copy if the input is already an ndarray
arange: Like the built-in range but returns an ndarray instead of a list
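The difference between array and asarray only shows up when the input is already an ndarray. A minimal sketch (variable names are illustrative):

```python
import numpy as np

original = np.arange(5)

copied = np.array(original)     # np.array copies by default
aliased = np.asarray(original)  # already an ndarray, so no copy is made

original[0] = 99
print(copied[0])            # 0, the copy is unaffected
print(aliased[0])           # 99, same underlying data
print(aliased is original)  # True
```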

For more functions, see Python for Data Analysis, Table 4-1.

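Data types for ndarrays

The data type, or dtype, of an ndarray holds the metadata the array needs to interpret its memory as a particular type of data. You can specify it when creating an array, for example:

arr1 = np.array([1,2,3],dtype=np.float64)
arr2 = np.array([1,2,3],dtype=np.int32)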

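An array can be converted to another dtype with astype:

arr = np.array([1,2,3,4])
arr.dtype
# Output: dtype('int64')

arr = np.array([1,2,3,4])
floatarr = arr.astype(np.float64)
floatarr.dtype
# Output: dtype('float64')

Going the other way, from float to int, the fractional part is truncated (dropped), not rounded.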
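A small check of what astype does to the fractional part when converting floats to integers. This is a sketch; np.rint is shown as one option if rounding is what you actually want, and note that it rounds halves to the nearest even value.

```python
import numpy as np

values = np.array([3.7, -1.2, 2.5, -0.9])

truncated = values.astype(np.int64)         # fractional part dropped
rounded = np.rint(values).astype(np.int64)  # round first, then convert

print(truncated)  # [ 3 -1  2  0]
print(rounded)    # [ 4 -1  2 -1]
```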

Array arithmetic

Here are the basic operations:

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

Multiplication:

arr*arr
# out
array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])

Subtraction:

arr-arr
# out
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

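Every arithmetic operation between arrays of the same shape is applied element by element; keep that in mind and the rest follows. NumPy also supports comparisons: comparing two arrays of the same shape returns an array of True and False values.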
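As a sketch of the element-by-element rule, the example below applies a scalar operation and a comparison. The second array, other, is made up for illustration.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
other = np.array([[0., 4., 1.], [7., 2., 12.]])  # illustrative values

print(1 / arr)      # a scalar operation applies to every element
mask = other > arr  # element-by-element comparison
print(mask)
```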

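Slicing and indexing

Indexing and slicing a NumPy array works much like a Python list: you give a position, or a range of positions, and get back the value or values stored there. Thinking in terms of coordinates makes it straightforward.

You can also assign a new value to every element of a slice:

arr = np.arange(10)
arr[5] --> 5
arr[5:8] --> array([5, 6, 7])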
arr[5:8] = 12
arr --> array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

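NumPy is designed to handle large amounts of data, so a slice is a view on the original array rather than a copy: changes made through the slice show up in the original.

arr = np.arange(10)  # start again from a fresh array
arr_slice = arr[5:8]
arr_slice
# out: array([5, 6, 7])

arr_slice[:] = 9  # [:] assigns to every element of the slice
arr
# out: array([0, 1, 2, 3, 4, 9, 9, 9, 8, 9])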

If you want a copy instead, ndarray provides a copy method:

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
copyed = arr3d.copy()
copyed
# out:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
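The difference between a view and a copy can be checked directly. In this sketch, np.shares_memory reports whether two arrays overlap in memory; the names view and snapshot are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]             # basic slice: a view on arr
snapshot = arr[5:8].copy()  # independent copy

view[:] = 9
print(arr)       # [0 1 2 3 4 9 9 9 8 9]
print(snapshot)  # [5 6 7]
print(np.shares_memory(arr, view), np.shares_memory(arr, snapshot))  # True False
```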

You can slice along two dimensions at once:

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[:]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

arr2d[:2,1:]

array([[2, 3],
       [5, 6]])

arr2d[:,:1]

array([[1],
       [4],
       [7]])

Work through a few examples by hand, using the figure below as a guide, and the pattern becomes clear.
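Looking at the shape of each result is a quick way to build intuition for two-dimensional slicing. A sketch using the same 3 x 3 array:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2d[:2, 1:].shape)  # (2, 2): first two rows, columns 1 onward
print(arr2d[:, :1].shape)   # (3, 1): a slice keeps the axis
print(arr2d[:, 0].shape)    # (3,):   an integer index drops the axis
print(arr2d[:, :0].shape)   # (3, 0): an empty slice selects no columns
```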

Boolean Indexing

Suppose we have an array of names:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

The expression names == 'Bob' compares every element and returns the result as a boolean array:

names == 'Bob'

array([ True, False, False, True, False, False, False], dtype=bool)

If we have a data set that also has 7 rows, we can pass this array of True and False values as an index, and NumPy selects the rows where the value is True.

data = np.random.randn(7,4)

array([[-0.71030181, -0.14900916, -1.15238417, -0.49395683],
       [-0.92601472,  0.88452947, -0.9206763 , -0.43338155],
       [-0.68093622,  0.93612942,  0.03261537,  1.44615091],
       [ 1.40919226, -0.07214425, -0.07973205, -1.01432059],
       [-0.4042085 ,  0.66812768,  0.4715137 ,  0.34981598],
       [ 0.89631112, -0.70534677,  0.44560626,  0.6133761 ],
       [-0.28979691,  0.58481489, -0.06945283, -0.99545537]])

data[names=='Bob']

array([[-0.71030181, -0.14900916, -1.15238417, -0.49395683],
       [ 1.40919226, -0.07214425, -0.07973205, -1.01432059]])

The boolean array must have the same length as the axis it indexes. Older NumPy releases did not fail when the length was wrong, but recent versions raise an IndexError, so check the lengths when using this feature.
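How a wrong-length mask is handled depends on the NumPy version. The sketch below assumes a recent release, where the mismatch raises IndexError; the names short_mask and outcome are illustrative.

```python
import numpy as np

data = np.arange(28).reshape((7, 4))
short_mask = np.array([True, False, True])  # 3 entries for 7 rows

try:
    data[short_mask]
    outcome = "no error"
except IndexError as exc:
    outcome = "IndexError"
    print(exc)

print(outcome)
```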

The boolean selection can also be combined with slicing:

data[names=='Bob',:1]

array([[-0.71030181],
       [ 1.40919226]])

To select everything except Bob:

names != 'Bob'
~(names == 'Bob')

Filters on names can combine several conditions:

cond = names == 'Bob'
cond = (names=='Bob') | (names == 'Will')  # 'Will' must match the capitalization used in names
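When combining conditions, use the element-wise operators | and & (and ~ for negation) with parentheses around each comparison. As this sketch shows, Python's or keyword does not work on arrays because it needs a single truth value:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

either = (names == 'Bob') | (names == 'Will')  # element-wise OR
neither = ~either                              # element-wise NOT

print(either)
print(names[neither])  # ['Joe' 'Joe' 'Joe']

try:
    (names == 'Bob') or (names == 'Will')      # ambiguous truth value
    keyword_ok = True
except ValueError:
    keyword_ok = False

print(keyword_ok)  # False
```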

The same works on data itself:

data[data < 0] = 0

Setting whole rows is just as easy:

data[names != 'Joe'] = 7

array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [-0.92601472,  0.88452947, -0.9206763 , -0.43338155],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.89631112, -0.70534677,  0.44560626,  0.6133761 ],
       [-0.28979691,  0.58481489, -0.06945283, -0.99545537]])
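The two kinds of mask assignment can be run back to back on reproducible data. This is a sketch: the seeded generator is an assumption made here so the result can be checked, and the values differ from the output shown above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
rng = np.random.default_rng(0)       # seed chosen only for repeatability
data = rng.standard_normal((7, 4))

data[data < 0] = 0        # element-wise mask: replace negatives with zero
data[names != 'Joe'] = 7  # row mask: overwrite every non-Joe row

print(data)
```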

Fancy Indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 × 4 array:

arr = np.empty((8,4), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int works instead
for i in range(8):
    arr[i] = i
arr

Select a single value:

arr[3,0]

out: 3

Select multiple rows:

arr[[3,0]]

array([[3, 3, 3, 3],
       [0, 0, 0, 0]])

Now build an 8 x 4 array whose values run in order:

arr = np.arange(32).reshape((8,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

Select the first element of rows 1 and 2, that is, elements (1, 0) and (2, 0):

arr[[1,2],[0,0]]

array([4, 8])

You can also index the selected rows again, for example to pick or reorder their columns:

arr[[1,2]][:]

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

arr[[1,2]][:,[0]]

array([[4],
       [8]])

arr[[1,2]][:,[0,3,2,1]]

array([[ 4,  7,  6,  5],
       [ 8, 11, 10,  9]])

Fancy indexing always copies the data into a new array.
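The sketch below checks the copy behavior and shows np.ix_ as an alternative way to select the same block of rows and columns in a single indexing step. Using np.ix_ is a suggestion made here, not something the original text covers.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

pairs = arr[[1, 2], [0, 0]]                     # elements (1, 0) and (2, 0)
block = arr[[1, 2]][:, [0, 3, 2, 1]]            # rows 1 and 2, columns reordered
same_block = arr[np.ix_([1, 2], [0, 3, 2, 1])]  # same selection in one step

block[:] = -1     # modifies the copy only
print(pairs)      # [4 8]
print(np.array_equal(same_block, [[4, 7, 6, 5], [8, 11, 10, 9]]))  # True
print(arr[1])     # [4 5 6 7]
```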

Transposing Arrays and Swapping Axes

Shape

value = 18
x = 2
y = 9
arr = np.arange(value).reshape((x,y))
arr
# Output
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])

value = 18
x = 3
y = 6
arr = np.arange(value).reshape((x,y))
# Output
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

As long as value == x * y, the array can be reshaped to any (x, y).
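The size rule can be checked with a short sketch. Passing -1 for one dimension asks NumPy to infer it from the total size, and a shape whose product does not match raises ValueError. The names flat and mismatch are illustrative.

```python
import numpy as np

flat = np.arange(18)

a = flat.reshape((2, 9))
b = flat.reshape((3, -1))  # -1 lets NumPy infer the remaining dimension (6)

try:
    flat.reshape((4, 5))   # 4 * 5 != 18
    mismatch = "accepted"
except ValueError:
    mismatch = "ValueError"

print(a.shape, b.shape, mismatch)  # (2, 9) (3, 6) ValueError
```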

Transposing

An example:

arr = np.arange(18).reshape((3,6))

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

arr.T

array([[ 0,  6, 12],
       [ 1,  7, 13],
       [ 2,  8, 14],
       [ 3,  9, 15],
       [ 4, 10, 16],
       [ 5, 11, 17]])

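Transposing comes up constantly in matrix computations, for example together with np.dot, which computes the inner product:

arra = np.array([2,3,0])
arrb = np.array([2,-1,1])
np.dot(arra,arrb)
# Output: 1

How is the inner product computed? The figure below makes it clear.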
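The inner product multiplies matching elements and adds the results: 2*2 + 3*(-1) + 0*1 = 1. The sketch below repeats that by hand, then uses the 3 x 6 array from the transposing example to show where the transpose comes in. The names by_hand and gram are illustrative.

```python
import numpy as np

arra = np.array([2, 3, 0])
arrb = np.array([2, -1, 1])

# Multiply matching elements, then add: 2*2 + 3*(-1) + 0*1 = 1
by_hand = sum(int(x) * int(y) for x, y in zip(arra, arrb))
print(by_hand, np.dot(arra, arrb))  # 1 1

# With 2-D arrays the transpose makes the shapes line up: (6, 3) times (3, 6) gives (6, 6)
arr = np.arange(18).reshape((3, 6))
gram = np.dot(arr.T, arr)
print(gram.shape)                   # (6, 6)
```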

Scott's Blog

NumPy NDArray basic operations