作者:小小明,Pandas数据处理专家,致力于帮助无数数据从业者解决数据处理难题。

这是半年前写的一篇文章,里面涉及的方法可能有些过时,但处理思想仍有较高的参考价值,现在发布到csdn。


有位朋友咨询了我这样一个问题:

image-20200611172140721

然后我最终就通过KNN算法的查找临近节点的函数实现了这个功能,现在分享给大家实现过程。

数据形式

基站数据库数据

import pandas as pd
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="现网基站数据库",
                     usecols=[1, 2, 3])
data
CELL_NAMELONLAT
0LFZ贵港市电信城北局O4109.5912823.11093
1LFZ贵港市电信城北局O5109.5912823.11093
2LFZ贵港市电信城北局O6109.5912823.11093
3LFZ贵港市电信城北局O10109.5912823.11093
4LFZ贵港市电信城北局O49109.5912823.11093
9514LFZ贵港市人民医院外科楼25F楼梯间I5109.5948223.09028
9515LFZ贵港市人民医院妇科楼13F楼梯间I6109.5948223.09028
9516newLFZ贵港市唐人街室分21I109.5964423.09433
9517newLFZ贵港市唐人街室分22I109.5964423.09433
9518newLFZ贵港市港北区锦泰公馆1栋2单元B1F电梯旁B1FIQ20109.5868723.10925

9519 rows × 3 columns

共9519条示例数据。

需找出最近距离的基站,共16条示例数据:

需找出最近距离的基站数据

find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="需找出最近距离的基站",
                     usecols=[1, 2, 3])
find
基站中文名lonlat
0LFZ贵港港北区唐人街1号楼17F弱电井I4109.59644023.094330
1newLFZ桂平社坡镇禄全村二(村西山头)O19110.19028323.353451
2LFZ桂平市高铁站I10110.11202023.318939
3newLFZ桂平大洋镇什字村二(旺冲良)O56109.98027823.112500
4newLFZ桂平社坡镇禄全村二(村西山头)O17110.19028323.353451
5newLFZ桂平社坡镇禄全村二(村西山头)O18110.19028323.353451
6LFZ贵港市塔山部队装甲团O53109.65194223.106775
7newLFZ贵港黄练镇水村O54109.25343023.190430
8LFZ贵港市世纪经典悉尼座4单元12楼线井I11109.58806023.111116
9LFZ贵港黄练镇新谭三中O57109.27475023.180370
10LFZ贵港奇石乡六马村六良屯O52109.63759323.365757
11newLFZ贵港市体育中心西看台4楼I5109.55916023.115079
12newLFZ贵港市体育中心综合馆I2109.55916023.115079
13LFZ桂平市郁江湾7号楼B1F电井外墙 I8110.07241523.381235
14newLFZ贵港山北乡大王村O53109.41129023.353670
15LFZ贵港市园博园主展馆2FI10109.56126423.076138

现在我们的目标就是找出这16个基站每个基站在基站数据库中最近的10个基站。

常规做法是暴力遍历,那么每个基站都要遍历9519次,当前示例数据还勉强可以接受,但是基站数据库规模一旦达到10万以上,那计算起来就耗时很久了。为了提高计算效率,我利用KNN算法的ball_tree快速计算。

注意:sklearn的KNN算法还提供了brute也可以自定义距离函数,但经过实践发现ball_tree的计算速度会更快一些。

使用KNN分类器计算每个基站最近的10个基站

筛选用于训练的经纬度特征数据

从基站数据库中筛选出经纬度特征数据:

# 从基站数据库中筛选出经纬度特征数据,用于给KNN分类器训练
data_fit = data.iloc[:, [1, 2]]
# y本身用于标注每条数据属于哪个类别,但我并不使用KNN的分类功能,所以统一全部标注为类别1
y = [1] * len(data_fit)
data_fit
LONLAT
0109.5912823.11093
1109.5912823.11093
2109.5912823.11093
3109.5912823.11093
4109.5912823.11093
9514109.5948223.09028
9515109.5948223.09028
9516109.5964423.09433
9517109.5964423.09433
9518109.5868723.10925

9519 rows × 2 columns

筛选需要求出最近10个点的的基站的经纬度特征数据:

# 筛选需要求出最近10个点的的基站的经纬度特征数据
find_x = find.iloc[:, [1, 2]]
find_x.head()
lonlat
0109.59644023.094330
1110.19028323.353451
2110.11202023.318939
3109.98027823.112500
4110.19028323.353451

构建KNN分类器

导入KNN分类器:

# 导入KNN分类器
from sklearn.neighbors import KNeighborsClassifier  #KNN
from math import *

创建用于计算两个经纬度距离的函数:

# 创建用于计算两个经纬度距离的函数
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # 经纬度转换成弧度
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # 地球平均半径,6371km
    distance = round(distance, 0)
    return distance

创建KNN分类器:

# 指定算法为ball_tree
knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))

训练模型并计算获取结果

训练:

knn.fit(data_fit, y)

结果:

KNeighborsClassifier(algorithm='ball_tree',
                     metric=<function <lambda> at 0x000000000E7C7D90>,
                     n_neighbors=1)

使用knn算法计算它们最近的10个点:

distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)
print(distance[:5])
print(points[:5])
[[   0.    0.    0.    0.    0.  115.  121.  121.  121.  167.]
 [   0.    0.    0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]
 [   0.  293.  293.  293.  293.  839.  839.  839. 1069. 1069.]
 [   0.    0.    0.    0.    0.    0. 1514. 1514. 2358. 2358.]
 [   0.    0.    0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]]
[[9517 9137 9139 9136 9516 9492 9016 9015 9017 7559]
 [8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]
 [9209 9283 6041 6039 6040 6848 6846 6845 6864 6865]
 [3775 3776 7916 3774 7918 7917 8722 8723 1751 6749]
 [8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]]

结果整理

上面KNN算法已经计算出了结果,现在我将一顿骚操作,把结果整理一下,让结果看起来比较好看,再保存起来。

对于第一条被查找的数据:

find.iloc[0]

结果:

基站中文名    LFZ贵港港北区唐人街1号楼17F弱电井I4
lon                     109.596
lat                     23.0943
Name: 0, dtype: object

如何转换成Datafream呢?

s = pd.DataFrame(find.iloc[0]).T
s

结果:

基站中文名lonlat
0LFZ贵港港北区唐人街1号楼17F弱电井I4109.59623.0943

如何获取这个基站的最近的10个基站的数据呢?

tmp = data.iloc[points[0]]
tmp

结果:

CELL_NAMELONLAT
9517newLFZ贵港市唐人街室分22I109.59644023.094330
9137LFZ贵港港北区唐人街1号楼17F弱电井I4109.59644023.094330
9139LFZ贵港港北区唐人街2号楼11FI6109.59644023.094330
9136LFZ贵港港北区唐人街A区B1F弱电井I3109.59644023.094330
9516newLFZ贵港市唐人街室分21I109.59644023.094330
9492newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17109.59746123.094762
9016LFZ贵港市凤凰二街华隆超市微站O50109.59722623.095149
9015LFZ贵港市凤凰二街华隆超市微站O49109.59722623.095149
9017LFZ贵港市凤凰二街华隆超市微站O51109.59722623.095149
7559newLFZ贵港市桥北商贸城O20109.59679023.092860

再加上距离:

tmp['距离'] = distance[0]
tmp

结果:

CELL_NAMELONLAT距离
9517newLFZ贵港市唐人街室分22I109.59644023.0943300.0
9137LFZ贵港港北区唐人街1号楼17F弱电井I4109.59644023.0943300.0
9139LFZ贵港港北区唐人街2号楼11FI6109.59644023.0943300.0
9136LFZ贵港港北区唐人街A区B1F弱电井I3109.59644023.0943300.0
9516newLFZ贵港市唐人街室分21I109.59644023.0943300.0
9492newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17109.59746123.094762115.0
9016LFZ贵港市凤凰二街华隆超市微站O50109.59722623.095149121.0
9015LFZ贵港市凤凰二街华隆超市微站O49109.59722623.095149121.0
9017LFZ贵港市凤凰二街华隆超市微站O51109.59722623.095149121.0
7559newLFZ贵港市桥北商贸城O20109.59679023.092860167.0

对被查找的基站和结果数据进行合并:

s['距离'] = '被求点0'
s.columns = tmp.columns
tmp = s.append(tmp)
tmp

结果:

CELL_NAMELONLAT距离
0LFZ贵港港北区唐人街1号楼17F弱电井I4109.59623.0943被求点0
9517newLFZ贵港市唐人街室分22I109.59623.09430
9137LFZ贵港港北区唐人街1号楼17F弱电井I4109.59623.09430
9139LFZ贵港港北区唐人街2号楼11FI6109.59623.09430
9136LFZ贵港港北区唐人街A区B1F弱电井I3109.59623.09430
9516newLFZ贵港市唐人街室分21I109.59623.09430
9492newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17109.59723.0948115
9016LFZ贵港市凤凰二街华隆超市微站O50109.59723.0951121
9015LFZ贵港市凤凰二街华隆超市微站O49109.59723.0951121
9017LFZ贵港市凤凰二街华隆超市微站O51109.59723.0951121
7559newLFZ贵港市桥北商贸城O20109.59723.0929167

最终合并代码:

result = pd.DataFrame()
for i, row in find.iterrows():
    tmp = data.iloc[points[i]]
    tmp['距离'] = distance[i]
    s = pd.DataFrame(row).T
    s['距离'] = f'被求点{i}'
    s.columns = tmp.columns
    tmp = s.append(tmp)
    result = result.append(tmp)
result
CELL_NAMELONLAT距离
0LFZ贵港港北区唐人街1号楼17F弱电井I4109.59623.0943被求点0
9517newLFZ贵港市唐人街室分22I109.59623.09430
9137LFZ贵港港北区唐人街1号楼17F弱电井I4109.59623.09430
9139LFZ贵港港北区唐人街2号楼11FI6109.59623.09430
9136LFZ贵港港北区唐人街A区B1F弱电井I3109.59623.09430
9058LFZ贵港市园博园微站7O7109.56123.0741223
9052LFZ贵港市园博园微站1O1109.55923.0764229
9053LFZ贵港市园博园微站2O2109.55923.0756229
9059LFZ贵港市园博园微站8O8109.56223.0739252
9060LFZ贵港市园博园微站9O9109.56223.0741257

176 rows × 4 columns

保存结果

result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)

整体完整代码

import pandas as pd

find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="需找出最近距离的基站",
                     usecols=[1, 2, 3])
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="现网基站数据库",
                     usecols=[1, 2, 3])
# 从基站数据库中筛选出经纬度特征数据,用于给KNN分类器训练
data_fit = data.iloc[:, [1, 2]]
# y本身用于标注每条数据属于哪个类别,但我并不使用KNN的分类功能,所以统一全部标注为类别1
y = [1] * len(data_fit)
# 筛选需要求出最近10个点的的基站的经纬度特征数据
find_x = find.iloc[:, [1, 2]]

# 导入KNN分类器
from sklearn.neighbors import KNeighborsClassifier  #KNN
from math import *


# 创建用于计算两个经纬度距离的函数
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # 经纬度转换成弧度
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # 地球平均半径,6371km
    distance = round(distance, 0)
    return distance


# 指定算法为ball_tree
knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
# 训练模型
knn.fit(data_fit, y)
# 计算它们最近的10个点
distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)

result = pd.DataFrame()
for i, row in find.iterrows():
    tmp = data.iloc[points[i]]
    tmp['距离'] = distance[i]
    s = pd.DataFrame(row).T
    s['距离'] = f'被求点{i}'
    s.columns = tmp.columns
    tmp = s.append(tmp)
    result = result.append(tmp)
result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)

求连接的基站不在最近6个基站内的采样点

做完上面的需求,结果又来了类似的需求:

image-20200611191721381

数据读取

导包:

import pandas as pd

读取基站经纬度信息:

data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx",usecols=[0,2,3])
data
CellNameLongitudeLatitude
0FSSDRongGuiHengDeLouDC-EFW-1113.28250122.767101
1FSSDRongGuiNanJieTanDiWCDC-EFW-1113.26701022.766701
2FSSDRongGuiNanJieTanDiWCDC-EFW-2113.26701022.766701
3FSSDRongGuiNanJieTanDiWCDC-EFW-3113.26701022.766701
4FSSDRongGuiNanJieTanDiWCDC-EFW-4113.26701022.766701
583FSSDRongGuiQingGuiGLGYQDDC-EFH-6113.32605022.775870
584FSSDRongGuiRongGangLuBeiDC-EFH-4113.28691022.781120
585FSSDRongGuiRongGangLuBeiDC-EFH-5113.28691022.781120
586FSSDRongGuiRongGangLuBeiDC-EFH-6113.28691022.781120
587FSSDDaLiangCaiHongLuBanQDC-EFH-1113.29271022.798050

588 rows × 3 columns

读取采样点数据:

find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])
find
LongitudeLatitudeECICELLNAME
0113.27249322.752590231103049FSSDRongGuiHongXingDC-EFH-3
1113.27250222.752585231103049FSSDRongGuiHongXingDC-EFH-3
2113.27250322.752584231103049FSSDRongGuiHongXingDC-EFH-3
3113.27250322.752584231103049FSSDRongGuiHongXingDC-EFH-3
4113.27254822.752557231103049FSSDRongGuiHongXingDC-EFH-3
5066113.31008922.784417231128648FSSDDaLiangDeShengQiaoBQDC-EFH-2
5067113.31009122.784417231128648FSSDDaLiangDeShengQiaoBQDC-EFH-2
5068113.31010322.784415231128648FSSDDaLiangDeShengQiaoBQDC-EFH-2
5069113.31010322.784415231128648FSSDDaLiangDeShengQiaoBQDC-EFH-2
5070113.31010322.784415231128648FSSDDaLiangDeShengQiaoBQDC-EFH-2

5071 rows × 4 columns

将基站名称转换为索引

构建一个用于查询基站名称和对应索引的DataFream:

cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
cellName_index
cell_indexCellName
00FSSDRongGuiHengDeLouDC-EFW-1
11FSSDRongGuiNanJieTanDiWCDC-EFW-1
22FSSDRongGuiNanJieTanDiWCDC-EFW-2
33FSSDRongGuiNanJieTanDiWCDC-EFW-3
44FSSDRongGuiNanJieTanDiWCDC-EFW-4
583583FSSDRongGuiQingGuiGLGYQDDC-EFH-6
584584FSSDRongGuiRongGangLuBeiDC-EFH-4
585585FSSDRongGuiRongGangLuBeiDC-EFH-5
586586FSSDRongGuiRongGangLuBeiDC-EFH-6
587587FSSDDaLiangCaiHongLuBanQDC-EFH-1

588 rows × 2 columns

将采样点数据中正在连接的基站的名称转换为在基站数据库中的索引:

find = find.merge(cellName_index,
                  left_on='CELLNAME',
                  right_on='CellName',
                  copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]
find
LongitudeLatitudeECIcell_index
0113.27249322.75259023110304914
1113.27250222.75258523110304914
2113.27250322.75258423110304914
3113.27250322.75258423110304914
4113.27254822.75255723110304914
5066113.31008922.78441723112864866
5067113.31009122.78441723112864866
5068113.31010322.78441523112864866
5069113.31010322.78441523112864866
5070113.31010322.78441523112864866

获取最近的6个点

筛选需要进行训练的特征数据:

data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]
print(data_fit.head())
print(find_fit.head())
    Longitude   Latitude
0  113.282501  22.767101
1  113.267010  22.766701
2  113.267010  22.766701
3  113.267010  22.766701
4  113.267010  22.766701
    Longitude   Latitude
0  113.272493  22.752590
1  113.272502  22.752585
2  113.272503  22.752584
3  113.272503  22.752584
4  113.272548  22.752557

使用KNN分类器获取结果:

from sklearn.neighbors import KNeighborsClassifier  #KNN
from math import *


def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # 经纬度转换成弧度
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # 地球平均半径,6371km
    distance = round(distance, 0)
    return distance


knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)
points[:5]

结果:

array([[ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13, 101, 103, 102]], dtype=int64)

获取连接的基站不在最近6个基站内的采样点

result = pd.DataFrame(
    [row for i, row in find.iterrows() if not row.cell_index in points[i]])
result
LongitudeLatitudeECIcell_index
8113.27270122.752470231108424.033.0
9113.27270122.752470231108424.033.0
10113.27270222.752470231108424.033.0
11113.27273522.752454231108424.033.0
12113.27274322.752450231108424.033.0
5027113.30551022.76252794409287.0170.0
5028113.30552822.76250794409287.0170.0
5029113.30553722.76249794409287.0170.0
5030113.30553722.76249794409287.0170.0
5031113.30553722.76249694409287.0170.0

1279 rows × 4 columns

整体完整代码

import pandas as pd


# 读取基站经纬度信息:
data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx",usecols=[0,2,3])
# 读取采样点数据:
find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])

# ## 将基站名称转换为索引

# 构建一个用于查询基站名称和对应索引的DataFream:
cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
# 将采样点数据中正在连接的基站的名称转换为在基站数据库中的索引:
find = find.merge(cellName_index,
                  left_on='CELLNAME',
                  right_on='CellName',
                  copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]

# ## 获取最近的6个点
# 筛选需要进行训练的特征数据:
data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]

# 使用KNN分类器获取结果:
from sklearn.neighbors import KNeighborsClassifier  #KNN
from math import *

def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # 经纬度转换成弧度
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # 地球平均半径,6371km
    distance = round(distance, 0)
    return distance

knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)

# ## 获取连接的基站不在最近6个基站内的采样点
result = pd.DataFrame(
    [row for i, row in find.iterrows() if not row.cell_index in points[i]])
result
LongitudeLatitudeECIcell_index
8113.27270122.752470231108424.033.0
9113.27270122.752470231108424.033.0
10113.27270222.752470231108424.033.0
11113.27273522.752454231108424.033.0
12113.27274322.752450231108424.033.0
5027113.30551022.76252794409287.0170.0
5028113.30552822.76250794409287.0170.0
5029113.30553722.76249794409287.0170.0
5030113.30553722.76249794409287.0170.0
5031113.30553722.76249694409287.0170.0

1279 rows × 4 columns


本文转载:CSDN博客