作者:小小明,Pandas数据处理专家,致力于帮助无数数据从业者解决数据处理难题。
这是半年前写的一篇文章,里面涉及的方法可能有些过时,但处理思想仍有较高的参考价值,现在发布到csdn。
文章目录
有位朋友咨询了我这样一个问题:
然后我最终就通过KNN算法的查找临近节点的函数实现了这个功能,现在分享给大家实现过程。
数据形式
基站数据库数据
import pandas as pd
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
sheet_name="现网基站数据库",
usecols=[1, 2, 3])
data
CELL_NAME | LON | LAT | |
---|---|---|---|
0 | LFZ贵港市电信城北局O4 | 109.59128 | 23.11093 |
1 | LFZ贵港市电信城北局O5 | 109.59128 | 23.11093 |
2 | LFZ贵港市电信城北局O6 | 109.59128 | 23.11093 |
3 | LFZ贵港市电信城北局O10 | 109.59128 | 23.11093 |
4 | LFZ贵港市电信城北局O49 | 109.59128 | 23.11093 |
… | … | … | … |
9514 | LFZ贵港市人民医院外科楼25F楼梯间I5 | 109.59482 | 23.09028 |
9515 | LFZ贵港市人民医院妇科楼13F楼梯间I6 | 109.59482 | 23.09028 |
9516 | newLFZ贵港市唐人街室分21I | 109.59644 | 23.09433 |
9517 | newLFZ贵港市唐人街室分22I | 109.59644 | 23.09433 |
9518 | newLFZ贵港市港北区锦泰公馆1栋2单元B1F电梯旁B1FIQ20 | 109.58687 | 23.10925 |
9519 rows × 3 columns
共9519条示例数据。
需找出最近距离的基站,共16条示例数据:
需找出最近距离的基站数据
find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
sheet_name="需找出最近距离的基站",
usecols=[1, 2, 3])
find
基站中文名 | lon | lat | |
---|---|---|---|
0 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596440 | 23.094330 |
1 | newLFZ桂平社坡镇禄全村二(村西山头)O19 | 110.190283 | 23.353451 |
2 | LFZ桂平市高铁站I10 | 110.112020 | 23.318939 |
3 | newLFZ桂平大洋镇什字村二(旺冲良)O56 | 109.980278 | 23.112500 |
4 | newLFZ桂平社坡镇禄全村二(村西山头)O17 | 110.190283 | 23.353451 |
5 | newLFZ桂平社坡镇禄全村二(村西山头)O18 | 110.190283 | 23.353451 |
6 | LFZ贵港市塔山部队装甲团O53 | 109.651942 | 23.106775 |
7 | newLFZ贵港黄练镇水村O54 | 109.253430 | 23.190430 |
8 | LFZ贵港市世纪经典悉尼座4单元12楼线井I11 | 109.588060 | 23.111116 |
9 | LFZ贵港黄练镇新谭三中O57 | 109.274750 | 23.180370 |
10 | LFZ贵港奇石乡六马村六良屯O52 | 109.637593 | 23.365757 |
11 | newLFZ贵港市体育中心西看台4楼I5 | 109.559160 | 23.115079 |
12 | newLFZ贵港市体育中心综合馆I2 | 109.559160 | 23.115079 |
13 | LFZ桂平市郁江湾7号楼B1F电井外墙 I8 | 110.072415 | 23.381235 |
14 | newLFZ贵港山北乡大王村O53 | 109.411290 | 23.353670 |
15 | LFZ贵港市园博园主展馆2FI10 | 109.561264 | 23.076138 |
现在我们的目标就是找出这16个基站每个基站在基站数据库中最近的10个基站。
常规做法是暴力遍历,那么每个基站都要遍历9519次,当前示例数据还勉强可以接受,但是基站数据库规模一旦达到10万以上,那计算起来就耗时很久了。为了提高计算效率,我利用KNN算法的ball_tree快速计算。
注意:sklearn的KNN算法还提供了brute也可以自定义距离函数,但经过实践发现ball_tree的计算速度会更快一些。
使用KNN分类器计算每个基站最近的10个基站
筛选用于训练的经纬度特征数据
从基站数据库中筛选出经纬度特征数据:
# 从基站数据库中筛选出经纬度特征数据,用于给KNN分类器训练
data_fit = data.iloc[:, [1, 2]]
# y本身用于标注每条数据属于哪个类别,但我并不使用KNN的分类功能,所以统一全部标注为类别1
y = [1] * len(data_fit)
data_fit
LON | LAT | |
---|---|---|
0 | 109.59128 | 23.11093 |
1 | 109.59128 | 23.11093 |
2 | 109.59128 | 23.11093 |
3 | 109.59128 | 23.11093 |
4 | 109.59128 | 23.11093 |
… | … | … |
9514 | 109.59482 | 23.09028 |
9515 | 109.59482 | 23.09028 |
9516 | 109.59644 | 23.09433 |
9517 | 109.59644 | 23.09433 |
9518 | 109.58687 | 23.10925 |
9519 rows × 2 columns
筛选需要求出最近10个点的的基站的经纬度特征数据:
# 筛选需要求出最近10个点的的基站的经纬度特征数据
find_x = find.iloc[:, [1, 2]]
find_x.head()
lon | lat | |
---|---|---|
0 | 109.596440 | 23.094330 |
1 | 110.190283 | 23.353451 |
2 | 110.112020 | 23.318939 |
3 | 109.980278 | 23.112500 |
4 | 110.190283 | 23.353451 |
构建KNN分类器
导入KNN分类器:
# 导入KNN分类器
from sklearn.neighbors import KNeighborsClassifier #KNN
from math import *
创建用于计算两个经纬度距离的函数:
# 创建用于计算两个经纬度距离的函数
def distancefuc(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(
radians,
[float(lon1), float(lat1),
float(lon2), float(lat2)]) # 经纬度转换成弧度
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
distance = 2 * asin(sqrt(a)) * 6371 * 1000 # 地球平均半径,6371km
distance = round(distance, 0)
return distance
创建KNN分类器:
# 指定算法为ball_tree
knn = KNeighborsClassifier(n_neighbors=1,
algorithm='ball_tree',
metric=lambda s1, s2: distancefuc(*s1, *s2))
训练模型并计算获取结果
训练:
knn.fit(data_fit, y)
结果:
KNeighborsClassifier(algorithm='ball_tree',
metric=<function <lambda> at 0x000000000E7C7D90>,
n_neighbors=1)
使用knn算法计算它们最近的10个点:
distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)
print(distance[:5])
print(points[:5])
[[ 0. 0. 0. 0. 0. 115. 121. 121. 121. 167.]
[ 0. 0. 0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]
[ 0. 293. 293. 293. 293. 839. 839. 839. 1069. 1069.]
[ 0. 0. 0. 0. 0. 0. 1514. 1514. 2358. 2358.]
[ 0. 0. 0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]]
[[9517 9137 9139 9136 9516 9492 9016 9015 9017 7559]
[8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]
[9209 9283 6041 6039 6040 6848 6846 6845 6864 6865]
[3775 3776 7916 3774 7918 7917 8722 8723 1751 6749]
[8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]]
结果整理
上面KNN算法已经计算出了结果,现在我将一顿骚操作,把结果整理一下,让结果看起来比较好看,再保存起来。
对于第一条被查找的数据:
find.iloc[0]
结果:
基站中文名 LFZ贵港港北区唐人街1号楼17F弱电井I4
lon 109.596
lat 23.0943
Name: 0, dtype: object
如何转换成Datafream呢?
s = pd.DataFrame(find.iloc[0]).T
s
结果:
基站中文名 | lon | lat | |
---|---|---|---|
0 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596 | 23.0943 |
如何获取这个基站的最近的10个基站的数据呢?
tmp = data.iloc[points[0]]
tmp
结果:
CELL_NAME | LON | LAT | |
---|---|---|---|
9517 | newLFZ贵港市唐人街室分22I | 109.596440 | 23.094330 |
9137 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596440 | 23.094330 |
9139 | LFZ贵港港北区唐人街2号楼11FI6 | 109.596440 | 23.094330 |
9136 | LFZ贵港港北区唐人街A区B1F弱电井I3 | 109.596440 | 23.094330 |
9516 | newLFZ贵港市唐人街室分21I | 109.596440 | 23.094330 |
9492 | newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 | 109.597461 | 23.094762 |
9016 | LFZ贵港市凤凰二街华隆超市微站O50 | 109.597226 | 23.095149 |
9015 | LFZ贵港市凤凰二街华隆超市微站O49 | 109.597226 | 23.095149 |
9017 | LFZ贵港市凤凰二街华隆超市微站O51 | 109.597226 | 23.095149 |
7559 | newLFZ贵港市桥北商贸城O20 | 109.596790 | 23.092860 |
再加上距离:
tmp['距离'] = distance[0]
tmp
结果:
CELL_NAME | LON | LAT | 距离 | |
---|---|---|---|---|
9517 | newLFZ贵港市唐人街室分22I | 109.596440 | 23.094330 | 0.0 |
9137 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596440 | 23.094330 | 0.0 |
9139 | LFZ贵港港北区唐人街2号楼11FI6 | 109.596440 | 23.094330 | 0.0 |
9136 | LFZ贵港港北区唐人街A区B1F弱电井I3 | 109.596440 | 23.094330 | 0.0 |
9516 | newLFZ贵港市唐人街室分21I | 109.596440 | 23.094330 | 0.0 |
9492 | newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 | 109.597461 | 23.094762 | 115.0 |
9016 | LFZ贵港市凤凰二街华隆超市微站O50 | 109.597226 | 23.095149 | 121.0 |
9015 | LFZ贵港市凤凰二街华隆超市微站O49 | 109.597226 | 23.095149 | 121.0 |
9017 | LFZ贵港市凤凰二街华隆超市微站O51 | 109.597226 | 23.095149 | 121.0 |
7559 | newLFZ贵港市桥北商贸城O20 | 109.596790 | 23.092860 | 167.0 |
对被查找的基站和结果数据进行合并:
s['距离'] = '被求点0'
s.columns = tmp.columns
tmp = s.append(tmp)
tmp
结果:
CELL_NAME | LON | LAT | 距离 | |
---|---|---|---|---|
0 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596 | 23.0943 | 被求点0 |
9517 | newLFZ贵港市唐人街室分22I | 109.596 | 23.0943 | 0 |
9137 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596 | 23.0943 | 0 |
9139 | LFZ贵港港北区唐人街2号楼11FI6 | 109.596 | 23.0943 | 0 |
9136 | LFZ贵港港北区唐人街A区B1F弱电井I3 | 109.596 | 23.0943 | 0 |
9516 | newLFZ贵港市唐人街室分21I | 109.596 | 23.0943 | 0 |
9492 | newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 | 109.597 | 23.0948 | 115 |
9016 | LFZ贵港市凤凰二街华隆超市微站O50 | 109.597 | 23.0951 | 121 |
9015 | LFZ贵港市凤凰二街华隆超市微站O49 | 109.597 | 23.0951 | 121 |
9017 | LFZ贵港市凤凰二街华隆超市微站O51 | 109.597 | 23.0951 | 121 |
7559 | newLFZ贵港市桥北商贸城O20 | 109.597 | 23.0929 | 167 |
最终合并代码:
result = pd.DataFrame()
for i, row in find.iterrows():
tmp = data.iloc[points[i]]
tmp['距离'] = distance[i]
s = pd.DataFrame(row).T
s['距离'] = f'被求点{i}'
s.columns = tmp.columns
tmp = s.append(tmp)
result = result.append(tmp)
result
CELL_NAME | LON | LAT | 距离 | |
---|---|---|---|---|
0 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596 | 23.0943 | 被求点0 |
9517 | newLFZ贵港市唐人街室分22I | 109.596 | 23.0943 | 0 |
9137 | LFZ贵港港北区唐人街1号楼17F弱电井I4 | 109.596 | 23.0943 | 0 |
9139 | LFZ贵港港北区唐人街2号楼11FI6 | 109.596 | 23.0943 | 0 |
9136 | LFZ贵港港北区唐人街A区B1F弱电井I3 | 109.596 | 23.0943 | 0 |
… | … | … | … | … |
9058 | LFZ贵港市园博园微站7O7 | 109.561 | 23.0741 | 223 |
9052 | LFZ贵港市园博园微站1O1 | 109.559 | 23.0764 | 229 |
9053 | LFZ贵港市园博园微站2O2 | 109.559 | 23.0756 | 229 |
9059 | LFZ贵港市园博园微站8O8 | 109.562 | 23.0739 | 252 |
9060 | LFZ贵港市园博园微站9O9 | 109.562 | 23.0741 | 257 |
176 rows × 4 columns
保存结果
result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)
整体完整代码
import pandas as pd
find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
sheet_name="需找出最近距离的基站",
usecols=[1, 2, 3])
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
sheet_name="现网基站数据库",
usecols=[1, 2, 3])
# 从基站数据库中筛选出经纬度特征数据,用于给KNN分类器训练
data_fit = data.iloc[:, [1, 2]]
# y本身用于标注每条数据属于哪个类别,但我并不使用KNN的分类功能,所以统一全部标注为类别1
y = [1] * len(data_fit)
# 筛选需要求出最近10个点的的基站的经纬度特征数据
find_x = find.iloc[:, [1, 2]]
# 导入KNN分类器
from sklearn.neighbors import KNeighborsClassifier #KNN
from math import *
# 创建用于计算两个经纬度距离的函数
def distancefuc(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(
radians,
[float(lon1), float(lat1),
float(lon2), float(lat2)]) # 经纬度转换成弧度
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
distance = 2 * asin(sqrt(a)) * 6371 * 1000 # 地球平均半径,6371km
distance = round(distance, 0)
return distance
# 指定算法为ball_tree
knn = KNeighborsClassifier(n_neighbors=1,
algorithm='ball_tree',
metric=lambda s1, s2: distancefuc(*s1, *s2))
# 训练模型
knn.fit(data_fit, y)
# 计算它们最近的10个点
distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)
result = pd.DataFrame()
for i, row in find.iterrows():
tmp = data.iloc[points[i]]
tmp['距离'] = distance[i]
s = pd.DataFrame(row).T
s['距离'] = f'被求点{i}'
s.columns = tmp.columns
tmp = s.append(tmp)
result = result.append(tmp)
result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)
求连接的基站不在最近6个基站内的采样点
做完上面的需求,结果又来了类似的需求:
数据读取
导包:
import pandas as pd
读取基站经纬度信息:
data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx",usecols=[0,2,3])
data
CellName | Longitude | Latitude | |
---|---|---|---|
0 | FSSDRongGuiHengDeLouDC-EFW-1 | 113.282501 | 22.767101 |
1 | FSSDRongGuiNanJieTanDiWCDC-EFW-1 | 113.267010 | 22.766701 |
2 | FSSDRongGuiNanJieTanDiWCDC-EFW-2 | 113.267010 | 22.766701 |
3 | FSSDRongGuiNanJieTanDiWCDC-EFW-3 | 113.267010 | 22.766701 |
4 | FSSDRongGuiNanJieTanDiWCDC-EFW-4 | 113.267010 | 22.766701 |
… | … | … | … |
583 | FSSDRongGuiQingGuiGLGYQDDC-EFH-6 | 113.326050 | 22.775870 |
584 | FSSDRongGuiRongGangLuBeiDC-EFH-4 | 113.286910 | 22.781120 |
585 | FSSDRongGuiRongGangLuBeiDC-EFH-5 | 113.286910 | 22.781120 |
586 | FSSDRongGuiRongGangLuBeiDC-EFH-6 | 113.286910 | 22.781120 |
587 | FSSDDaLiangCaiHongLuBanQDC-EFH-1 | 113.292710 | 22.798050 |
588 rows × 3 columns
读取采样点数据:
find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])
find
Longitude | Latitude | ECI | CELLNAME | |
---|---|---|---|---|
0 | 113.272493 | 22.752590 | 231103049 | FSSDRongGuiHongXingDC-EFH-3 |
1 | 113.272502 | 22.752585 | 231103049 | FSSDRongGuiHongXingDC-EFH-3 |
2 | 113.272503 | 22.752584 | 231103049 | FSSDRongGuiHongXingDC-EFH-3 |
3 | 113.272503 | 22.752584 | 231103049 | FSSDRongGuiHongXingDC-EFH-3 |
4 | 113.272548 | 22.752557 | 231103049 | FSSDRongGuiHongXingDC-EFH-3 |
… | … | … | … | … |
5066 | 113.310089 | 22.784417 | 231128648 | FSSDDaLiangDeShengQiaoBQDC-EFH-2 |
5067 | 113.310091 | 22.784417 | 231128648 | FSSDDaLiangDeShengQiaoBQDC-EFH-2 |
5068 | 113.310103 | 22.784415 | 231128648 | FSSDDaLiangDeShengQiaoBQDC-EFH-2 |
5069 | 113.310103 | 22.784415 | 231128648 | FSSDDaLiangDeShengQiaoBQDC-EFH-2 |
5070 | 113.310103 | 22.784415 | 231128648 | FSSDDaLiangDeShengQiaoBQDC-EFH-2 |
5071 rows × 4 columns
将基站名称转换为索引
构建一个用于查询基站名称和对应索引的DataFream:
cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
cellName_index
cell_index | CellName | |
---|---|---|
0 | 0 | FSSDRongGuiHengDeLouDC-EFW-1 |
1 | 1 | FSSDRongGuiNanJieTanDiWCDC-EFW-1 |
2 | 2 | FSSDRongGuiNanJieTanDiWCDC-EFW-2 |
3 | 3 | FSSDRongGuiNanJieTanDiWCDC-EFW-3 |
4 | 4 | FSSDRongGuiNanJieTanDiWCDC-EFW-4 |
… | … | … |
583 | 583 | FSSDRongGuiQingGuiGLGYQDDC-EFH-6 |
584 | 584 | FSSDRongGuiRongGangLuBeiDC-EFH-4 |
585 | 585 | FSSDRongGuiRongGangLuBeiDC-EFH-5 |
586 | 586 | FSSDRongGuiRongGangLuBeiDC-EFH-6 |
587 | 587 | FSSDDaLiangCaiHongLuBanQDC-EFH-1 |
588 rows × 2 columns
将采样点数据中正在连接的基站的名称转换为在基站数据库中的索引:
find = find.merge(cellName_index,
left_on='CELLNAME',
right_on='CellName',
copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]
find
Longitude | Latitude | ECI | cell_index | |
---|---|---|---|---|
0 | 113.272493 | 22.752590 | 231103049 | 14 |
1 | 113.272502 | 22.752585 | 231103049 | 14 |
2 | 113.272503 | 22.752584 | 231103049 | 14 |
3 | 113.272503 | 22.752584 | 231103049 | 14 |
4 | 113.272548 | 22.752557 | 231103049 | 14 |
… | … | … | … | … |
5066 | 113.310089 | 22.784417 | 231128648 | 66 |
5067 | 113.310091 | 22.784417 | 231128648 | 66 |
5068 | 113.310103 | 22.784415 | 231128648 | 66 |
5069 | 113.310103 | 22.784415 | 231128648 | 66 |
5070 | 113.310103 | 22.784415 | 231128648 | 66 |
获取最近的6个点
筛选需要进行训练的特征数据:
data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]
print(data_fit.head())
print(find_fit.head())
Longitude Latitude
0 113.282501 22.767101
1 113.267010 22.766701
2 113.267010 22.766701
3 113.267010 22.766701
4 113.267010 22.766701
Longitude Latitude
0 113.272493 22.752590
1 113.272502 22.752585
2 113.272503 22.752584
3 113.272503 22.752584
4 113.272548 22.752557
使用KNN分类器获取结果:
from sklearn.neighbors import KNeighborsClassifier #KNN
from math import *
def distancefuc(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(
radians,
[float(lon1), float(lat1),
float(lon2), float(lat2)]) # 经纬度转换成弧度
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
distance = 2 * asin(sqrt(a)) * 6371 * 1000 # 地球平均半径,6371km
distance = round(distance, 0)
return distance
knn = KNeighborsClassifier(n_neighbors=1,
algorithm='ball_tree',
metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)
points[:5]
结果:
array([[ 14, 12, 13, 87, 88, 86],
[ 14, 12, 13, 87, 88, 86],
[ 14, 12, 13, 87, 88, 86],
[ 14, 12, 13, 87, 88, 86],
[ 14, 12, 13, 101, 103, 102]], dtype=int64)
获取连接的基站不在最近6个基站内的采样点
result = pd.DataFrame(
[row for i, row in find.iterrows() if not row.cell_index in points[i]])
result
Longitude | Latitude | ECI | cell_index | |
---|---|---|---|---|
8 | 113.272701 | 22.752470 | 231108424.0 | 33.0 |
9 | 113.272701 | 22.752470 | 231108424.0 | 33.0 |
10 | 113.272702 | 22.752470 | 231108424.0 | 33.0 |
11 | 113.272735 | 22.752454 | 231108424.0 | 33.0 |
12 | 113.272743 | 22.752450 | 231108424.0 | 33.0 |
… | … | … | … | … |
5027 | 113.305510 | 22.762527 | 94409287.0 | 170.0 |
5028 | 113.305528 | 22.762507 | 94409287.0 | 170.0 |
5029 | 113.305537 | 22.762497 | 94409287.0 | 170.0 |
5030 | 113.305537 | 22.762497 | 94409287.0 | 170.0 |
5031 | 113.305537 | 22.762496 | 94409287.0 | 170.0 |
1279 rows × 4 columns
整体完整代码
import pandas as pd
# 读取基站经纬度信息:
data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx",usecols=[0,2,3])
# 读取采样点数据:
find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])
# ## 将基站名称转换为索引
# 构建一个用于查询基站名称和对应索引的DataFream:
cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
# 将采样点数据中正在连接的基站的名称转换为在基站数据库中的索引:
find = find.merge(cellName_index,
left_on='CELLNAME',
right_on='CellName',
copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]
# ## 获取最近的6个点
# 筛选需要进行训练的特征数据:
data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]
# 使用KNN分类器获取结果:
from sklearn.neighbors import KNeighborsClassifier #KNN
from math import *
def distancefuc(lon1, lat1, lon2, lat2):
lon1, lat1, lon2, lat2 = map(
radians,
[float(lon1), float(lat1),
float(lon2), float(lat2)]) # 经纬度转换成弧度
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
distance = 2 * asin(sqrt(a)) * 6371 * 1000 # 地球平均半径,6371km
distance = round(distance, 0)
return distance
knn = KNeighborsClassifier(n_neighbors=1,
algorithm='ball_tree',
metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)
# ## 获取连接的基站不在最近6个基站内的采样点
result = pd.DataFrame(
[row for i, row in find.iterrows() if not row.cell_index in points[i]])
result
Longitude | Latitude | ECI | cell_index | |
---|---|---|---|---|
8 | 113.272701 | 22.752470 | 231108424.0 | 33.0 |
9 | 113.272701 | 22.752470 | 231108424.0 | 33.0 |
10 | 113.272702 | 22.752470 | 231108424.0 | 33.0 |
11 | 113.272735 | 22.752454 | 231108424.0 | 33.0 |
12 | 113.272743 | 22.752450 | 231108424.0 | 33.0 |
… | … | … | … | … |
5027 | 113.305510 | 22.762527 | 94409287.0 | 170.0 |
5028 | 113.305528 | 22.762507 | 94409287.0 | 170.0 |
5029 | 113.305537 | 22.762497 | 94409287.0 | 170.0 |
5030 | 113.305537 | 22.762497 | 94409287.0 | 170.0 |
5031 | 113.305537 | 22.762496 | 94409287.0 | 170.0 |
1279 rows × 4 columns