Predicting Airbnb Listing Prices with Scikit-Learn and Spark, by Nick Amato

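One of the most useful applications of machine learning is predicting customer behavior. The range is broad: helping customers make the best choice (usually the one with the best value for money), getting customers to spread the word about your product, and building a loyal customer base over time. Customers today are no longer satisfied with simply clicking and buying from a catalog or shopping cart; they expect you to offer intelligent recommendations.

That is easy to say. How do you do it in practice? Let's look at Airbnb, a model of the "sharing economy", and work through an end-to-end example using Python and the popular Scikit-Learn library, based on Airbnb's publicly available data for San Francisco.
This time we use Apache Spark differently than usual. Machine learning problems on Spark are normally solved with Spark MLlib. Here we use the spark-sklearn integration package instead to scale out across multiple machines and cores, which speeds up the search and, by making a larger search affordable, can improve the accuracy of the result.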

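Getting Started

We start by predicting a listing's price from its attributes. Price prediction has several applications: suggesting a price to customers (with a warning when a price is too high or too low), helping advertisers place ads, and providing analysis that informs market decisions. Each city's dataset contains the following items of interest: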

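listings.csv.gz: detailed listing data, with attributes for each listing such as the number of bedrooms, the number of bathrooms, and the location;
calendar.csv.gz: calendar information for each listing;
reviews.csv.gz: review data for the listings;
neighborhoods and GeoJSON files: maps and details of the neighborhoods within the city.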

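This example walks through a scikit-learn application in Python and shows how to use Spark for cross-validation and hyperparameter tuning. We start with scikit-learn's linear regression methods, then use Spark to speed up an exhaustive GridSearchCV search over a GradientBoostingRegressor and improve its result.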

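Exploring and Cleaning the Data

First, we load the listings.csv dataset from the MapR-FS file system and create a Pandas dataframe. (Note: Pandas is an open source data analysis library for Python; the data structure it provides is the DataFrame.) The dataset contains roughly 7,000 listings, each with 90 different columns, but not every column is useful, so we pick only the few that help predict the listing price.
The code is as follows: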

%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn import linear_model
# GridSearchCV and train_test_split live in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import preprocessing
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
from collections import Counter
 
 
LISTINGSFILE = '/mapr/tmclust1/user/mapr/pyspark-learn/airbnb/listings.csv'
 
cols = ['price',
       'accommodates',
       'bedrooms',
       'beds',
       'neighbourhood_cleansed',
       'room_type',
       'cancellation_policy',
       'instant_bookable',
       'reviews_per_month',
       'number_of_reviews',
       'availability_30',
       'review_scores_rating'
       ]
 
# read the file into a dataframe
df = pd.read_csv(LISTINGSFILE, usecols=cols)

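The neighbourhood_cleansed column holds the neighborhood of each listing. The listings are not evenly distributed across neighborhoods: sorted by count, as in the plot produced below, the distribution forms a curve with a few neighborhoods at the high end holding many listings and a tail on the left holding very few. Overall, though, the neighborhoods are reasonably well represented.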

nb_counts = Counter(df.neighbourhood_cleansed)
tdf = pd.DataFrame.from_dict(nb_counts, orient='index').sort_values(by=0)
tdf.plot(kind='bar')

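Next we clean the data, step by step.
The number_of_reviews and reviews_per_month columns look as if they would force us to drop a large number of rows because of NaN values (NaN is how Pandas represents a missing, NULL-like value). We set reviews_per_month to 0 wherever it is NaN, because those listings can still be meaningful in the analysis.
We then drop records that are clearly anomalous, such as listings with 0 bedrooms, 0 beds, or a price of 0, and remove any rows that still contain NaN values. The resulting dataset has 5,246 entries, down from 7,029 in the original.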

# first fixup 'reviews_per_month' where there are no reviews
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
 
# just drop rows with bad/weird values
# (we could do more here)
df = df[df.bedrooms != 0]
df = df[df.beds != 0]
df = df[df.price != 0]
df = df.dropna(axis=0)

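As the last cleaning step, we convert the price column to float and keep only listings with exactly one bedroom. About 70% of the entries are one-bedroom listings (plausible for a large city such as San Francisco), so this is the type of unit with the most samples. Restricting the regression to a single type of unit should help the model, since complex interactions with other features become less likely. To predict prices for other unit types, one option is to build a separate model for each size (2, 3, 4 bedrooms and so on); another is to cluster the data and see whether it separates more cleanly some other way.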

df = df[df.bedrooms == 1].copy()
 
# remove the $ from the price and convert to float
# (accounting-style negatives such as "($5.00)" become -5.00)
df['price'] = (df['price']
               .replace(r'[\$,)]', '', regex=True)
               .replace('[(]', '-', regex=True)
               .astype(float))
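
The price clean-up above relies on two regular-expression replacements. As a minimal sketch of what they do to a single value, the dependency-free function below applies the same two steps with the standard `re` module. The `parse_price` helper and the sample strings are illustrative assumptions, not part of the original code.

```python
import re

def parse_price(text):
    """Convert a price string such as '$1,250.00' to a float.

    Step 1 drops '$', ',' and ')'; step 2 turns an opening '(' into a
    minus sign, so accounting-style negatives like '($45.00)' survive.
    """
    cleaned = re.sub(r"[$,)]", "", text)
    cleaned = cleaned.replace("(", "-")
    return float(cleaned)

# hypothetical sample values, not taken from the dataset
samples = ["$95.00", "$1,250.00", "($45.00)"]
prices = [parse_price(s) for s in samples]
print(prices)
```

Applying the replacements column-wise with Pandas, as the article does, gives the same per-value result without an explicit loop.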

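Handling Categorical Variables

Several columns in the dataset contain categorical variables. There are a few ways to handle them, depending on the possible values.
The neighbourhood_cleansed column is the neighborhood name, a string. Regression in scikit-learn accepts only numeric columns. For this kind of variable we use Pandas' get_dummies to convert it into dummy variables, a process also known as "one hot" encoding: each listing row gets a "1" in the column that corresponds to its neighborhood. We handle the cancellation_policy and room_type columns the same way.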

The instant_bookable column is a boolean value.

# get feature encoding for categorical variables
n_dummies = pd.get_dummies(df.neighbourhood_cleansed)
rt_dummies = pd.get_dummies(df.room_type)
xcl_dummies = pd.get_dummies(df.cancellation_policy)
 
# convert boolean column to a single boolean value indicating whether this listing has instant booking available
ib_dummies = pd.get_dummies(df.instant_bookable, prefix="instant")
ib_dummies = ib_dummies.drop('instant_f', axis=1)
 
# replace the old columns with our new one-hot encoded ones
alldata = pd.concat((df.drop(['neighbourhood_cleansed', \
   'room_type', 'cancellation_policy', 'instant_bookable'], axis=1), \
   n_dummies.astype(int), rt_dummies.astype(int), \
   xcl_dummies.astype(int), ib_dummies.astype(int)), \
   axis=1)
allcols = alldata.columns
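
To make the encoding step concrete, here is a small dependency-free sketch of what one-hot encoding produces. It is not the Pandas implementation; the `one_hot` helper is hypothetical, and the room type values other than "Entire home/apt" are assumed for illustration.

```python
def one_hot(values, prefix=None):
    """Return (column_names, rows) with one 0/1 indicator column per distinct value."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" if prefix else c for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

# illustrative values; each row ends up with exactly one 1
room_types = ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
names, rows = one_hot(room_types)

# a 't'/'f' column needs only one indicator, so keep the 't' column
# and drop the redundant 'f' column, as the article does for instant_bookable
ib_names, ib_rows = one_hot(["t", "f", "f"], prefix="instant")
keep = ib_names.index("instant_t")
instant = [row[keep] for row in ib_rows]
```

Dropping one of the two columns for a binary variable avoids carrying two features that are exact complements of each other.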

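Next we use Pandas' scatter_matrix function to quickly display a matrix of pairwise feature plots and check for collinearity between features. Collinearity is not pronounced in this example, because we picked only a small set of features that are not obviously related to one another.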

scattercols = ['price','accommodates', 'number_of_reviews', 'reviews_per_month', 'beds', 'availability_30', 'review_scores_rating']
# scatter_matrix moved to pandas.plotting in pandas 0.20
axs = pd.plotting.scatter_matrix(alldata[scattercols],
                                 figsize=(12, 12), c='red')


The scatter_matrix output shows no obvious problems. The most closely related features are probably beds and accommodates.

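Making Predictions

One of the great strengths of scikit-learn is that we can easily try different linear models on the same dataset, which gives us some hints about where to start tuning. We begin with six of them: vanilla linear regression, ridge and lasso regression, ElasticNet, Bayesian ridge, and Orthogonal Matching Pursuit.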

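To evaluate which model does better, we need a way to score them; here we use the median absolute error. We chose it because the data is likely to contain outliers, since we did no filtering or aggregation on the dataset, and the median is less sensitive to them.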
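
To see why the median absolute error copes with outliers, the sketch below compares it with the mean absolute error on a handful of made-up prices, one of which is badly mispredicted. It uses only the standard library; the two helper functions and all of the numbers are illustrative assumptions.

```python
from statistics import mean, median

def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between truth and prediction."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between truth and prediction."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# hypothetical nightly prices; the last listing is an outlier
y_true = [100, 120, 90, 150, 1000]
y_pred = [110, 115, 100, 140, 200]

med_err = median_absolute_error(y_true, y_pred)
mean_err = mean_absolute_error(y_true, y_pred)
print(med_err, mean_err)
```

The single 800-dollar miss dominates the mean, while the median still reflects the typical error of about 10 dollars.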

rs = 1
ests = [ linear_model.LinearRegression(), linear_model.Ridge(),
       linear_model.Lasso(), linear_model.ElasticNet(),
       linear_model.BayesianRidge(), linear_model.OrthogonalMatchingPursuit() ]
ests_labels = np.array(['Linear', 'Ridge', 'Lasso', 'ElasticNet', 'BayesRidge', 'OMP'])
errvals = np.array([])
 
X_train, X_test, y_train, y_test = train_test_split(alldata.drop(['price'], axis=1),
                                                   alldata.price, test_size=0.2, random_state=20)
 
for e in ests:
   e.fit(X_train, y_train)
   this_err = metrics.median_absolute_error(y_test, e.predict(X_test))
   #print "got error %0.2f" % this_err
   errvals = np.append(errvals, this_err)
 
pos = np.arange(errvals.shape[0])
srt = np.argsort(errvals)
plt.figure(figsize=(7,5))
plt.bar(pos, errvals[srt], align='center')
plt.xticks(pos, ests_labels[srt])
plt.xlabel('Estimator')
plt.ylabel('Median Absolute Error')

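The six estimators give roughly the same result, with median absolute errors between 30 and 35 dollars. The results are strikingly similar, mainly because we have not done any tuning.

Next we turn to ensemble methods to get better results. Their advantage is better accuracy; the side effect is a set of hyperparameters that are hard to pin down, so tuning is required. Each parameter affects the model, and finding the right configuration takes experimentation. The most common approach is grid search: try every combination of hyperparameters by brute force and use cross-validation to find the best model. Scikit-learn provides GridSearchCV for exactly this purpose.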

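Using GridSearchCV means trading off the thoroughness of the exhaustive search and cross-validation against the CPU and time they consume. This is why we use Spark to distribute the search: it lets us try more hyperparameter combinations in less time.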
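
To make that cost concrete, the sketch below enumerates a grid the way an exhaustive search would and counts the model fits it implies. The grid values and the `expand_grid` helper are illustrative assumptions, not the grid used in the article.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))

# hypothetical grid: 3 depths x 2 learning rates x 2 losses
param_grid = {
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.1],
    "loss": ["ls", "lad"],
}

candidates = list(expand_grid(param_grid))
cv = 3                                   # folds per candidate
total_fits = len(candidates) * cv        # one model fit per (candidate, fold)
print(len(candidates), total_fits)
```

Every extra value in any list multiplies the number of candidates, which is why exhaustive search gets expensive so quickly.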

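Our first attempt limits the number of parameter values so that we get results quickly, and lets us see whether the ensemble does better than the individual linear methods.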

n_est = 300
 
tuned_parameters = {
   "n_estimators": [ n_est ],
   "max_depth" : [ 4 ],
   "learning_rate": [ 0.01 ],
   # a node needs at least 2 samples to split; scikit-learn 0.18+ rejects 1
   "min_samples_split" : [ 2 ],
   "loss" : [ 'ls', 'lad' ]
}
 
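gbr = ensemble.GradientBoostingRegressor()
# 'neg_median_absolute_error' is the scorer name from scikit-learn 0.18 on
# (the error is negated so that a higher score is better)
clf = GridSearchCV(gbr, cv=3, param_grid=tuned_parameters,
       scoring='neg_median_absolute_error')
clf.fit(X_train, y_train)   # returns the fitted search object, not predictions
best = clf.best_estimator_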

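This attempt gives a median absolute error of 23.64 dollars. GradientBoostingRegressor already beats every method in the previous round: without any tuning, its median error is about 20% lower than the best error in that group (from BayesianRidge).

Let's look at the error at each boosting stage, which can help us spot problems in the iterative process.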

# plot error for each round of boosting
test_score = np.zeros(n_est, dtype=np.float64)
 
train_score = best.train_score_
for i, y_pred in enumerate(best.staged_predict(X_test)):
   test_score[i] = best.loss_(y_test, y_pred)
 
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(np.arange(n_est), train_score, 'darkblue', label='Training Set Error')
plt.plot(np.arange(n_est), test_score, 'red', label='Test Set Error')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Least Absolute Deviation')

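The curve flattens on the right, after roughly 200-250 iterations, but the error is still improving with more iterations, so we increase the number of iterations to 500.

Next we use GridSearchCV to try many more combinations of hyperparameters, which takes a lot of CPU and several hours. The spark-sklearn integration cuts the running time and, because a larger search becomes practical, lowers the error.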

from pyspark import SparkContext, SparkConf
from spark_sklearn import GridSearchCV
 
conf = SparkConf()
sc = SparkContext(conf=conf)
clf = GridSearchCV(sc, gbr, cv=3, param_grid=tuned_parameters, scoring='neg_median_absolute_error')
clf.fit(X_train, y_train)   # run the distributed search before reading best_estimator_

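It is worth looking at the architecture of the spark-sklearn integration at this point. spark-sklearn distributes the cross-validation across Spark executors, with each individual model fit running on a single executor. Spark MLlib, by contrast, distributes the computation of the machine learning algorithm itself across the cluster. The main advantage of spark-sklearn is that it brings in scikit-learn's rich collection of models, whose algorithms can run in parallel on a single machine but cannot by themselves run across a cluster.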
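
The reason this works is that every (parameter combination, fold) pair is an independent task. The sketch below illustrates the idea with a local thread pool standing in for Spark executors; it does not use Spark or spark-sklearn, and the `toy_cv_error` function is a made-up error surface in place of a real model fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def toy_cv_error(task):
    """Stand-in for 'fit on the training folds, score on the held-out fold'."""
    params, fold = task
    # hypothetical error surface, lowest at max_depth=4 and learning_rate=0.1
    return ((params["max_depth"] - 4) ** 2
            + abs(params["learning_rate"] - 0.1) * 10
            + 0.01 * fold)

grid = [{"max_depth": d, "learning_rate": lr}
        for d, lr in product([2, 4, 6], [0.01, 0.1])]
tasks = [(params, fold) for params in grid for fold in range(3)]

# each task is independent, so the tasks can be handed to any free worker
with ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(toy_cv_error, tasks))

# collect the per-fold errors of each candidate and pick the lowest mean
fold_errors = {}
for (params, _fold), err in zip(tasks, errors):
    key = (params["max_depth"], params["learning_rate"])
    fold_errors.setdefault(key, []).append(err)
best = min(fold_errors, key=lambda k: sum(fold_errors[k]) / len(fold_errors[k]))
print(best)
```

Because no task depends on another, adding workers shortens the search almost linearly until there are more workers than tasks.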

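With this approach the tuned median absolute error is 21.43 dollars, and the running time is shorter as well (shown in a figure in the original article). The cluster had 4 nodes, with the job submitted in Spark YARN client mode; each node was configured as follows: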
Machine: HP DL380 G6
Memory: 128G
CPU: (2x) Intel X5560
Disk: (6x) 1TB 7200RPM disks

Finally, let's look at feature importance. The following code plots the relative importance of each feature.

feature_importance = clf.best_estimator_.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
pvals = feature_importance[sorted_idx]
pcols = X_train.columns[sorted_idx]
plt.figure(figsize=(8,12))
plt.barh(pos, pvals, align='center')
plt.yticks(pos, pcols)
plt.xlabel('Relative Importance')
plt.title('Variable Importance')


Clearly some variables matter more than others; the most important feature is Entire home/apt.
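
The plotting code above rescales the importances so that the strongest feature reads 100 and the rest are shown relative to it. As a minimal sketch of that normalization without NumPy, assuming made-up importance values (the numbers below are not the model's output):

```python
def relative_importance(importances):
    """Scale importances so the largest is 100, sorted from least to most important."""
    top = max(importances.values())
    scaled = {name: 100.0 * value / top for name, value in importances.items()}
    return sorted(scaled.items(), key=lambda item: item[1])

# hypothetical raw importances that sum to 1
raw = {
    "Entire home/apt": 0.5,
    "accommodates": 0.25,
    "availability_30": 0.125,
    "beds": 0.125,
}

ranked = relative_importance(raw)
for name, score in ranked:
    print(f"{name:20s} {score:6.1f}")
```

Scaling by the maximum changes only the axis, not the ranking, which makes the bar chart easier to read at a glance.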

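Conclusion

This example showed how to predict listing prices from multiple variables and then use spark-sklearn for distributed cross-validation and hyperparameter search. A few takeaways:

Ensemble methods such as GradientBoostingRegressor give better results than the individual linear methods;
GridSearchCV lets us test more hyperparameter combinations to reach a better result;
spark-sklearn saves CPU and wall-clock time and lowers the evaluation error.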

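About the Translator

侠天 focuses on big data, machine learning, and mathematics, and shares related technical articles through the personal WeChat public account bigdata_ny.

View the original English article: Predicting Airbnb Listing Prices with Scikit-Learn and Apache Spark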
