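I first translated the features of the provided housing data. A case class is defined here to make the meaning of each feature easier to follow.
The Kaggle house-price dataset covers home sales in Ames, Iowa, USA, from 2006 to 2010.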
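case class house(
  Id: String,
  MSSubClass: String,    // type of dwelling involved in the sale (includes age and style information)
  MSZoning: String,      // zoning classification: agricultural, commercial, residential, etc.
  LotFrontage: String,   // linear feet of street connected to the property
  LotArea: String,       // lot size in square feet
  Street: String,        // type of road access to the property (gravel or paved)
  Alley: String,         // type of alley access to the property
  LotShape: String,      // shape of the lot, how regular it is
  LandContour: String,   // flatness of the property
  Utilities: String,     // utilities available: electricity, gas, water, sewer
  LotConfig: String,     // lot configuration: cul-de-sac, corner lot, etc.
  LandSlope: String,     // slope of the property
  Neighborhood: String,  // neighborhood (location within the Ames city limits)
  Condition1: String,    // proximity to a main road, railroad, etc.
  Condition2: String,    // second proximity condition, if more than one applies
  BldgType: String,      // type of dwelling: single-family, duplex, townhouse, etc.
  HouseStyle: String,    // style of dwelling: number of stories, split level, etc.
  OverallQual: String,   // overall material and finish quality
  OverallCond: String,   // overall condition of the house
  YearBuilt: String,     // year built
  YearRemodAdd: String,  // year remodeled
  RoofStyle: String,     // type of roof
  RoofMatl: String,      // roof material
  Exterior1st: String,   // exterior covering material
  Exterior2nd: String,   // second exterior covering material, if more than one
  MasVnrType: String,    // masonry veneer type
  MasVnrArea: String,    // masonry veneer area
  ExterQual: String,     // exterior material quality
  ExterCond: String,     // current condition of the exterior material
  Foundation: String,    // type of foundation
  BsmtQual: String,      // basement quality (height)
  BsmtCond: String,      // general condition of the basement
  BsmtExposure: String,  // basement exposure (walkout or garden-level walls)
  BsmtFinType1: String,  // rating of the finished basement area
  BsmtFinSF1: String,    // finished basement area (type 1), in square feet
  BsmtFinType2: String,  // rating of the second finished basement area, if any
  BsmtFinSF2: String,    // finished basement area (type 2), in square feet
  BsmtUnfSF: String,     // unfinished basement area
  TotalBsmtSF: String,   // total basement area
  Heating: String,       // type of heating
  HeatingQC: String,     // heating quality and condition
  CentralAir: String,    // whether there is central air conditioning
  Electrical: String,    // electrical system
  _1stFlrSF: String,     // first-floor area
  _2ndFlrSF: String,     // second-floor area
  LowQualFinSF: String,  // low-quality finished area (all floors)
  GrLivArea: String,     // above-ground living area
  BsmtFullBath: String,  // number of full bathrooms in the basement
  BsmtHalfBath: String,  // number of half bathrooms in the basement
  FullBath: String,      // number of full bathrooms above ground
  HalfBath: String,      // number of half bathrooms above ground
  BedroomAbvGr: String,  // number of bedrooms above ground
  KitchenAbvGr: String,  // number of kitchens above ground
  KitchenQual: String,   // kitchen quality
  TotRmsAbvGrd: String,  // total rooms above ground (excluding bathrooms)
  Functional: String,    // home functionality rating
  Fireplaces: String,    // number of fireplaces
  FireplaceQu: String,   // fireplace quality
  GarageType: String,    // garage type
  GarageYrBlt: String,   // year the garage was built
  GarageFinish: String,  // interior finish of the garage
  GarageCars: String,    // garage capacity in cars
  GarageArea: String,    // garage area
  GarageQual: String,    // garage quality
  GarageCond: String,    // garage condition
  PavedDrive: String,    // whether the driveway is paved
  WoodDeckSF: String,    // wood deck area
  OpenPorchSF: String,   // open porch area
  EnclosedPorch: String, // enclosed porch area
  _3SsnPorch: String,    // three-season porch area
  ScreenPorch: String,   // screen porch area
  PoolArea: String,      // pool area
  PoolQC: String,        // pool quality
  Fence: String,         // fence quality
  MiscFeature: String,   // miscellaneous feature not covered by the other fields
  MiscVal: String,       // dollar value of the miscellaneous feature
  MoSold: String,        // month sold
  YrSold: String,        // year sold
  SaleType: String,      // type of sale
  SaleCondition: String, // condition of sale
  SalePrice: String      // sale price, the value to predict
)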
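After loading the data, run describe first to check for missing data and to see the mean, sample standard deviation, minimum, and maximum of each column.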
+-------+-----------------+------------------+--------+-----------------+------------------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+------------------+------------------+------------------+------------------+---------+--------+-----------+-----------+----------+------------------+---------+---------+----------+--------+--------+------------+------------+-----------------+------------+-----------------+-----------------+------------------+-------+---------+----------+----------+-----------------+------------------+-----------------+-----------------+-------------------+--------------------+------------------+-------------------+------------------+-------------------+-----------+------------------+----------+------------------+-----------+----------+------------------+------------+------------------+-----------------+----------+----------+----------+------------------+-----------------+------------------+------------------+------------------+-----------------+------+-----+-----------+------------------+------------------+------------------+--------+-------------+------------------+|summary| Id| MSSubClass|MSZoning| LotFrontage| LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition2|BldgType|HouseStyle| OverallQual| OverallCond| YearBuilt| YearRemodAdd|RoofStyle|RoofMatl|Exterior1st|Exterior2nd|MasVnrType| MasVnrArea|ExterQual|ExterCond|Foundation|BsmtQual|BsmtCond|BsmtExposure|BsmtFinType1| BsmtFinSF1|BsmtFinType2| BsmtFinSF2| BsmtUnfSF| TotalBsmtSF|Heating|HeatingQC|CentralAir|Electrical| _1stFlrSF| _2ndFlrSF| LowQualFinSF| GrLivArea| BsmtFullBath| BsmtHalfBath| FullBath| HalfBath| BedroomAbvGr| KitchenAbvGr|KitchenQual| TotRmsAbvGrd|Functional| Fireplaces|FireplaceQu|GarageType| GarageYrBlt|GarageFinish| GarageCars| GarageArea|GarageQual|GarageCond|PavedDrive| WoodDeckSF| OpenPorchSF| EnclosedPorch| _3SsnPorch| ScreenPorch| 
PoolArea|PoolQC|Fence|MiscFeature| MiscVal| MoSold| YrSold|SaleType|SaleCondition| SalePrice|+-------+-----------------+------------------+--------+-----------------+------------------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+------------------+------------------+------------------+------------------+---------+--------+-----------+-----------+----------+------------------+---------+---------+----------+--------+--------+------------+------------+-----------------+------------+-----------------+-----------------+------------------+-------+---------+----------+----------+-----------------+------------------+-----------------+-----------------+-------------------+--------------------+------------------+-------------------+------------------+-------------------+-----------+------------------+----------+------------------+-----------+----------+------------------+------------+------------------+-----------------+----------+----------+----------+------------------+-----------------+------------------+------------------+------------------+-----------------+------+-----+-----------+------------------+------------------+------------------+--------+-------------+------------------+| count| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460| 1460|| mean| 730.5|56.897260273972606| null|70.04995836802665|10516.828082191782| null| null| null| null| null| null| null| null| null| null| null| null|6.0993150684931505| 5.575342465753424| 1971.267808219178|1984.8657534246574| null| null| 
null| null| null|103.68526170798899| null| null| null| null| null| null| null|443.6397260273973| null|46.54931506849315|567.2404109589041|1057.4294520547944| null| null| null| null|1162.626712328767|346.99246575342465|5.844520547945206|1515.463698630137|0.42534246575342466|0.057534246575342465|1.5650684931506849|0.38287671232876713|2.8664383561643834| 1.0465753424657533| null| 6.517808219178082| null| 0.613013698630137| null| null|1978.5061638868744| null|1.7671232876712328|472.9801369863014| null| null| null| 94.2445205479452|46.66027397260274|21.954109589041096|3.4095890410958902|15.060958904109588|2.758904109589041| null| null| null|43.489041095890414| 6.321917808219178|2007.8157534246575| null| null|180921.19589041095|| stddev|421.6100093688479| 42.30057099381045| null|24.28475177448321| 9981.26493237915| null| null| null| null| null| null| null| null| null| null| null| null|1.3829965467415926|1.1127993367127318|30.202904042525294| 20.64540680770938| null| null| null| null| null|181.06620658721647| null| null| null| null| null| null| null|456.0980908409278| null|161.3192728065416|441.8669552924343| 438.7053244594709| null| null| null| null|386.5877380410744| 436.528435886257|48.62308143352024|525.4803834232024| 0.5189106060898061| 0.23875264627921197|0.5509158012954318| 0.5028853810928912|0.8157780441442279|0.22033819838403076| null|1.6253932905840511| null|0.6446663863122297| null| null| 24.68972476859027| null|0.7473150101111095|213.8048414533803| null| null| null|125.33879435172422| 66.2560276766497| 61.11914860172857|29.317330556781872| 55.75741528187416|40.17730694453021| null| null| null| 496.1230244579441|2.7036262083595113|1.3280951205521145| null| null| 79442.50288288663|| min| 1| 120| C (all)| 100| 10000| Grvl| Grvl| IR1| Bnk| AllPub| Corner| Gtl| Blmngtn| Artery| Artery| 1Fam| 1.5Fin| 1| 1| 1872| 1950| Flat| ClyTile| AsbShng| AsbShng| BrkCmn| 0| Ex| Ex| BrkTil| Ex| Fa| Av| ALQ| 0| ALQ| 0| 0| 0| Floor| Ex| N| FuseA| 1001| 0| 0| 1002| 0| 0| 0| 0| 0| 0| 
Ex| 10| Maj1| 0| Ex| 2Types| 1900| Fin| 0| 0| Ex| Ex| N| 0| 0| 0| 0| 0| 0| Ex|GdPrv| Gar2| 0| 1| 2006| COD| Abnorml| 100000|| max| 999| 90| RM| NA| 9991| Pave| Pave| Reg| Lvl| NoSeWa| Inside| Sev| Veenker| RRNn| RRNn| TwnhsE| SLvl| 9| 9| 2010| 2010| Shed| WdShngl| WdShing| Wd Shng| Stone| NA| TA| TA| Wood| TA| TA| No| Unf| 998| Unf| 972| 999| 999| Wall| TA| Y| SBrkr| 999| 998| 80| 999| 3| 2| 3| 2| 8| 3| TA| 9| Typ| 3| TA| NA| NA| Unf| 4| 995| TA| TA| Y| 98| 99| 99| 96| 99| 738| NA| NA| TenC| 8300| 9| 2010| WD| Partial| 99900|+-------+-----------------+------------------+--------+-----------------+------------------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+------------------+------------------+------------------+------------------+---------+--------+-----------+-----------+----------+------------------+---------+---------+----------+--------+--------+------------+------------+-----------------+------------+-----------------+-----------------+------------------+-------+---------+----------+----------+-----------------+------------------+-----------------+-----------------+-------------------+--------------------+------------------+-------------------+------------------+-------------------+-----------+------------------+----------+------------------+-----------+----------+------------------+------------+------------------+-----------------+----------+----------+----------+------------------+-----------------+------------------+------------------+------------------+-----------------+------+-----+-----------+------------------+------------------+------------------+--------+-------------+------------------+
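One thing to watch in the output above: every field in the case class is a String, so the min and max rows appear to be string comparisons rather than numeric ones (Id has min 1 and max 999, MSSubClass has min 120 and max 90), and count is 1460 even for columns whose max is the literal NA. Below is a minimal sketch, in plain Scala without Spark, of how the numeric statistics could be recomputed once the cells are parsed. The object and function names are hypothetical and not part of the original code.

```scala
object SummaryStats {
  // Parse a raw CSV cell; "NA" or any other non-numeric text counts as missing.
  def parse(cell: String): Option[Double] =
    scala.util.Try(cell.trim.toDouble).toOption

  final case class Summary(
      count: Int,     // number of numeric values
      missing: Int,   // number of cells that could not be parsed
      mean: Double,
      stddev: Double, // sample standard deviation (divides by n - 1)
      min: Double,
      max: Double)

  // Summarize one column; assumes at least two numeric values are present.
  def summarize(cells: Seq[String]): Summary = {
    val xs = cells.flatMap(parse)
    val n = xs.size
    val mean = xs.sum / n
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    Summary(n, cells.size - n, mean, math.sqrt(variance), xs.min, xs.max)
  }
}
```

With this, a call such as SummaryStats.summarize(Seq("65", "NA", "80", "95")) reports one missing cell instead of counting it, and min and max are compared as numbers.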
First, look at the distribution of sale prices. Most houses sold for around 150,000 USD, which shows how old this dataset is.
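1. Start with the numeric features, such as LotFrontage (read here as the distance from the house to the street; the Kaggle data description defines it as the linear feet of street connected to the property) and LotArea (the lot size, including the garage, yard, and so on).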
Analysis: some LotFrontage values are NA, so those were treated as a distance of 0.
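The NA handling described above can be sketched as follows. This is only an illustration of the rule "NA becomes 0"; the function name is hypothetical, and treating a blank cell the same way as NA is an assumption that the text does not state.

```scala
object LotFrontageCleaner {
  // Map "NA" to 0.0 as described in the text; other values are parsed as numbers.
  // Treating an empty cell like "NA" is an assumption made for this sketch.
  def frontageOrZero(raw: String): Double = raw.trim match {
    case "NA" | "" => 0.0
    case s         => s.toDouble
  }
}
```

Filling with 0 puts all the missing rows at the left edge of the plot, which is worth keeping in mind when reading the relationship with price.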
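By our intuition, the closer a house is to the street, the higher the price should be (it is easier to get around), but the data does not show this. Possibly because private cars are so common in households abroad, the relationship between LotFrontage and sale price is not very pronounced.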
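To put a number on "not very pronounced" instead of judging from the plot alone, one option is the Pearson correlation coefficient between the feature and the sale price. The sketch below is plain Scala with hypothetical names; it is not the author's code, and no result for this dataset is claimed here.

```scala
object Correlation {
  // Pearson correlation coefficient of two samples of equal length.
  // Returns NaN if either sample is constant (zero variance).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal length > 1")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }
}
```

A value near 0 would support the reading that the two are only weakly related; values near 1 or -1 indicate a strong linear relationship.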
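Next, look at the relationship between lot area and sale price. To make it more intuitive, I converted the unit from square feet to square meters.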
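The unit conversion is a single multiplication. A small sketch with hypothetical names, using the exact definition 1 ft = 0.3048 m:

```scala
object AreaUnits {
  // 1 ft = 0.3048 m exactly, so 1 sq ft = 0.3048 * 0.3048 = 0.09290304 m^2.
  val SquareMetersPerSquareFoot: Double = 0.3048 * 0.3048

  // Convert an area in square feet to square meters.
  def sqFtToSqM(sqFt: Double): Double = sqFt * SquareMetersPerSquareFoot
}
```

For reference, the mean LotArea of about 10,517 square feet from the describe output corresponds to roughly 977 square meters.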
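The lot areas in the dataset peak at around 900 square meters, and lot area is positively correlated with sale price (no surprise).
Next, look at how the first-floor and second-floor areas relate to sale price.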
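The first-floor area and the second-floor area (if there is a second floor) both relate to sale price fairly linearly, apart from a few outliers.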
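One way to check the "fairly linear, apart from a few outliers" observation is to fit a straight line by ordinary least squares and flag the points that sit far from it. The sketch below is plain Scala with hypothetical names; the threshold of k times the root-mean-square residual is an arbitrary choice for illustration, not something taken from the original analysis.

```scala
object LinearFit {
  final case class Line(slope: Double, intercept: Double) {
    def predict(x: Double): Double = slope * x + intercept
  }

  // Ordinary least squares fit of y = slope * x + intercept.
  def fit(xs: Seq[Double], ys: Seq[Double]): Line = {
    require(xs.size == ys.size && xs.size > 1, "need two samples of equal length > 1")
    val mx = xs.sum / xs.size
    val my = ys.sum / ys.size
    val sxx = xs.map(x => (x - mx) * (x - mx)).sum
    val sxy = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val slope = sxy / sxx
    Line(slope, my - slope * mx)
  }

  // Indices of points whose absolute residual exceeds k times the RMS residual.
  def outliers(xs: Seq[Double], ys: Seq[Double], k: Double = 2.0): Seq[Int] = {
    val line = fit(xs, ys)
    val residuals = xs.zip(ys).map { case (x, y) => y - line.predict(x) }
    val rms = math.sqrt(residuals.map(r => r * r).sum / residuals.size)
    residuals.zipWithIndex.collect { case (r, i) if math.abs(r) > k * rms => i }
  }
}
```

Here xs would be a floor area column (for example _1stFlrSF) and ys the SalePrice column, both already parsed to Double.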