Efficiently Scrape Rental Listings in First-Tier Cities!

一如老师 2024-07-02

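Hi everyone! Today I'd like to share a very practical Python web-scraping tutorial. We'll use multithreaded concurrent requests to scrape rental listings from Lianjia and save them so they can be analyzed and visualized. 🎉

Rental information matters to a lot of people, especially in first-tier cities. By scraping and analyzing rental data, we can get a clearer picture of the market and a reliable reference when looking for a place to live. Let's get started! 💻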

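Project Overview

The main goals of this project are:

No page limit: the code keeps requesting pages until it reaches one with no data.
Concurrent requests: concurrent.futures.ThreadPoolExecutor runs the requests on multiple threads to speed up the crawl.
Dynamic task allocation: new requests are submitted continuously during the crawl until there is no more data.
Exception handling: exceptions are handled inside the concurrent requests so that a single failed page does not bring down the whole crawl.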
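
The four goals above can be tried out without touching the network. The following is a minimal sketch, not the tutorial's own code: `crawl_until_empty`, `fake_fetch`, and the `max_pages` safety cap are hypothetical names introduced here for illustration, and `fake_fetch` simply pretends that pages 1 to 5 have data and that page 3 fails.

```python
import concurrent.futures


def crawl_until_empty(fetch_page, max_workers=4, max_pages=1000):
    """Fetch pages 1, 2, 3, ... concurrently until a page returns no rows.

    fetch_page is a hypothetical callable: page number -> list of rows.
    max_pages is a safety cap so the loop ends even if every page fails.
    """
    results = {}   # page number -> rows
    failed = []    # pages whose fetch raised an exception
    next_page = 1
    stop = False
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = {}
        while True:
            # Dynamic task allocation: top up the pool until an empty page is seen.
            while not stop and len(pending) < max_workers and next_page <= max_pages:
                pending[executor.submit(fetch_page, next_page)] = next_page
                next_page += 1
            if not pending:
                break
            done, _ = concurrent.futures.wait(
                list(pending), return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                page = pending.pop(future)
                try:
                    rows = future.result()
                except Exception:
                    failed.append(page)  # one bad page does not abort the crawl
                    continue
                if rows:
                    results[page] = rows
                else:
                    stop = True  # stop submitting, but drain requests in flight
    return results, sorted(failed)


def fake_fetch(page):
    """Stand-in for a real request: pages 1-5 have data, page 3 fails."""
    if page == 3:
        raise RuntimeError('simulated network error')
    return [f'row-{page}'] if page <= 5 else []


if __name__ == '__main__':
    results, failed = crawl_until_empty(fake_fetch)
    print(sorted(results), failed)
```

Because pages finish in whatever order the threads complete, the sketch keys the results by page number so they can be sorted afterwards.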

Code Walkthrough

Here is the complete code, with detailed comments to make it easy to follow. 🚀

import requests
from bs4 import BeautifulSoup
import pandas as pd
import concurrent.futures
import time

# Fetch the HTML content of a page
def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
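    response = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled request cannot hang a worker thread
    response.raise_for_status()  # raise an exception if the request failed (4xx/5xx)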
    return response.text

# Parse the HTML and extract the rental listings
def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    items = soup.find_all('div', class_='content__list--item--main')
    if not items:
        return None

    for item in items:
        rent_type_tag = item.find('p', class_='content__list--item--title twoline')
        location_tag = item.find('p', class_='content__list--item--des')
        area_tag = item.find('span', class_='content__list--item--area')
        direction_tag = item.find('span', class_='content__list--item--direction')
        price_tag = item.find('span', class_='content__list--item-price')

        rent_type = rent_type_tag.text.strip() if rent_type_tag else 'N/A'
        location = location_tag.text.strip().replace('\n', '').replace(' ', '') if location_tag else 'N/A'
        area = area_tag.text.strip() if area_tag else 'N/A'
        direction = direction_tag.text.strip() if direction_tag else 'N/A'
        price = price_tag.text.strip() if price_tag else 'N/A'

        data.append([rent_type, location, area, direction, price])
    return data

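# Scrape a single page; request errors propagate so the caller can tell a failed page from an empty one
def fetch_page_data(page, base_url):
    url = base_url.format(page)
    print(f'Fetching page {page}...')
    html = get_html(url)
    return parse_html(html)

# Save the data to a CSV file
def save_to_csv(data, filename):
    df = pd.DataFrame(data, columns=['Rent Type', 'Location', 'Area', 'Direction', 'Price'])
    df.to_csv(filename, index=False, encoding='utf-8-sig')  # utf-8-sig writes a BOM so Excel detects UTF-8

# Main function: run the crawl
def main(city_code, max_workers=10, max_failures=20):
    base_url = f'https://{city_code}.lianjia.com/zufang/pg{{}}'
    all_data = []
    page = 1
    failures = 0
    stop_submitting = False
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {}
        while True:
            # Dynamic task allocation: keep up to max_workers requests in flight
            while not stop_submitting and len(future_to_page) < max_workers:
                future = executor.submit(fetch_page_data, page, base_url)
                future_to_page[future] = page
                page += 1

            if not future_to_page:
                break

            # Wait for at least one request to finish, then handle everything that is done
            done, _ = concurrent.futures.wait(
                list(future_to_page), return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                page_number = future_to_page.pop(future)
                try:
                    data = future.result()
                except Exception as exc:
                    # A failed page is logged and skipped; it does not end the crawl
                    print(f'Page {page_number} generated an exception: {exc}')
                    failures += 1
                    if failures >= max_failures:
                        print('Too many failed pages. Stopping.')
                        stop_submitting = True
                    continue
                if data:
                    all_data.extend(data)
                else:
                    # An empty page marks the end; let in-flight requests finish
                    print(f'No data found on page {page_number}. Stopping.')
                    stop_submitting = True

    save_to_csv(all_data, f'{city_code}zuf.csv')
    print(f'Saved data to {city_code}zuf.csv')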

if __name__ == '__main__':
    # city_code = 'bj'  # 'bj' is Beijing; swap in another city code as needed
    city_code = input("Enter the city code (e.g. 'bj' for Beijing): ").strip().lower()
    main(city_code)

How Do I Run the Code?

Install dependencies: make sure the requests, beautifulsoup4, and pandas libraries are installed. You can install them with:
```
pip install requests beautifulsoup4 pandas
```
Run the code: save the code to a Python file (for example rent_spider.py), then run it in a terminal:
```
python rent_spider.py
```
Enter the city code: when prompted, type the city's abbreviation, for example bj for Beijing.
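
Whatever is typed at the prompt goes straight into the hostname, so a stray space or capital letter can produce a URL that never resolves. As a rough sketch, a small helper could normalize and check the code before any request is made. `build_page_url` is a hypothetical helper, and the rule that a city code is made of lowercase ASCII letters only is an assumption based on the `bj` example, not something checked against the site's actual list of cities.

```python
import re


def build_page_url(city_code, page):
    """Return the listing URL for one results page of a city.

    Assumes city codes contain only ASCII letters, as in 'bj'.
    """
    code = city_code.strip().lower()
    if not re.fullmatch(r'[a-z]+', code):
        raise ValueError(f'Invalid city code: {city_code!r}')
    if page < 1:
        raise ValueError('page must be >= 1')
    # Same URL pattern as base_url in the main function above
    return f'https://{code}.lianjia.com/zufang/pg{page}'


if __name__ == '__main__':
    print(build_page_url(' BJ ', 3))
```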

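Data Analysis and Visualization

Next, we can analyze and visualize the scraped data. See my earlier tutorials for more data analysis and visualization techniques. 📊📈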
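
Before any analysis, the scraped fields need to become numbers, since the scraper stores area and price as raw text. Here is a minimal sketch of that cleaning step using only the standard library. The text formats (`65.00㎡`, `5200 元/月`) are assumptions about what the page returns, `first_number` and `median_price` are hypothetical helpers, and `sample_rows` is made-up data in the same column order as the CSV, not real listings.

```python
import re
import statistics


def first_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


def median_price(rows):
    """Median of the Price column (index 4), ignoring rows without a number."""
    prices = [first_number(row[4]) for row in rows]
    prices = [p for p in prices if p is not None]
    return statistics.median(prices) if prices else None


# Made-up rows: Rent Type, Location, Area, Direction, Price
sample_rows = [
    ['A', 'X', '65.00㎡', 'S', '5200 元/月'],
    ['B', 'Y', 'N/A', 'N', '3000 元/月'],
    ['C', 'Z', '40㎡', 'E', 'N/A'],
]

if __name__ == '__main__':
    print(median_price(sample_rows))
```

For a price range such as `5000-6000 元/月`, `first_number` keeps only the lower bound, which may or may not be what your analysis needs.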

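I hope this tutorial helps. Feel free to share your own scraping experience and results in the comments! If you have any questions, leave a message and I'll reply as soon as I can. 😉

Tips:

Watch your request rate when scraping, so you don't put too much load on the server.
Multithreading can speed up scraping significantly, but keep the level of concurrency under control.
Data analysis and visualization are important steps that help us understand the data better.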
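
One way to act on the first two tips is to share a single rate limiter between all worker threads, so the total request rate stays bounded no matter how many threads are running. The class below is a minimal sketch: `RateLimiter` and its `min_interval` parameter are hypothetical names, not part of the tutorial's code, and a suitable interval depends on the site.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call every min_interval seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Reserve the next free time slot, then sleep until it arrives."""
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other threads can reserve slots


if __name__ == '__main__':
    limiter = RateLimiter(0.5)
    for i in range(3):
        limiter.wait()
        print(f'request {i} at {time.monotonic():.2f}')
```

To use it with the scraper, you could create one limiter at module level and call `limiter.wait()` at the top of `get_html`, just before `requests.get`.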

Thanks for reading! I look forward to your feedback and suggestions! 🙏

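This article is reposted from 一如老师. If you believe it infringes your rights, please send a report with supporting evidence to contact@modb.pro; once verified, 墨天轮 will remove the content immediately.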