The code I wrote in 30 minutes to crawl all of Stack Overflow~

小马爸爸的笔记 2020-02-20


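I admit the title was a bit of a gamble, but the gamble paid off. I meant today's post to be filler, but it does come with code you can use as-is: a small PHP crawler for the questions and answers on Stack Overflow. Run it for a while and you will have enough data to put together a Chinese version of Stack Overflow.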

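It is built on phpspider, the framework reputed to have crawled Zhihu. The framework's author previously wrote an article titled "I Used a Crawler to 'Steal' One Million Zhihu Users in One Day, Just to Prove That PHP Is the Best Language in the World"; search for it if you are interested.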

The framework's author also left an Easter egg in the code. It is pointless and a little obnoxious, so just strip it out.

    <?php
    require_once __DIR__ . '/autoloader.php';
    use phpspider\core\phpspider;

    $configs = array(
        'name' => 'stackoverflow',
        'tasknum' => 1,
        'log_show' => true, // boolean, not the string 'true'
        'domains' => array(
            'stackoverflow.com'
        ),
        // Entry point: the tag index
        'scan_urls' => array(
            'https://stackoverflow.com/tags',
        ),
        // Question pages (the pages fields are extracted from)
        'content_url_regexes' => array(
            "https://stackoverflow.com/questions/\d+/*",
            "/questions/\d+/*"
        ),
        // Per-tag question lists
        'list_url_regexes' => array(
            "https://stackoverflow.com/questions/tagged/*",
            "/questions/tagged/*"
        ),
        'fields' => array(
            array(
                // Question title
                'name' => "question_title",
                'selector' => "//a[@class='question-hyperlink']",
                'required' => false
            ),
            array(
                // Question body
                'name' => "question_content",
                'selector' => "//div[@id='question']//div[@class='post-text']",
                'required' => false,
                'repeated' => true
            ),
            array(
                // Question author(s)
                'name' => "question_authors",
                'selector' => "//div[@id='question']//div[@class='user-details']//a",
                'required' => false,
                'repeated' => true
            ),
            array(
                // Tags
                'name' => "question_tags",
                'selector' => "//div[@id='question']//div[@class='grid ps-relative d-block']//a",
                'repeated' => true
            ),
            array(
                // Question vote count
                'name' => 'question_vote_num',
                'selector' => "//div[@id='question']//div[@class='js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center']"
            ),
            array(
                // Answer bodies
                'name' => 'question_answers_contents',
                'selector' => "//div[@id='answers']//div[@class='post-text']",
                'repeated' => true
            ),
            array(
                // Answer vote counts
                'name' => 'question_answers_votes',
                'selector' => "//div[@id='answers']//div[@class='js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center']",
                'repeated' => true
            ),
            array(
                // Answer authors
                'name' => 'question_answers_authors',
                'selector' => "//div[@id='answers']//div[@class='user-details']//a",
                'repeated' => true
            ),
            array(
                // Class attribute of each answer, used to tell whether it was accepted
                'name' => 'question_answers_checks',
                'selector' => "//div[@id='answers']//div[@data-answerid]/@class",
                'repeated' => true
            ),
        ),
        'export' => array(
            'type' => 'sql',
            'table' => 'stackoverflow',
            'file' => './stackoverflow.sql', // under the data directory
        ),
    );

    $spider = new phpspider($configs);
    $spider->on_extract_field = function ($fieldname, $data, $page) {
        // Map each answer's class attribute to "1" (accepted) or "0" (not accepted)
        if ($fieldname === "question_answers_checks") {
            $data = (array) $data; // guard against null when nothing matched
            foreach ($data as $i => $class) {
                $data[$i] = (strpos($class, "accepted") === false) ? "0" : "1";
            }
        }
        // Numeric fields that only need to be joined
        $needJoinFields = array(
            "question_answers_votes",
            "question_answers_checks",
        );
        if (in_array($fieldname, $needJoinFields)) {
            return implode(",", (array) $data);
        }
        // Text fields that need encoding before joining;
        // base64 output never contains a comma, so "," is a safe separator
        $needEncodeFields = array(
            "question_answers_contents",
            "question_answers_authors",
            "question_authors",
            "question_tags",
        );
        if (in_array($fieldname, $needEncodeFields)) {
            return implode(",", array_map('base64_encode', (array) $data));
        }
        return $data;
    };
    $spider->start();
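
Because the callback stores each repeated text field as comma-joined base64 strings, anything that reads the exported table later has to reverse that encoding. Here is a minimal sketch of that step, assuming the column value has already been fetched as a plain string; `decode_joined_field` is a hypothetical helper of my own naming, not part of phpspider.

```php
<?php
// Hypothetical helper (not part of phpspider): reverses the
// base64 + comma encoding applied in on_extract_field.
function decode_joined_field(string $stored): array
{
    if ($stored === '') {
        return array(); // nothing was extracted for this field
    }
    $items = array();
    foreach (explode(',', $stored) as $chunk) {
        // Strict mode returns false on malformed input instead of garbage
        $decoded = base64_decode($chunk, true);
        if ($decoded !== false) {
            $items[] = $decoded;
        }
    }
    return $items;
}

// Round trip with made-up tag values
$stored = implode(',', array_map('base64_encode', array('php', 'web-scraping')));
print_r(decode_joined_field($stored));
```

The numeric fields (`question_answers_votes`, `question_answers_checks`) are joined without base64, so a plain `explode(',', ...)` is enough for those.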

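While using it I found that this framework does not handle lists inside a content page well. Joining them into a single string, as done here, is not very elegant, but I have no better idea for now. If you have one, let's discuss.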
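
One possible direction, offered as a sketch rather than a tested phpspider recipe: serialize each repeated field to a JSON string in the callback, so the list structure survives in a single text column and can be restored with `json_decode`. `pack_list_field` is a hypothetical helper, and how phpspider's SQL export escapes the resulting string is an assumption you would need to verify.

```php
<?php
// Hypothetical helper (not part of phpspider): turn a repeated field
// (an array of strings) into one JSON string for a single text column.
function pack_list_field($data): string
{
    $items = array_values((array) $data); // tolerate null or a lone string
    // JSON_UNESCAPED_UNICODE keeps non-ASCII text readable in the column
    $json = json_encode($items, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    return $json === false ? '[]' : $json; // fall back if encoding fails
}

$packed = pack_list_field(array('php', 'xpath'));
echo $packed, PHP_EOL;                 // ["php","xpath"]
var_dump(json_decode($packed, true));  // back to the original array
```

Inside `on_extract_field`, the repeated fields would then simply `return pack_list_field($data);` instead of going through `base64_encode` and `implode`.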

[phpspider] https://github.com/owner888/phpspider


This article is reposted from 小马爸爸的笔记.
