scrapy

Tools:

Xpath scrapy-splash splash doc Using Splash beautifulsoup js2xml PhantomJS high-availability IP proxy pool haipproxy pillow (image processing library)

docker run -p 8050:8050 scrapinghub/splash
docker run -it -p 8050:8050 --rm scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles

PyPI.org scrapy-user-agents scrapy-proxies-tool: usage notes

Scrapy-redis scrapy-redis quick start / converting a Scrapy crawler to distributed; analysis of the scrapy-redis distributed crawler start-up process

Scrapy-redis, closing the spider:

https://blog.csdn.net/mr_hui_/article/details/81432952 https://www.thinbug.com/q/45540569 https://www.coder.work/article/545592

Selenium and PhantomJS

FAQ

AutoThrottle extension (automatic throttling)

User_Agent

scrapy startproject tutorial  # create a project
scrapy genspider mydomain mydomain.com  # create a new spider
scrapy shell 'url'
scrapy crawl somespider -s JOBDIR=crawls/somespider-1  # pause and resume a crawl
scrapy crawl quotes -o quotes.json 
scrapy crawl quotes -o quotes.jl

Selenium

DOC

DOC (Chinese)

Chinese manual

scrapy-selenium selenium how to scroll the scrollbar step by step geckodriver-autoinstaller python+selenium+firefox (geckodriver) on CentOS

Wait events: expected conditions (expected_conditions) explained

about:config # view Firefox settings
about:flags # view Google Chrome settings

pyppeteer

Puppeteer

pyppeteer

Appium

splash

Docs splash FAQ (Chinese) docker-compose haproxy DOC

docker rm `docker ps -a | grep scrapinghub/splash | awk '{print $1}'`
sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash
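--max-timeout 300 # maximum allowed render timeout, in seconds
--slots 100 # number of concurrent render slots
--restart=always -d # restart automatically and run detached (docker run options, placed before the image name)
--disable-private-mode # Turn off private mode. With private mode off, different requests may share the same browser state. If you use Splash in a shared environment, data from your requests may affect requests sent by other users.
--disable-lua-sandbox # Disable the sandbox. By default, Splash scripts run in a restricted environment: not all standard Lua modules and functions are available, Lua require is restricted, and there are resource limits (quite loose ones).
 # To cap the memory Splash uses at 4 GB and also run it as a daemon that restarts after a crash, you can use the following command: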
docker run -d -p 8050:8050 --memory=4.5G --restart=always scrapinghub/splash --maxrss 4000

# pro
docker run -d -p 8065:8050 --restart=always scrapinghub/splash \
--max-timeout 300 --maxrss 2000 --slots 200 \
--disable-lua-sandbox
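
Once one of the containers above is running, pages are rendered through Splash's HTTP API. The following is a minimal sketch that only builds the request URL for the render.html endpoint. The helper name splash_render_url, the default address http://localhost:8050 (the 8050 host port used by most commands above, not the 8065 mapping of the last one) and the default wait and timeout values are assumptions for illustration.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash="http://localhost:8050", wait=2.0, timeout=60):
    """Build a render.html URL that asks Splash to render page_url.

    wait:    seconds Splash waits after the page loads so scripts can run.
    timeout: render timeout in seconds; it cannot exceed --max-timeout.
    """
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{splash}/render.html?{query}"


if __name__ == "__main__":
    # Fetching the printed URL with any HTTP client returns the rendered HTML.
    print(splash_render_url("https://example.com"))
```

Requesting a timeout above the value passed to --max-timeout is rejected by Splash, so the two numbers should be chosen together.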

====== Note: configure a size limit for container log files! ======

https://docs.docker.com/config/containers/logging/configure/
https://docs.docker.com/config/containers/logging/local/

/etc/docker/daemon.json
sudo systemctl start docker   # start docker
systemctl enable docker # start docker on boot

systemctl daemon-reload
systemctl restart docker

--log-opt max-size=10m --log-opt max-file=1 \
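
The same limits can be set once for all containers in /etc/docker/daemon.json instead of per docker run. A minimal sketch that generates such a file is shown here; the values simply mirror the --log-opt flags above, the json-file driver is assumed, and the helper name render_daemon_json is hypothetical.

```python
import json

# Assumed values, mirroring "--log-opt max-size=10m --log-opt max-file=1".
# Values under log-opts are written as strings, including the file count.
daemon_config = {
    "log-driver": "json-file",
    "log-opts": {"max-size": "10m", "max-file": "1"},
}


def render_daemon_json(config):
    """Serialize the daemon settings as the text of /etc/docker/daemon.json."""
    return json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    print(render_daemon_json(daemon_config), end="")
```

After writing the file, restart docker with the systemctl commands above. Daemon-level logging defaults apply to containers created afterwards, so existing containers need to be recreated to pick them up.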



-----

docker build -t splash-haproxy .

docker run -d --restart=always --name splash-haproxy \
-p 8080:8050 \
-p 8036:8036 splash-haproxy


docker swarm init --advertise-addr 134.73.227.2

docker service create --name splash-haproxy \
-p 8080:8050 \
-p 8036:8036 splash-haproxy

docker service scale splash-haproxy=3

docker service update --image splash-haproxy:latest splash-haproxy

------ cookiecutter docker-compose ------

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

sudo chmod +x /usr/local/bin/docker-compose

Scrapy deployment tools

Scrapyd scrapyd-client

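Scrapyd is the crawler management tool provided by the Scrapy project. It makes it easy to upload and control spiders and to view their run logs.

To keep a dropped SSH connection from killing remote processes, run each long-lived command inside Screen.

scrapyd notes

Change the default port (6800) to avoid being hijacked for crypto mining.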
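
The port is changed in Scrapyd's configuration file (for example a scrapyd.conf in the directory where scrapyd is started). Below is a minimal sketch that renders such a file; port 6868 is the one used by some commands in these notes, binding to 127.0.0.1 is an assumption that keeps the service off public interfaces, and render_scrapyd_conf is a hypothetical helper name.

```python
import configparser
import io


def render_scrapyd_conf(port=6868, bind_address="127.0.0.1"):
    """Return the text of a minimal scrapyd.conf with a non-default port."""
    conf = configparser.ConfigParser()
    conf["scrapyd"] = {
        "bind_address": bind_address,  # interface the web service listens on
        "http_port": str(port),        # replaces the default 6800
    }
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_scrapyd_conf(), end="")
```

Changing the port only hides the service from casual scans. Keeping it bound to localhost or behind a firewall is the stronger protection.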

Deployment issue: ModuleNotFoundError: No module named 'scrapy.utils.http'. scrapy.utils.http was replaced with w3lib.http.

pip install scrapyd   # install Scrapyd
scrapyd # start the service
# run in the background
(scrapyd > /dev/null 2>&1 &)

(python main.py > /dev/null 2>&1 &)
(scrapyd > /var/log/scrapyd.log &)
(uvicorn main:app --reload --host 0.0.0.0 > /www/wwwroot/google_search/logs/fastapi.log &)
pip freeze > requirements.txt

pip install scrapyd-client -i https://pypi.doubanio.com/simple/
pip install git+https://github.com/scrapy/scrapyd-client # install the client
scrapyd-deploy
scrapyd-deploy -a -p globalso_datacrawl   # deploy the project
scrapyd-deploy globalso_datacrawl -p globalso_datacrawl 

curl 'http://localhost:6868/listspiders.json?project=translate' # list the spiders in a project
curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider # schedule a spider run (also known as a job)
curl http://localhost:6800/cancel.json -d project=myproject -d job=237582f26ca811ea9ad8525400cf764c # cancel a spider run
curl http://localhost:6800/addversion.json -F project=myproject -F version=r23 -F egg=@myproject.egg  # add a version to a project, creating the project if it does not exist
curl http://localhost:6800/daemonstatus.json  # check the load status of the service
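
The schedule.json call above can also be made from Python using only the standard library. This is a minimal sketch, not an official client: the function names schedule_request and send are hypothetical, and the base URL, project and spider names are the placeholders used in the curl examples.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def schedule_request(project, spider, base="http://localhost:6800"):
    """Build the POST request for schedule.json (form-encoded body, like curl -d)."""
    body = urlencode({"project": project, "spider": spider}).encode()
    # Supplying data makes urllib send the request as a POST.
    return Request(f"{base}/schedule.json", data=body)


def send(request, timeout=10):
    """Send a request to a running Scrapyd instance and decode its JSON reply."""
    with urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With Scrapyd running, send(schedule_request("myproject", "somespider")) returns the decoded JSON reply as a dict. Building the request separately from sending it keeps the request easy to inspect without a live server.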

scrapydweb

scrapydweb
scrapydweb -ss 127.0.0.1:6969
(scrapydweb > /dev/null 2>&1 &)
(logparser > /dev/null 2>&1 &)
(logparser -ss 127.0.0.1:6868 -dir /www/wwwroot/googletranslate/logs > /dev/null 2>&1 &)

spiderkeeper

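SpiderKeeper mainly handles deploying Scrapy projects, monitoring crawl job status, and starting spiders on a schedule. It supports multiple scrapyd servers, which makes crawler clusters easier to manage.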

pip install spiderkeeper -i https://pypi.doubanio.com/simple/
pip freeze > requirements.txt
pipenv install spiderkeeper --pypi-mirror https://pypi.doubanio.com/simple/
spiderkeeper
(spiderkeeper > /dev/null &) 
spiderkeeper --server=http://localhost:6800  # scrapyd server http://localhost:6800

supervisor

Supervisor is a process management tool written in Python that makes it easy to start, restart, and stop processes (not only Python processes).

Scrapyd deployment
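
To run Scrapyd under supervisor, a program section is added to the supervisor configuration. The sketch below renders one such section; the program name, the log path (taken from the background command earlier in these notes) and the helper name render_supervisor_program are assumptions, and the options shown are only a small subset of what supervisor accepts.

```python
import configparser
import io


def render_supervisor_program(name="scrapyd", command="scrapyd",
                              logfile="/var/log/scrapyd.log"):
    """Return an INI [program:x] section that keeps a command running."""
    conf = configparser.ConfigParser()
    conf[f"program:{name}"] = {
        "command": command,
        "autostart": "true",        # start when supervisord starts
        "autorestart": "true",      # restart the process if it exits
        "redirect_stderr": "true",  # merge stderr into the stdout log
        "stdout_logfile": logfile,
    }
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_supervisor_program(), end="")
```

The rendered text can be saved as a file in the directory that supervisord includes configuration from, then loaded with supervisorctl reread followed by supervisorctl update.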

SQL

SQLAlchemy Alembic official docs migration tool alembic scrapy+sqlalchemy random user-agents

Migrating the database schema with alembic

scrapy-api

scrapy-api scrapyrt (real-time scraping) scrapyrt-doc Integrating Scrapy into Flask How to integrate Flask & Scrapy?

Packet capture tools

mitmproxy

mitmproxy upstream (second-hop) proxy

mitmweb --mode upstream:http://127.0.0.1:8001/ -s ./change_upstream_proxy.py    # upstream (second-hop) proxy

# change_upstream_proxy.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # IP address and port of the second-level (upstream) proxy
    proxy = ("127.0.0.1", 8001)
    if flow.live:
        flow.live.change_upstream_proxy_server(proxy)

squid

How to install the Squid proxy server on CentOS 7

yum install squid
systemctl start squid
systemctl enable squid
systemctl status squid
systemctl restart squid

vi /etc/squid/squid.conf