Method 1: Host your domain's DNS on Cloudflare
Cloudflare offers one-click blocking of AI crawlers, as well as blocking visitors by country or region. If Cloudflare's dashboard is not reachable from where you are, you will need your own proxy to manage it. Some people claim that putting a mainland-China site behind Cloudflare hurts its speed; in practice the impact is small.
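If you would rather write an explicit rule than rely on the one-click toggle, a WAF custom rule keyed on the user agent also works. A minimal sketch of such a rule expression, with the action set to Block (the bot names here are only examples; extend the list to the crawlers you want to keep out):

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Amazonbot")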
Method 2: Block crawlers with the Baota panel firewall
Add the following crawler user agents to the UA blacklist of the panel's Nginx firewall (a plain-nginx equivalent is sketched after the list):
Amazonbot
ClaudeBot
PetalBot
gptbot
Ahrefs
Semrush
Imagesift
Teoma
ia_archiver
twiceler
MSNBot
Scrubby
Robozilla
Gigabot
yahoo-mmcrawler
yahoo-blogs/v3.9
psbot
Scrapy
SemrushBot
AhrefsBot
Applebot
AspiegelBot
DotBot
DataForSeoBot
java
MJ12bot
python
seo
Censys
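If the site is not behind the Baota panel, roughly the same effect can be had in plain nginx with an `if` test on `$http_user_agent` inside the site's server block. A minimal sketch covering most of the UAs above (the regex is illustrative, not exhaustive; adjust the entries to taste):

# Return 403 for the crawler user agents listed above (case-insensitive match)
if ($http_user_agent ~* (Amazonbot|ClaudeBot|PetalBot|GPTBot|AhrefsBot|SemrushBot|Imagesift|Teoma|ia_archiver|twiceler|MSNBot|Scrubby|Robozilla|Gigabot|psbot|Scrapy|Applebot|AspiegelBot|DotBot|DataForSeoBot|MJ12bot|Censys)) {
    return 403;
}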
Method 3: Copy the content below, save it as robots.txt, and upload it to the site's root directory. Note that robots.txt is only advisory: it stops crawlers that choose to honor it, not those that ignore it.
User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: googlebot-image
Disallow: /

User-agent: googlebot-mobile
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: sogou spider
Disallow: /

User-agent: YodaoBot
Disallow: /

User-agent: Ahrefs
Disallow: /

User-agent: Semrush
Disallow: /

User-agent: Imagesift
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: gptbot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: Slurp
Disallow: /

User-agent: Teoma
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: twiceler
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: Scrubby
Disallow: /

User-agent: Robozilla
Disallow: /

User-agent: Gigabot
Disallow: /

User-agent: yahoo-mmcrawler
Disallow: /

User-agent: yahoo-blogs/v3.9
Disallow: /

User-agent: psbot
Disallow: /

User-agent: dotbot
Disallow: /
Method 4: Keep the site from being scraped
Add the following to the site's nginx configuration file:
# Block scraping by Scrapy and similar tools
if ($http_user_agent ~* (Scrapy|Curl|HttpClient|crawl|curb|git|Wtrace)) {
    return 403;
}
# Block requests with the listed UAs or an empty UA
if ($http_user_agent ~* "CheckMarkNetwork|Synapse|Nimbostratus-Bot|Dark|scraper|LMAO|Hakai|Gemini|Wappalyzer|masscan|crawler4j|Mappy|Center|eright|aiohttp|MauiBot|Crawler|researchscan|Dispatch|AlphaBot|Census|ips-agent|NetcraftSurveyAgent|ToutiaoSpider|EasyHttp|Iframely|sysscan|fasthttp|muhstik|DeuSu|mstshash|HTTP_Request|ExtLinksBot|package|SafeDNSBot|CPython|SiteExplorer|SSH|MegaIndex|BUbiNG|CCBot|NetTrack|Digincore|aiHitBot|SurdotlyBot|null|SemrushBot|Test|Copied|ltx71|Nmap|DotBot|AdsBot|InetURL|Pcore-HTTP|PocketParser|Wotbox|newspaper|DnyzBot|redback|PiplBot|SMTBot|WinHTTP|Auto Spider 1.0|GrabNet|TurnitinBot|Go-Ahead-Got-It|Download Demon|Go!Zilla|GetWeb!|GetRight|libwww-perl|Cliqzbot|MailChimp|SMTBot|Dataprovider|XoviBot|linkdexbot|SeznamBot|Qwantify|spbot|evc-batch|zgrab|Go-http-client|FeedDemon|Jullo|Feedly|YandexBot|oBot|FlightDeckReports|Linguee Bot|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|EasouSpider|LinkpadBot|Ezooms|^$" ) {
    return 403;
}
# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
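These `if` blocks normally go inside the site's `server {}` context. After saving, check the syntax with `nginx -t` and reload nginx (for example `nginx -s reload`, or restart it from the Baota panel) so the rules take effect.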