Blocking junk and useless spiders with nginx | chaihongjun.me (柴宏俊 web technology study notes)

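There are two ways to stop spiders from crawling a site. One is the Robots protocol, which is a gentlemen's agreement and has no binding force. The other is to deny the requests on the server side.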


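Robots protocol: the Robots protocol (also called the crawler protocol or robot protocol) is formally the "Robots Exclusion Protocol". A website uses it to tell search engines which pages may be crawled and which may not. For example, the following rule asks Googlebot not to crawl any page: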

User-agent: Googlebot
Disallow: /

Spiders are not obliged to obey Robots rules, so rogue spiders have to be handled with the second method.

UA blocking:

The server environment used here as an example is CentOS + nginx. The per-domain nginx configuration file looks like this:

server {
    # ... other configuration
    location / {
        # Added: include the file that denies access to certain UAs
        include agent_deny.conf;
    }
}

The contents of agent_deny.conf, which blocks crawling by tools such as Scrapy, are as follows:

location / {
    # ... other configuration
    if ($http_user_agent ~* "Applebot|SEOkicks-Robot|DotBot|YunGuanCe|Exabot|spiderman|Scrapy|HttpClient|Teleport|TeleportPro|SiteExplorer|WBSearchBot|Elefent|psbot|TurnitinBot|wsAnalyzer|ichiro|ezooms|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$"){
          return 403;
     }     
}
 
# Block requests that use any method other than GET, HEAD, or POST
# Whether to enable this depends on your site
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
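
As an alternative to testing $request_method with if, nginx also offers the limit_except directive for restricting methods inside a location. The following is only a minimal sketch of that substitute technique, assuming it is placed in the same location / block; it is not the configuration used above. Allowing GET in limit_except also allows HEAD, and requests denied by "deny all" receive a 403 response.

```nginx
location / {
    # Allow only GET (which implies HEAD) and POST.
    # Any other method is denied with 403.
    limit_except GET POST {
        deny all;
    }
}
```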

To block other User-agents, add them to the regular expression above, separated by |.
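
When the list grows long, one possible way to keep it readable is nginx's map directive, which puts each pattern on its own line. The sketch below is an illustration rather than the setup described above: the variable name $deny_ua, the server_name example.com, and the shortened list of UAs are all assumptions, and the map block has to live in the http context, not inside server or location.

```nginx
# In the http context (for example in nginx.conf), outside any server block.
map $http_user_agent $deny_ua {
    default            0;   # allow by default
    ""                 1;   # empty or missing User-Agent
    ~*Scrapy           1;   # ~* means case-insensitive regex match
    ~*AhrefsBot        1;
    ~*MJ12bot          1;
    "~*Indy Library"   1;   # quote patterns that contain spaces
}

server {
    listen      80;
    server_name example.com;   # placeholder domain (assumption)

    location / {
        # $deny_ua is hypothetical; it is "1" for any UA matched above.
        if ($deny_ua) {
            return 403;
        }
    }
}
```

With this layout, blocking another User-agent means adding one more line to the map block instead of editing a single long regular expression.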