User-agent matching in the robots.txt protocol

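I previously wrote a post about the robots.txt protocol (http://hi.baidu.com/wuzsh/blog/item/cef1fc03f6ff54723912bbbe.html). Today a webmaster complained that our spider does not obey robots.txt. I downloaded their robots.txt and tested it, and sure enough, our old robots.txt code had a bug: it did not handle User-agent matching correctly. So I went back and reread the spec. Here is the original text (quoted from http://www.robotstxt.org/norobots-rfc.txt):
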
These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.

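To summarize the above, the matching rules in the robots.txt protocol are:

1. Exact match comes first ("exact match" means matching the robot/spider's name; per the text above, this is a case-insensitive substring comparison).
2. Fuzzy match comes next, that is, matching a record whose User-agent is '*'.
3. If a record matches the robot by name, '*' is not matched at all.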
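
The three rules above can be sketched in Python roughly as follows. This is only a minimal illustration, not the code our spider actually runs: the names `parse_records`, `select_record`, `EXAMPLE` and the robot name `ExampleSpider` are all made up for this example, and the parser assumes, as a simplification, that a new record begins whenever a User-agent line follows a rule line (blank lines are simply skipped).

```python
from typing import List, Optional, Tuple

# A record is (User-agent values, raw rule lines). Hypothetical type for this sketch.
Record = Tuple[List[str], List[str]]


def parse_records(text: str) -> List[Record]:
    """Group robots.txt lines into records (simplified: no blank-line handling)."""
    records: List[Record] = []
    agents: List[str] = []
    rules: List[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if rules:  # a User-agent line after rules starts a new record
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif agents:
            rules.append(f"{field}: {value}")
    if agents:
        records.append((agents, rules))
    return records


def select_record(records: List[Record], robot_name: str) -> Optional[Record]:
    """Pick the record a robot must obey; None means access is unlimited."""
    name = robot_name.lower()
    # Rules 1 and 3: the first record naming the robot wins, and '*' is then ignored.
    for agents, rules in records:
        if any(a != "*" and name in a.lower() for a in agents):
            return agents, rules
    # Rule 2: otherwise fall back to the first '*' record, if there is one.
    for agents, rules in records:
        if "*" in agents:
            return agents, rules
    return None


EXAMPLE = """
User-agent: *
Disallow: /

User-agent: ExampleSpider
Disallow: /private/
"""
```

With `EXAMPLE`, a robot named `examplespider` gets the second record even though the '*' record appears first in the file, while any other robot falls back to the '*' record. Treating the first pass and the '*' pass as two separate loops is what keeps a '*' record placed early in the file from shadowing a later record that names the robot.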
