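<div id="cnblogs_post_body" class="blogpost-body"><h3><strong>What is robots.txt?</strong></h3>
<p>robots.txt is a plain-text file, normally placed in the root directory of a website, and it is the first file a well-behaved crawler checks before crawling the site. It defines the restrictions that apply to crawlers: which parts of the site they may crawl and which they may not. Compliance is voluntary, so it keeps honest crawlers out but does not stop malicious ones.</p>
<p>For more information on the robots.txt protocol, see www.robotstxt.org</p>
<p>Checking robots.txt before crawling a site minimizes the chance of your crawler being banned.</p>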
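<p>Because the file sits in the site root, a crawler can derive its location from any page URL on the same site. The following is a minimal sketch of that step; the helper name robots_url is hypothetical and not part of any library.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.baidu.com/s?wd=robots"))
```

<p>A crawler could then hand this URL to Python's urllib.robotparser.RobotFileParser (through set_url and read) before requesting any other page.</p>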
<p>Below is part of Baidu's robots.txt: https://www.baidu.com/robots.txt</p>
<div class="cnblogs_code">
<pre>User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /</pre>
</div>
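<p>To see how a crawler might apply rules like these, here is a small sketch using Python's standard urllib.robotparser module. The rule text is an abbreviated copy of the excerpt (not the live file), and the helper allowed and the crawler name MyCrawler are made up for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the Baidu excerpt; an assumption for the demo,
# not the complete live file.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` on this site?"""
    return parser.can_fetch(agent, "https://www.baidu.com" + path)


print(allowed("Baiduspider", "/index.html"))  # named group, no matching rule
print(allowed("Baiduspider", "/baidu"))       # matches Disallow: /baidu
print(allowed("MyCrawler", "/index.html"))    # falls through to User-agent: *
```

<p>The three calls should print True, False, and False: a crawler that is not named in any group falls under the final wildcard group and is shut out of the whole site.</p>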
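<p><span style="font-size: 15px;"><strong>Meaning of the directives in robots.txt:</strong></span></p>
<p>1. User-agent: names the search engine spider that the rules below it apply to. If robots.txt contains several User-agent records, several robots are bound by its rules, so the file needs at least one User-agent record. A value of * (wildcard) applies to every robot that is not matched by a more specific record, which is why the "User-agent: *" record at the end of the Baidu excerpt shuts out only the crawlers not named earlier. A robots.txt file should contain only one "User-agent: *" record.</p>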
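<p>2. Disallow: a path prefix that crawlers must not visit</p>
<p>For example, Disallow: /home/news/data/ means a crawler may not visit any URL under /home/news/data/, but may still visit /home/news/data123.</p>
<p>Disallow: /home/news/data means a crawler may not visit /home/news/data123, /home/news/datadasf, or any other URL whose path begins with /home/news/data, including everything under /home/news/data/.</p>
<p>Both rules are prefix matches: the former blocks exactly one directory (precise blocking), while the latter blocks every path that shares the prefix (relative blocking).</p>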
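<p>The difference the trailing slash makes can be checked with a plain prefix comparison. This is only a sketch: the function name is_disallowed is hypothetical, and it ignores the * and $ wildcards that some crawlers also support.</p>

```python
def is_disallowed(path: str, rule: str) -> bool:
    """Plain prefix match; an empty Disallow value blocks nothing."""
    return bool(rule) and path.startswith(rule)


paths = ("/home/news/data/a.html", "/home/news/data123")
for rule in ("/home/news/data/", "/home/news/data"):
    for path in paths:
        # Show which paths each form of the rule would block.
        print(f"Disallow: {rule:<17} {path:<23} blocked={is_disallowed(path, rule)}")
```

<p>With the trailing slash only /home/news/data/a.html is blocked; without it, both paths are.</p>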
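<p>3. Allow: a path prefix that crawlers may visit</p>
<p>For example, suppose /home/ contains several paths such as news, video, and image, and the file has Disallow: /home/</p>
<p>Adding Allow: /home/news then forbids everything under /home/ except paths that begin with /home/news.</p>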
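<p>How overlapping Allow and Disallow rules are resolved depends on the crawler. RFC 9309 says the longest matching path wins, with Allow preferred on a tie, whereas Python's urllib.robotparser applies the first matching rule in file order. The sketch below implements the longest-match variant; the function name is_allowed and the rule list format are assumptions made for this example.</p>

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("Disallow", "/home/")


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Longest matching prefix wins; Allow wins ties; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules have no effect
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/home/"), ("Allow", "/home/news")]
for path in ("/home/news/today.html", "/home/video/1.mp4", "/about"):
    print(path, is_allowed(path, rules))
```

<p>Under this rule, /home/news/today.html is allowed because Allow: /home/news is the longer match, /home/video/1.mp4 is blocked by Disallow: /home/, and /about is allowed because no rule matches it.</p>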
</div>