For collecting and analyzing data.
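【Note】Everything shared here is what I have learned from several professional books. It may include some tips, insights, and small lessons from my own use, but it falls far short of the rich detail in the books themselves. While studying, I found that publicly searchable resources on web scraping with R are not as plentiful as those on other topics, so I am sharing a little here for those who come after. For the same reason I make no claim of originality for this material. Friends are welcome to discuss, learn together, and offer criticism and corrections, so that we can all improve. EMAIL:[email protected]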
1.WHY R?
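Even non-specialists have probably heard that R currently lags well behind other software for web scraping. R is not a specialized tool for the task, and simple applications such as Octoparse (八爪魚) can fully satisfy "occasional users" like us. So why scrape with R? I believe everyone who comes searching for R scraping tips has their own answer.
A few advantages worth noting:
#1. R is a software environment with a primarily statistical focus.
#2. It can produce impressive visualizations.
#3. It may offer a complete set of operational procedures within one environment.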
2.About basics.
We need to throw ourselves into the preparation with some basic knowledge of HTML, XML, and the logic of regular expressions and XPath, BUT the operations are executed from WITHIN R!
3.RECOMMENDATION
http://www.r-datacollection.com
4.A little case study.
# Scrape movie box-office information
library(stringr)
library(XML)
library(maps)
# htmlParse() interprets the HTML and creates a parsed document object
movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004", encoding = "UTF-8")
# Next step: extract the tables/data
# readHTMLTable() identifies and reads out the tables in the page
tables <- readHTMLTable(movie_parsed, stringsAsFactors = FALSE)
is.matrix(tables)
is.character(tables)
is.data.frame(tables)
is.list(tables)
# so what we get back is a "list"
Because R's support for Chinese is not very good, running into some garbled Chinese text is normal, so we need more advanced text manipulation tools. (In this example, some columns are lost entirely because the site serves the data in those columns as .png images.)
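One common source of garbled Chinese text is a string whose bytes are fine but whose encoding R does not know. The sketch below is only an illustration under that assumption, using base R functions; the variable names are made up, and the bytes are the UTF-8 encoding of the two characters U+7968 U+623F ("box office"). When the source is in a different encoding, base R's iconv() is the usual conversion tool.

```r
# UTF-8 bytes for the two characters U+7968 U+623F ("box office").
utf8_bytes <- as.raw(c(0xe7, 0xa5, 0xa8, 0xe6, 0x88, 0xbf))

# rawToChar() yields a string with no declared encoding; depending on the
# locale it may print as garbled text.
box_office <- rawToChar(utf8_bytes)

# Declare the encoding so that R treats the bytes as UTF-8 characters.
Encoding(box_office) <- "UTF-8"

validUTF8(box_office)              # are the bytes well-formed UTF-8?
nchar(box_office, type = "chars")  # number of characters
nchar(box_office, type = "bytes")  # number of bytes
utf8ToInt(box_office)              # Unicode code points of the characters
```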
5.ABC's of...
For browsing the Web, there is a hidden standard behind the scenes that structures how information is displayed.
#HTML, the HyperText Markup Language
HTML is not a dedicated data storage format, but it usually contains the information we are after. In general, HTML is used to shape how information is displayed.
#XML, the Extensible Markup Language
The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
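To make the point concrete, here is a minimal sketch of pulling information out of HTML with the XML package used in the case study. The page is a made-up string (the film names and numbers are invented for illustration), so no network access is needed.

```r
library(XML)

# A tiny, made-up HTML page held in a string.
html_text <- '<html><body>
  <h1>Box office</h1>
  <table>
    <tr><th>Film</th><th>Tickets</th></tr>
    <tr><td>Film A</td><td>120</td></tr>
    <tr><td>Film B</td><td>85</td></tr>
  </table>
</body></html>'

# asText = TRUE tells htmlParse() that the input is HTML content, not a file or URL.
page <- htmlParse(html_text, asText = TRUE)

# XPath queries pick out nodes; xmlValue() returns the text inside each node.
heading <- xpathSApply(page, "//h1", xmlValue)
films   <- xpathSApply(page, "//tr/td[1]", xmlValue)
tickets <- as.numeric(xpathSApply(page, "//tr/td[2]", xmlValue))
```

The tags here (h1, table, tr, td) describe layout, which is why the data has to be located by its position in the page structure.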
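As a rough illustration of "data wrapped in user-defined tags", the sketch below parses a small invented XML document with the XML package and collects its contents into a data frame. The tag names (films, film, title, tickets) and the values are assumptions made up for this example.

```r
library(XML)

# A made-up XML document: the tag names describe the data, not its layout.
xml_text <- '<films>
  <film id="1"><title>Film A</title><tickets>120</tickets></film>
  <film id="2"><title>Film B</title><tickets>85</tickets></film>
</films>'

doc <- xmlParse(xml_text, asText = TRUE)

# The hierarchy can be queried with XPath, element by element.
ids     <- xpathSApply(doc, "//film", xmlGetAttr, "id")
titles  <- xpathSApply(doc, "//film/title", xmlValue)
tickets <- as.numeric(xpathSApply(doc, "//film/tickets", xmlValue))

film_data <- data.frame(id = ids, title = titles, tickets = tickets,
                        stringsAsFactors = FALSE)
```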
(unfinished......)
#JSON or JavaScript Object Notation
A lightweight data-interchange format based on the JavaScript language.
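A minimal sketch of reading JSON from within R follows. It assumes the jsonlite package, which is one of several JSON parsers for R and is not mentioned in the original text; the JSON string itself is invented for illustration.

```r
library(jsonlite)  # assumption: jsonlite is installed; other JSON parsers exist

# A made-up JSON array of objects.
json_text <- '[{"title": "Film A", "tickets": 120},
               {"title": "Film B", "tickets": 85}]'

# fromJSON() simplifies an array of similar objects into a data frame.
films <- fromJSON(json_text)

# toJSON() goes the other way, from an R object back to JSON text.
json_again <- toJSON(films)
```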
#AJAX or "Asynchronous JavaScript and XML"
____________________________________
HTTP       | R                   |
XML/HTML   | XPath               |
JSON       | JSON parsers        |
AJAX       | Selenium            |
Plain text | Regular expressions |
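For the last row of the table, plain text handled with regular expressions, a small sketch using the stringr package loaded in the case study might look like this. The input string is made up for illustration.

```r
library(stringr)

# A made-up line of plain text with the values buried in it.
raw_text <- "Film A: 120 tickets; Film B: 85 tickets; Film C: n/a"

# Every run of digits, converted to numbers.
tickets <- as.numeric(unlist(str_extract_all(raw_text, "[0-9]+")))

# Every film label, whether or not a number follows it.
titles <- unlist(str_extract_all(raw_text, "Film [A-Z]"))

# Capture groups keep each title paired with its number;
# column 1 is the full match, columns 2 and 3 are the groups.
pairs <- str_match_all(raw_text, "(Film [A-Z]): ([0-9]+)")[[1]]
```

Only the films with a numeric value appear in pairs, which is often what is wanted when the text contains placeholders such as "n/a".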