Ruby Web Crawler Notes

Installation

On CentOS, install Ruby and the MySQL database:
[cc lang="text"]
# yum install ruby ruby-irb mysql mysql-server ruby-mysql
[/cc]

Variables

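Global variables start with $;
Instance variables start with @;
Local variables have no prefix (they start with a lowercase letter or an underscore).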

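[cc lang="ruby"]
$global_variable = 10 # global variable
@cust_id = id # instance variable (id is assumed to be defined already, e.g. a method argument)
var = "hehe" # local variable
[/cc]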
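To make the three scopes concrete, here is a minimal sketch. The class name Customer, the method describe, and the counter $request_count are made up for illustration and do not come from the crawler itself.

```ruby
$request_count = 0            # global: visible from anywhere in the program

class Customer
  def initialize(id)
    @cust_id = id             # instance variable: each object keeps its own copy
  end

  def describe
    label = "customer"        # local variable: exists only inside this method
    $request_count += 1       # the global is shared by every object
    "#{label} #{@cust_id}"
  end
end

a = Customer.new(1)
b = Customer.new(2)
puts a.describe               # uses a's own @cust_id
puts b.describe               # uses b's own @cust_id
puts $request_count           # counts the calls made on both objects
```

Outside describe, the name label is undefined, which is the practical difference between a local variable and the other two kinds.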

Methods (functions)

[cc lang="ruby"]
def method_name [( [arg [= default]]...[, * arg [, &expr ]])]
  expr..
end
[/cc]

If a method takes no arguments, it can be called by its name alone.
[cc lang="ruby"]
method_name
[/cc]
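The syntax template above is compact, so a small sketch may help: one method with a required argument, a default value, a splat, and a block parameter, plus one method with no parameters. The names fetch_page and crawler_name are hypothetical.

```ruby
# path    : required argument
# retries : optional argument with a default value
# *tags   : collects any extra arguments into an array
# &on_done: captures an optional block as a Proc
def fetch_page(path, retries = 3, *tags, &on_done)
  summary = "#{path} retries=#{retries} tags=#{tags.join(',')}"
  on_done ? on_done.call(summary) : summary
end

# A method without parameters is called by its bare name.
def crawler_name
  "fetch_kw"
end

puts crawler_name
puts fetch_page("/index.htm")
puts fetch_page("/index.htm", 5, "news", "ruby")
puts fetch_page("/a") { |s| s.upcase }
```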

Socket

[cc lang="ruby"]
require 'socket' # socket is part of the standard library

hostname = 'localhost'
port = 2000

s = TCPSocket.open(hostname, port)

while line = s.gets # read the socket line by line
  puts line.chop # print each line to the terminal
end
s.close # close the socket
[/cc]
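The client above needs something listening on the port. As a rough sketch for local testing, the following starts a throwaway TCPServer in a thread and reads from it with the same gets loop. Binding to port 0 (so the OS picks a free port) and the two sample lines are assumptions made for this example.

```ruby
require 'socket'

server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS choose a free port
port = server.addr[1]                    # the port that was actually assigned

# Serve exactly one client, send two lines, then hang up.
server_thread = Thread.new do
  client = server.accept
  client.puts "hello"
  client.puts "bye"
  client.close
end

received = []
s = TCPSocket.open('127.0.0.1', port)
while (line = s.gets)        # gets returns nil once the server closes the connection
  received << line.chomp
end
s.close

server_thread.join
server.close
puts received.inspect
```

The loop ends because gets returns nil at end of stream, which is why the client needs no explicit length or terminator.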

HTTP example
[cc lang="ruby"]
require 'socket'

host = 'www.w3cschool.cc' # web server
port = 80 # default HTTP port
path = "/index.htm" # the file we want to fetch

# This is the HTTP request
request = "GET #{path} HTTP/1.0\r\n\r\n"

socket = TCPSocket.open(host, port) # connect to the server
socket.print(request) # send the request
response = socket.read # read the complete response
socket.close # close the connection
# Split the response at the first blank line into headers and body
headers, body = response.split("\r\n\r\n", 2)
print body # print the result
[/cc]
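The example above discards the headers. A crawler usually wants at least the status code, so here is a minimal sketch that splits a raw response into status, headers, and body. The helper name parse_http_response and the sample response string are hypothetical, and the code assumes a well-formed response with no chunked encoding.

```ruby
# Split a raw HTTP response into its parts.
def parse_http_response(raw)
  head, body = raw.split("\r\n\r\n", 2)             # first blank line ends the headers
  status_line, *header_lines = head.split("\r\n")
  _version, code, reason = status_line.split(" ", 3)
  headers = header_lines.each_with_object({}) do |line, h|
    name, value = line.split(":", 2)
    h[name.strip.downcase] = value.to_s.strip       # header names are case-insensitive
  end
  { code: code.to_i, reason: reason, headers: headers, body: body.to_s }
end

sample = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
res = parse_http_response(sample)
puts res[:code]
puts res[:headers]["content-type"]
puts res[:body]
```

For anything beyond a quick experiment, the standard library's net/http handles redirects, chunked bodies, and the Host header, which many servers require even though HTTP/1.0 does not.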

Regular expressions

Example
[cc lang="ruby"]
#!/usr/bin/ruby

line1 = "Cats are smarter than dogs"
line2 = "Dogs also like meat"

if line1 =~ /Cats(.*)/
  puts "Line1 contains Cats"
end
if line2 =~ /Cats(.*)/
  puts "Line2 contains Cats" # never printed: line2 has no "Cats"
end
[/cc]
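The =~ operator does more than answer yes or no. As a small sketch of what it returns and how to reach the captured text, using the same sample sentences:

```ruby
line = "Cats are smarter than dogs"

pos  = (line =~ /Cats(.*)/)   # index where the match starts, or nil if no match
rest = $1                      # text captured by the first group of the last match

# String#match returns a MatchData object (or nil), indexed by group number.
m = line.match(/(\w+) are (\w+)/)

puts pos
puts rest
puts m[1]
puts m[2]
puts "no match" if ("Dogs also like meat" =~ /Cats/).nil?
```

Because a failed match returns nil, both =~ and match can be used directly as an if condition.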

[cc lang="ruby"]
def content_handle(kw, content, db)
  # Put kw into the database
  db_result = db.query("INSERT INTO #{KW_TBL_NAME}(keyword) VALUES(\"#{kw}\")")

  # Get more keywords.
  # The opening <div ...> part of this pattern is not preserved in these notes;
  # the .match(content) call is assumed from how result_div is used below.
  result_div = /(.*?)<\/div>/m.match(content)

  return unless result_div.respond_to?("[]")
  result_kw = result_div[1].scan(/(.*?)<\/a>/m) # Match keywords
  # Put keywords into to_visit.
  if result_kw.respond_to?("each") and @to_visit.length <= MAX_TO_VISIT
    result_kw.each do |rkw|
      @mutex.lock
      @to_visit << rkw
      @mutex.unlock
      puts "Got kw: #{rkw}\n"
    end
  end
end
[/cc]
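The extraction step can be tried on its own, without a database or network. The sketch below is only an illustration: the markup, the id "related", and the helper name extract_keywords are all hypothetical and are not the real Baidu page structure.

```ruby
# Hypothetical page fragment: a block of related-search links plus one unrelated link.
html = <<~HTML
  <div id="related">
    <a href="/s?wd=ruby+socket">ruby socket</a>
    <a href="/s?wd=ruby+thread">ruby thread</a>
  </div>
  <a href="/other">unrelated</a>
HTML

def extract_keywords(content)
  # Step 1: isolate the block; /m lets . match newlines, *? stops at the first </div>.
  block = %r{<div id="related">(.*?)</div>}m.match(content)
  return [] if block.nil?
  # Step 2: scan with one capture group returns [["a"], ["b"]], so flatten it.
  block[1].scan(%r{<a[^>]*>(.*?)</a>}m).flatten.map(&:strip)
end

keywords = extract_keywords(html)
puts keywords.inspect
```

Narrowing to the block first keeps links elsewhere on the page out of the result. Regular expressions are fragile on real HTML, so a parser would be the safer choice for anything long-lived.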

Multithreading

[cc lang="ruby"]
# Start five threads that each run fetch, then wait for all of them
t1 = Thread.new{fetch()}
t2 = Thread.new{fetch()}
t3 = Thread.new{fetch()}
t4 = Thread.new{fetch()}
t5 = Thread.new{fetch()}
t1.join
t2.join
t3.join
t4.join
t5.join
[/cc]
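One way to share work between such threads is a thread-safe Queue plus a Mutex around the shared result list. This is a sketch under assumptions, not the crawler's actual code: the fetch method here takes a keyword and is a stand-in that does no network I/O, and the keyword list and worker count are invented.

```ruby
require 'thread'   # harmless on modern Ruby, needed for Queue on very old versions

to_visit = Queue.new
%w[ruby socket mysql regex thread].each { |kw| to_visit << kw }

visited = []
mutex = Mutex.new

# Stand-in for the real network request (hypothetical).
def fetch(keyword)
  "result for #{keyword}"
end

workers = Array.new(3) do
  Thread.new do
    loop do
      kw = begin
        to_visit.pop(true)        # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break                     # queue drained: this worker is done
      end
      result = fetch(kw)
      mutex.synchronize { visited << result }   # lock is released even if the block raises
    end
  end
end

workers.each(&:join)
puts visited.sort.inspect
```

Compared with explicit lock and unlock calls, synchronize cannot leave the mutex held when an exception occurs between the two.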

Crawler example

https://github.com/feichashao/fetch_kw
Fetches Baidu search results and keywords.

References

http://rubylearning.com/satishtalim/ruby_socket_programming.html
http://www.w3cschool.cc/ruby/ruby-tutorial.html