Ruby Crawler Notes – 肥叉烧 feichashao.com

Installation

On CentOS, install Ruby and the MySQL database.

# yum install ruby ruby-irb mysql mysql-server ruby-mysql

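Variables

Global variables start with $.
Instance variables start with @.
Local variables take no prefix.

$global_variable = 10 # global variable
@cust_id = id # instance variable
var = "hehe" # local variable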
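
The difference between the three prefixes is mostly one of scope. A minimal sketch, assuming a hypothetical Customer class and helper methods that exist only for this illustration:

```ruby
$request_count = 0 # global: visible everywhere in the program

class Customer
  def initialize(id)
    @cust_id = id # instance variable: belongs to this one object
  end

  def touch
    $request_count += 1 # a global can be read and written from any method
    @cust_id            # an instance variable is visible in every method of the object
  end
end

def local_visible?
  var = "hehe" # local: exists only inside this method
  defined?(var) ? true : false
end

def outer_local_visible?
  defined?(var) ? true : false # a different method cannot see that `var`
end
```

Calling Customer.new(42).touch returns 42 and increments $request_count, while outer_local_visible? returns false because the local var never leaves the method that defined it.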

Methods (functions)

def method_name [( [arg [= default]]...[, * arg [, &expr ]])]
  expr..
end

If a method takes no arguments, it can be called by its name alone.

method_name
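
The syntax template above packs in three features: default values, a splat (*arg) and a block parameter (&expr). A small sketch that uses all three; the names greet and answer are made up for this example:

```ruby
# greeting has a default value, *names collects any extra arguments,
# and &formatter captures an optional block.
def greet(greeting = "Hello", *names, &formatter)
  names = ["world"] if names.empty?
  lines = names.map { |name| "#{greeting}, #{name}!" }
  formatter ? lines.map(&formatter) : lines
end

# A method without parameters; parentheses are optional when calling it.
def answer
  42
end
```

Here greet returns ["Hello, world!"], greet("Hi", "Ann", "Bob") returns ["Hi, Ann!", "Hi, Bob!"], and greet("Hi", "Ann") { |s| s.upcase } returns ["HI, ANN!"].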

Socket

require 'socket' # socket is part of the standard library

hostname = 'localhost'
port = 2000

s = TCPSocket.open(hostname, port)

while (line = s.gets) # read the socket one line at a time
  puts line.chomp # print to the terminal without the trailing newline
end
s.close # close the socket
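
The client above needs something listening on the port before it can read anything. A rough sketch of a matching server built on TCPServer, plus the same read loop wrapped in a method; serve_once and read_lines are hypothetical helper names, not part of the original notes:

```ruby
require 'socket'

# Serve a fixed list of lines to exactly one client.
# Returns the thread that handles the connection.
def serve_once(server, lines)
  Thread.new do
    client = server.accept             # block until a client connects
    lines.each { |l| client.puts l }   # send one line per puts
    client.close                       # closing makes the client's gets return nil
  end
end

# Read every line from host:port, using the same loop as the client.
def read_lines(host, port)
  socket = TCPSocket.open(host, port)
  received = []
  while (line = socket.gets)
    received << line.chomp
  end
  received
ensure
  socket.close if socket
end
```

To try it, create the listener with TCPServer.new('127.0.0.1', 2000), pass it to serve_once, then call read_lines('127.0.0.1', 2000) from the same script or from another process.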

HTTP example

require 'socket'

host = 'www.w3cschool.cc' # web server
port = 80 # default HTTP port
path = "/index.htm" # path of the file to fetch

# This is the HTTP request
request = "GET #{path} HTTP/1.0\r\n\r\n"

socket = TCPSocket.open(host, port) # connect to the server
socket.print(request) # send the request
response = socket.read # read the complete response
socket.close
# Split response at first blank line into headers and body
headers, body = response.split("\r\n\r\n", 2)
print body # print the result
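
The split on the first blank line can be taken one step further to pull out the status code and the individual headers. A minimal sketch that works on a raw response string, so it needs no network; parse_http_response is a hypothetical helper and ignores details such as repeated or folded headers:

```ruby
# Split a raw HTTP response into status code, header hash, and body.
def parse_http_response(raw)
  head, body = raw.split("\r\n\r\n", 2)          # limit 2 keeps blank lines inside the body
  status_line, *header_lines = head.split("\r\n")
  status = status_line.split(" ", 3)[1].to_i     # "HTTP/1.0 200 OK" -> 200
  headers = {}
  header_lines.each do |line|
    name, value = line.split(":", 2)
    headers[name.strip.downcase] = value.to_s.strip
  end
  [status, headers, body.to_s]
end
```

For example, status, headers, body = parse_http_response(response) gives an integer status that can be checked before the body is processed.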

Regular expressions

Example

#!/usr/bin/ruby

line1 = "Cats are smarter than dogs"
line2 = "Dogs also like meat"

if line1 =~ /Cats(.*)/
  puts "Line1 contains Cats"
end
if line2 =~ /Dogs(.*)/
  puts "Line2 contains Dogs"
end
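
The =~ operator only reports where a match starts (or nil when there is none); the text captured by (.*) has to be read separately. A short sketch of two ways to get at captures, using made-up helper names:

```ruby
# Return the text captured by (.*) after "Cats", or nil when there is no match.
def rest_after_cats(line)
  match = /Cats(.*)/.match(line) # MatchData on success, nil otherwise
  match && match[1].strip
end

# scan returns every match; with one capture group each element is a
# one-item array, so flatten turns the result into plain strings.
def words_starting_with_d(line)
  line.scan(/\b([Dd]\w*)/).flatten
end
```

rest_after_cats("Cats are smarter than dogs") returns "are smarter than dogs", and words_starting_with_d("Dogs also like meat") returns ["Dogs"].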

# Store a keyword, then extract related keywords from the fetched page.
def content_handle(kw, content, db)
  # Put kw into the database.
  # NOTE: kw is interpolated unescaped; escape it before using untrusted input.
  db.query("INSERT INTO #{KW_TBL_NAME}(keyword) VALUES('#{kw}')")

  # Get more keywords: match <div id="rs">.
  result_div = /<div id="rs">(.*?)<\/div><div id=/m.match(content)
  return unless result_div # no match, nothing to extract
  # scan returns one array per match because of the capture group, so flatten it.
  result_kw = result_div[1].scan(/<a.*?>(.*?)<\/a>/m).flatten

  # Put the keywords into to_visit.
  if @to_visit.length <= MAX_TO_VISIT
    result_kw.each do |rkw|
      @mutex.synchronize { @to_visit << rkw } # guard the shared list
      puts "Got kw: #{rkw}"
    end
  end
end
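
The two regular expressions in content_handle can be tried without a database or a live page. A sketch under the assumption that the page looks like the SAMPLE_HTML below; that markup and the name extract_related_keywords are invented for this illustration:

```ruby
# Return the link texts found inside <div id="rs">, or [] if the div is absent.
def extract_related_keywords(content)
  result_div = /<div id="rs">(.*?)<\/div><div id=/m.match(content)
  return [] unless result_div
  result_div[1].scan(/<a.*?>(.*?)<\/a>/m).flatten.map(&:strip)
end

# Hypothetical page fragment shaped like the markup the regex expects.
SAMPLE_HTML = <<~HTML
  <div id="rs">
    <a href="/s?wd=ruby+socket">ruby socket</a>
    <a href="/s?wd=ruby+regex">ruby regex</a>
  </div><div id="foot"></div>
HTML
```

With this sample, extract_related_keywords(SAMPLE_HTML) returns ["ruby socket", "ruby regex"]. The /m flag matters here because the div spans several lines and . does not match a newline without it.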

Multithreading

# Multi-thread: start five fetch workers and wait for all of them.
threads = Array.new(5) { Thread.new { fetch } }
threads.each(&:join)
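
When several threads pull work from a shared list, a thread-safe Queue avoids most manual locking. A minimal sketch of a small worker pool; run_workers is a hypothetical helper, and the block stands in for whatever fetch does per item:

```ruby
require 'thread' # harmless on current Ruby, needed for Queue on old versions

# Run worker_count threads that drain jobs and collect the block's results.
def run_workers(jobs, worker_count)
  queue = Queue.new
  jobs.each { |job| queue << job }
  results = []
  mutex = Mutex.new

  threads = Array.new(worker_count) do
    Thread.new do
      loop do
        job = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break           # queue drained, this worker is done
        end
        value = yield(job)
        mutex.synchronize { results << value } # guard the shared array
      end
    end
  end
  threads.each(&:join)
  results
end
```

For example, run_workers((1..20).to_a, 5) { |n| n * n } returns the twenty squares, in whatever order the threads finished them.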

Crawler example

https://github.com/feichashao/fetch_kw
Fetches Baidu search results and keywords.

References

http://rubylearning.com/satishtalim/ruby_socket_programming.html
http://www.w3cschool.cc/ruby/ruby-tutorial.html