Ruby Web Crawler Notes

Installation

On CentOS, install Ruby and the MySQL database.

# yum install ruby ruby-irb mysql mysql-server ruby-mysql

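Variables

Global variables start with $;
Instance variables start with @;
Local variables need no prefix.

$global_variable = 10     # global variable
@cust_id = id             # instance variable (id is assumed to be defined already, e.g. as a method argument)
var = "hehe"              # local variable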
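
The three kinds of variables differ mainly in where they are visible. A minimal sketch of that difference, using a hypothetical Customer class and a hypothetical $request_count counter (neither comes from the crawler itself):

```ruby
$request_count = 0          # global: visible everywhere in the program

class Customer
  def initialize(id)
    @cust_id = id           # instance variable: belongs to this one object
  end

  def describe
    label = "customer"      # local variable: exists only inside this method
    $request_count += 1
    "#{label} #{@cust_id}"
  end
end

# label was local to Customer#describe, so it is not defined here.
def local_visible_outside?
  defined?(label) ? true : false
end

puts Customer.new(7).describe
puts $request_count
puts local_visible_outside?
```

Globals are convenient in a short script, but anything shared between threads in a crawler is usually easier to reason about as an instance variable guarded by a lock.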

Methods (functions)

def method_name [( [arg [= default]]...[, * arg [, &expr ]])]
   expr..
end

If a method takes no arguments, it can be called by name alone, without parentheses.

method_name
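
The syntax summary above is dense, so here is a small sketch of each piece: a default value, a splat argument, and a block argument. The method names greet and crawler_name are made up for illustration.

```ruby
# greeting defaults to "Hello"; *names collects any extra positional
# arguments; &formatter captures an optional block.
def greet(greeting = "Hello", *names, &formatter)
  names = ["world"] if names.empty?
  names.map do |name|
    text = "#{greeting}, #{name}"
    formatter ? formatter.call(text) : text
  end
end

def crawler_name
  "fetch_kw"
end

puts crawler_name                 # no arguments, so no parentheses needed
puts greet                        # uses the default greeting
shouted = greet("Hi", "Ruby", "MySQL") { |s| s.upcase }
puts shouted
```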

Socket

require 'socket'      # socket is part of the standard library

hostname = 'localhost'
port = 2000

s = TCPSocket.open(hostname, port)

while line = s.gets   # read the socket one line at a time
  puts line.chomp     # print the line without its trailing newline
end
s.close               # close the socket
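
The client above needs something listening on port 2000. To try the read loop without an external server, one option is to start a throwaway TCPServer in the same script. This is only a sketch; the helper name read_lines_from_local_server is hypothetical.

```ruby
require 'socket'

# Start a temporary server on a free local port, then read its output
# with the same gets loop as the client above.
def read_lines_from_local_server(lines_to_send)
  server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
  port = server.addr[1]

  server_thread = Thread.new do
    client = server.accept
    lines_to_send.each { |l| client.puts(l) }
    client.close                            # closing makes the reader's gets return nil
  end

  received = []
  socket = TCPSocket.open('127.0.0.1', port)
  while (line = socket.gets)
    received << line.chomp
  end
  socket.close
  server_thread.join
  server.close
  received
end

p read_lines_from_local_server(["first line", "second line"])
```

The loop ends because gets returns nil once the server closes its side of the connection.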

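HTTP example

require 'socket'

host = 'www.w3cschool.cc'           # web server
port = 80                           # default HTTP port
path = "/index.htm"                 # path of the file to fetch

# Build the HTTP request; the Host header lets servers that host several sites pick the right one
request = "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"

socket = TCPSocket.open(host, port) # connect to the server
socket.print(request)               # send the request
response = socket.read              # read the complete response
socket.close
# Split the response at the first blank line into headers and body
headers, body = response.split("\r\n\r\n", 2)
print body                          # print the result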
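
A crawler usually wants the status code and individual headers as well as the body. The sketch below parses a raw response string like the one socket.read returns. The parse_http_response helper and the SAMPLE_RESPONSE text are made up for illustration, and the parser ignores details such as repeated or folded headers.

```ruby
# Split a raw HTTP response into status code, header hash and body.
def parse_http_response(raw)
  head, body = raw.split("\r\n\r\n", 2)
  status_line, *header_lines = head.split("\r\n")
  status = status_line.split(' ', 3)[1].to_i      # "HTTP/1.0 200 OK" -> 200
  headers = {}
  header_lines.each do |line|
    name, value = line.split(':', 2)
    headers[name.strip.downcase] = value.to_s.strip
  end
  [status, headers, body.to_s]
end

# Hand-written sample, not a real server reply.
SAMPLE_RESPONSE = "HTTP/1.0 200 OK\r\n" \
                  "Content-Type: text/html\r\n" \
                  "Content-Length: 13\r\n" \
                  "\r\n" \
                  "<html></html>"

status, headers, body = parse_http_response(SAMPLE_RESPONSE)
puts status
puts headers['content-type']
puts body
```

For real crawling, Ruby's standard net/http library handles this parsing, along with redirects and chunked responses.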

Regular expressions

Example

#!/usr/bin/ruby

line1 = "Cats are smarter than dogs"
line2 = "Dogs also like meat"

if line1 =~ /Cats(.*)/
  puts "Line1 contains Cats"
end
if line2 =~ /Dogs(.*)/
  puts "Line2 contains Dogs"
end

A longer example: extracting related keywords from a fetched page with regular expressions.

def content_handle(kw, content, db)
  # Store kw in the database. kw is interpolated directly, so it must be escaped before it reaches this point.
  db.query("INSERT INTO #{KW_TBL_NAME}(keyword) VALUES('#{kw}')")

  # Find the <div id="rs"> block that holds the related keywords
  result_div = /<div id="rs">(.*?)<\/div><div id=/m.match(content)
  return if result_div.nil?

  # scan returns one array per match, so flatten to get plain strings
  result_kw = result_div[1].scan(/<a.*?>(.*?)<\/a>/m).flatten

  # Queue the new keywords in @to_visit
  return if @to_visit.length > MAX_TO_VISIT
  result_kw.each do |rkw|
    @mutex.synchronize { @to_visit << rkw }
    puts "Got kw: #{rkw}"
  end
end
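
The two regular expressions in content_handle can be tried without a database or a network connection. In the sketch below, SAMPLE_HTML is a hand-written stand-in for the block being matched, not real Baidu markup, and related_keywords is a hypothetical helper.

```ruby
# Hand-written sample that mimics the <div id="rs"> block.
SAMPLE_HTML = '<div id="rs"><a href="/s?wd=ruby">ruby tutorial</a>' \
              '<a href="/s?wd=gem">ruby gems</a></div><div id="foot">'

def related_keywords(html)
  # Capture everything inside <div id="rs"> ... </div>, non-greedily.
  block = %r{<div id="rs">(.*?)</div><div id=}m.match(html)
  return [] if block.nil?
  # scan yields one single-element array per link, so flatten the result.
  block[1].scan(%r{<a.*?>(.*?)</a>}m).flatten
end

p related_keywords(SAMPLE_HTML)
p related_keywords("<html></html>")
```

Regular expressions like these break easily when the page layout changes. An HTML parser is the sturdier choice once the crawler grows.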

Multithreading

# Start five worker threads that each run fetch, then wait for all of them to finish
threads = Array.new(5) { Thread.new { fetch } }
threads.each(&:join)
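
When several threads share a list of work, a thread-safe Queue plus a Mutex around the shared results is one common arrangement. The sketch below assumes a stand-in fake_fetch method in place of the real fetch. Both fake_fetch and crawl_in_parallel are hypothetical names.

```ruby
require 'thread'   # needed on older Rubies; harmless on current ones

# Stand-in for the crawler's fetch: "process" one keyword.
def fake_fetch(keyword)
  keyword.upcase
end

def crawl_in_parallel(keywords, thread_count = 5)
  queue = Queue.new
  keywords.each { |kw| queue << kw }
  results = []
  mutex = Mutex.new

  threads = Array.new(thread_count) do
    Thread.new do
      loop do
        kw = begin
          queue.pop(true)          # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                    # no work left, so this worker stops
        end
        fetched = fake_fetch(kw)
        mutex.synchronize { results << fetched }
      end
    end
  end
  threads.each(&:join)
  results
end

p crawl_in_parallel(%w[ruby mysql socket]).sort
```

The order of results depends on thread scheduling, which is why the example sorts them before printing.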

Crawler example

https://github.com/feichashao/fetch_kw
Fetches Baidu search results and keywords.

References

http://rubylearning.com/satishtalim/ruby_socket_programming.html
http://www.w3cschool.cc/ruby/ruby-tutorial.html