Using the R XML package
R crawl test
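library(XML)      # readHTMLList()
library(stringr)  # str_locate(), str_sub(), fixed()
totalfile <- NULL
for (i in 1:247) {
  # Build the URL of listing page i
  url <- paste0("http://www.0933.me/user/40733/share/p/", i, ".html")
  # The third HTML list on the page is taken as the file listing
  x <- readHTMLList(url)[[3]]
  # Start of the literal ".pdf"; fixed() stops "." from acting as a regex wildcard
  la <- str_locate(x, fixed(".pdf"))[, 1]
  # Keep the text before the extension
  p <- str_sub(x, 1, la - 1)
  myfile <- data.frame(x = i, y = p)
  totalfile <- rbind(totalfile, myfile)
}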
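In a regular expression an unescaped `.` matches any character, so the pattern `.pdf` can match earlier in a file name than the extension itself. The following is a minimal base-R sketch of literal matching; `title_before_pdf` and the sample strings are hypothetical and are not part of the original script.

```r
# Hypothetical helper: return the text before the first literal ".pdf",
# or NA when the string contains no ".pdf".
title_before_pdf <- function(x) {
  # fixed = TRUE treats ".pdf" as plain text; as.integer() drops match attributes
  pos <- as.integer(regexpr(".pdf", x, fixed = TRUE))
  ifelse(pos > 0, substr(x, 1, pos - 1), NA_character_)
}

# Made-up sample items, not taken from the crawled site
items <- c("a_pdf_guide.pdf", "report_2002.pdf 3MB", "no extension here")
titles <- title_before_pdf(items)
print(titles)
```

With a regex pattern of `.pdf`, the first sample would be cut after `a`, because `_pdf` already matches; the literal match keeps `a_pdf_guide`.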
library(DT)
datatable(totalfile)
|    | x | y |
|----|---|---|
| 1  | 1 | 并购之路:20个世界500强企业的并购历程_12091959 |
| 2  | 1 | 圆明园的“记忆遗产”样式房图档635_12802785 |
| 3  | 1 | 明清吴语词典 |
| 4  | 1 | 结晶学导论第二版_12612940 |
| 5  | 1 | 结晶学导论第二版_12612940 |
| 6  | 1 | 命好不如习惯好_11072931_哈尔滨市:哈尔滨出版社_2002_郭腾尹著_Pg196 |
| 7  | 1 | 中国近代航运史资料第一辑下册1840-1895A5.1542_80407929 |
| 8  | 1 | 成人教育心理学_10824248 |
| 9  | 1 | 中国经济昆虫志第30册膜翅目胡蜂总科_10507895 |
| 10 | 1 | 洞见世界最富创意的广告公司BBDO_12784681 |

Showing 1 to 10 of 4,939 entries
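Rows 4 and 5 of the output carry the same title, so the listing appears to contain repeats. A minimal sketch of how such rows could be flagged and dropped with base R; `sample_rows` is a made-up stand-in for `totalfile`, with placeholder titles.

```r
# Hypothetical sample with the same columns as totalfile (x = page, y = title)
sample_rows <- data.frame(
  x = c(1, 1, 1),
  y = c("title_A_12612940", "title_A_12612940", "title_B_10824248"),
  stringsAsFactors = FALSE
)

# duplicated() marks every row that repeats an earlier row exactly
dup <- duplicated(sample_rows)
deduped <- sample_rows[!dup, ]

print(sum(dup))       # number of repeated rows
print(nrow(deduped))  # rows left after removing repeats
```

Applied to the real data this would be `totalfile[!duplicated(totalfile), ]`. It only removes rows that match on both page number and title, so the same title listed on two different pages would be kept.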