吉吉于

构建一个基于del.icio.us的链接推荐系统

这一节介绍从delicious,一个在线书签网站获取数据,利用这些数据查找品味相似的用户,并向他们推荐未曾看过的链接。

关于delicious

同flickr一样,delicious均为标签类站点,一个帮助用户共享他们喜欢网站链接的流行网站。标签的作用就是可以让人们对某一条目进行标注,如 添加词语以及短语。但就delicious而言,所进行标注的条目换成了书签。Delicious提供了一种简单共享网页的方法,它为无数互联网用户提供共享及分类他们喜欢的网页书签。

Delicious

使用Delicious的API:

http://code.google.com/p/pydelicious/downloads/list

下载pydelicious.py

运行需要事先安装setuptools还有feedparse,请确保python的这两个工具包安装完毕。

运行如下:

import pydelicious
pydelicious.get_popular(tag=’music’)

得到一组近期张贴的有关music的热门链接

[{‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’nakkheeran in’, ‘tags’: u’write’, ‘href’: u’http://nakkheeran.in/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’rionegro com ar’, ‘tags’: u’humor’, ‘href’: u’http://rionegro.com.ar/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’esposasymaridos com’, ‘tags’: u’loanspayday’, ‘href’: u’http://esposasymaridos.com/’, ‘user’: u’muzyleninun’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’bmw ca’, ‘tags’: u’art’, ‘href’: u’http://bmw.ca/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’selfseo info’, ‘tags’: u’blogs’, ‘href’: u’http://selfseo.info/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’casino zia park hobbs nm’, ‘tags’: u’blogs’, ‘href’: u’http://223.easylooklikek.com/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’casino morongo job fair’, ‘tags’: u’Lifestyle’, ‘href’: u’http://107.fastseekingonse.com/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u”Pok\xe9Beach.com, with ‘Pokemon Black’ and ‘Pokemon White’, Reshiram, Zekrom, Zoroark, Zorua, & ‘HS - Undaunted’ set info; your ultimate Pok\xe9mon TCG and Pok\xe9mon news site!”, ‘tags’: u’pokemon’, ‘href’: u’http://pokebeach.com/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’flog pl’, ‘tags’: u’daily’, ‘href’: u’http://flog.pl/’, ‘user’: u’’, ‘dt’: ‘’}, {‘count’: ‘’, ‘extended’: u’’, ‘hash’: ‘’, ‘description’: u’anwsion com’, ‘tags’: u’write’, ‘href’: u’http://anwsion.com/’, ‘user’: u’’, ‘dt’: ‘’}]

返回了一个包含字典的列表,每一项都包含了:URL,描述,提交者。

pydelicous里还有其他函数,比如get_urlposts:返回指定URL的所有张贴记录

get_userposts:返回给定用户的所有张贴记录

OK,下边我们构建一个基于del.icio.us的链接推荐系统

1.首先,我们需要利用pydelicious.py构造一个数据集,里边有多个用户,每个用户标记了其热爱的链接。

from pydelicious import get_popular,get_userposts,get_urlposts
import time

def initializeUserDict(tag,count=5):
  user_dict={}
  # get the top count' popular posts
  for p1 in get_popular(tag=tag)[0:count]:
    # find all users who posted this
    for p2 in get_urlposts(p1['href']):
      user=p2['user']
      user_dict[user]={}
  return user_dict

def fillItems(user_dict):
  all_items={}
  # Find links posted by all users
  for user in user_dict:
    for i in range(3):
      try:
        posts=get_userposts(user)
        break
      except:
        print "Failed user "+user+", retrying"
        time.sleep(4)
    for post in posts:
      url=post['href']
      user_dict[user][url]=1.0
      all_items[url]=1

  # Fill in missing items with 0
  for ratings in user_dict.values():
    for item in all_items:
      if item not in ratings:
        ratings[item]=0.0

保存为deliciousrec.py

运行:»>from deliciousrec import *

delusers=initializeUserDict(‘music’)

delusers[‘Lazynight’]={}

fillItems(delusers)

这样我们把获取的数据集保存到了delusers中

注意:获取数据集需要等待1分钟左右,量比较大,如果失败请重新尝试几次。

2.下边我们推荐给用户邻近的链接(下边的r为上节写过的r.py)

import random
user=delusers.keys()[random.randint(0,len(delusers)-1)]
user
u’nafosyxog’
r.topMatches(delusers,user)
[(0.14482758620689656, u’letorocoqet’), (0.03793103448275862, u’lydonahej’), (-0.042742263849460935, ‘Lazynight’), (-0.0526985286556622, u’tiqifojeg’), (-0.0526985286556622, u’syfasezeve’)]
r.getRecommendations(delusers,user)[0:10]
[(1.0, u’http://asus.com.cn/’), (0.7924528301886793, u’http://voici-news.fr/’), (0.7924528301886793, u’http://sheinside.com/’), (0.7924528301886793, u’http://setlktv.info/’), (0.7924528301886793, u’http://rpi.edu/’), (0.7924528301886793, u’http://imoneynews.com/’), (0.7924528301886793, u’http://desibbrg.com/’), (0.7924528301886793, u’http://cholotube.com/’), (0.20754716981132074, u’http://shahiya.com/’), (0.20754716981132074, u’http://pool.com/’)]
3.至此我们为delicious添加了一个推荐引擎~

下节说说基于物品的过滤~

 

转载请注明:于哲的博客 » 构建一个基于del.icio.us的链接推荐系统