【Netflix】Netflix Prize 竞赛数据集

09 Dec 2013

netflix不提供数据下载了，从网上down的一份，分享之。

NewImage

下载：http://pan.baidu.com/s/1hAdpU

相关：DVD在线租赁商 Netflix 于 2006 年 10 月 2 日发起一项竞赛：Netflix Prize，任何组织或个人只要能够提交比它现有电影推荐系统 Cinematch 效果好 10% 的新方法，就可以获得一百万美元的奖金。竞赛最多持续到 2011 年 10 月 2 日。同时，Netflix Prize 还提供每年五万美元的年度进步奖。2007 年年度进步奖由来自 AT&T 的 BellKor 小组夺得。

官网：http://www.netflixprize.com/

**
**

README:

SUMMARY

This dataset was constructed to support participants in the Netflix Prize. See
http://www.netflixprize.com for details about the prize.

The movie rating files contain over 100 million ratings from 480 thousand
randomly-chosen, anonymous Netflix customers over 17 thousand movie titles. The
data were collected between October, 1998 and December, 2005 and reflect the
distribution of all ratings received during this period. The ratings are on a
scale from 1 to 5 (integral) stars. To protect customer privacy, each customer
id has been replaced with a randomly-assigned id. The date of each rating and
the title and year of release for each movie id are also provided.

USAGE LICENSE

Netflix can not guarantee the correctness of the data, its suitability for any
particular purpose, or the validity of results based on the use of the data set.
The data set may be used for any research purposes under the following
conditions:

The user may not state or imply any endorsement from Netflix.
The user must acknowledge the use of the data set in
publications resulting from the use of the data set, and must
send us an electronic or paper copy of those publications.
The user may not redistribute the data without separate
permission.
The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from Netflix.

If you have any further questions or comments, please contact the Prize
administrator prizemaster@netflix.com

TRAINING DATASET FILE DESCRIPTION

The file “training_set.tar” is a tar of a directory containing 17770 files, one
per movie. The first line of each file contains the movie id followed by a
colon. Each subsequent line in the file corresponds to a rating from a customer
and its date in the following format:

CustomerID,Rating,Date

MovieIDs range from 1 to 17770 sequentially.
CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
Ratings are on a five star (integral) scale from 1 to 5.
Dates have the format YYYY-MM-DD.

MOVIES FILE DESCRIPTION

Movie information in “movie_titles.txt” is in the following format:

MovieID,YearOfRelease,Title

MovieID do not correspond to actual Netflix movie ids or IMDB movie ids.
YearOfRelease can range from 1890 to 2005 and may correspond to the release of
corresponding DVD, not necessarily its theaterical release.
Title is the Netflix movie title and may not correspond to
titles used on other sites. Titles are in English.

QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

The qualifying dataset for the Netflix Prize is contained in the text file
“qualifying.txt”. It consists of lines indicating a movie id, followed by a
colon, and then customer ids and rating dates, one per line for that movie id.
The movie and customer ids are contained in the training set. Of course the
ratings are withheld. There are no empty lines in the file.

MovieID1:
CustomerID11,Date11
CustomerID12,Date12
…
MovieID2:
CustomerID21,Date21
CustomerID22,Date22

For the Netflix Prize, your program must predict the all ratings the customers
gave the movies in the qualifying dataset based on the information in the
training dataset.

The format of your submitted prediction file follows the movie and customer id,
date order of the qualifying dataset. However, your predicted rating takes the
place of the corresponding customer id (and date), one per line.

For example, if the qualifying dataset looked like:

111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07

then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0

which predicts that customer 3245 would have rated movie 111 3.0 stars on the
19th of Decemeber, 2005, that customer 5666 would have rated it slightly higher
at 3.4 stars on the 23rd of Decemeber, 2005, etc.

You must make predictions for all customers for all movies in the qualifying
dataset.

THE PROBE DATASET FILE DESCRIPTION

To allow you to test your system before you submit a prediction set based on the
qualifying dataset, we have provided a probe dataset in the file “probe.txt”.
This text file contains lines indicating a movie id, followed by a colon, and
then customer ids, one per line for that movie id.

MovieID1:
CustomerID11
CustomerID12
…
MovieID2:
CustomerID21
CustomerID22

Like the qualifying dataset, the movie and customer id pairs are contained in
the training set. However, unlike the qualifying dataset, the ratings (and
dates) for each pair are contained in the training dataset.

If you wish, you may calculate the RMSE of your predictions against those
ratings and compare your RMSE against the Cinematch RMSE on the same data. See
http://www.netflixprize.com/faq#probe for that value.

Good luck!

MD5 SIGNATURES AND FILE SIZES

d2b86d3d9ba8b491d62a85c9cf6aea39 577547 movie_titles.txt
ed843ae92adbc70db64edbf825024514 10782692 probe.txt
88be8340ad7b3c31dfd7b6f87e7b9022 52452386 qualifying.txt
0e13d39f97b93e2534104afc3408c68c 567 rmse.pl
0098ee8997ffda361a59bc0dd1bdad8b 2081556480 training_set.tar

====================================分割线======================================

以下是个人对数据集格式的理解

统计

1亿打分

480189用户

17770电影

格式：

training_set

MovieID：

CustomerID,Rating,Date

mv_0000001.txt

1488844,3,2005-09-06

822109,5,2005-05-13

885013,4,2005-10-19

30878,4,2005-12-26

823519,3,2004-05-03

893988,3,2005-11-17

124105,4,2004-08-05

1248029,3,2004-04-22

mv_0000002.txt

2059652,4,2005-09-05

1666394,3,2005-04-19

1759415,4,2005-04-22

1959936,5,2005-11-21

998862,4,2004-11-13

2625420,2,2004-12-06

573975,3,2005-07-21

movie_titles.txt

MovieID,YearOfRelease,Title

1,2003,Dinosaur Planet

2,2004,Isle of Man TT 2004 Review

3,1997,Character

4,1994,Paula Abdul’s Get Up & Dance

5,2004,The Rise and Fall of ECW

6,1997,Sick

7,1992,8 Man

8,2004,What the #$*! Do We Know!?

qualifying.txt

参赛团队提交整个Qualifying Set上的预测评分值

MovieID1:

CustomerID11,Date11

CustomerID12,Date12

…

MovieID2:

CustomerID21,Date21

CustomerID22,Date22

1046323,2005-12-19

1080030,2005-12-23

1830096,2005-03-14

368059,2005-05-26

802003,2005-11-07

513509,2005-07-04

probe.txt

用来对照检测你对qualifying.txt预测的结果

MovieID1:

CustomerID11

CustomerID12

…

MovieID2:

CustomerID21

CustomerID22

30878

2647871

1283744

2488120

317050

1904905

1989766

14756

**
**

转载请注明：于哲的博客 » 【Netflix】Netflix Prize 竞赛数据集

吉吉于

free

【Netflix】Netflix Prize 竞赛数据集

SUMMARY

USAGE LICENSE

TRAINING DATASET FILE DESCRIPTION

MOVIES FILE DESCRIPTION

QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

THE PROBE DATASET FILE DESCRIPTION

MD5 SIGNATURES AND FILE SIZES

【Netflix】Netflix Prize 竞赛数据集

SUMMARY

USAGE LICENSE

TRAINING DATASET FILE DESCRIPTION

MOVIES FILE DESCRIPTION

QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

THE PROBE DATASET FILE DESCRIPTION

MD5 SIGNATURES AND FILE SIZES

Related Posts