...
I exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to iteratively find a solution of the matrix equation:
A*X=X
where A is a square matrix of dimension N, equal to the # of Wikipedia pages pointed to by any Wikipedia page, and X is the vector of the same dimension holding the final weight of each page (the PageRank value). For my problem N was 1e6..1e7.
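This is just power iteration; below is a minimal serial sketch of the math (plain NumPy on a toy 4-page link matrix of my own making, not the Hadoop code):

import numpy as np

# toy example: 4 pages; column j holds the out-link weights of page j, so columns sum to 1
# (the real N of 1e6..1e7 is far too large for a dense in-memory matrix, hence map/reduce)
A = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.5, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

N = A.shape[0]
X = np.full(N, 1.0 / N)          # start from a uniform weight vector

for _ in range(100):             # iterate until X stops changing, i.e. A*X = X
    X_new = A.dot(X)
    if abs(X_new - X).sum() < 1e-12:
        break
    X = X_new

print(X)                         # approximate PageRank weights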
I was given a dump of all Wikipedia pages HM5,6 in the format:
<page><title>The Title</title><text>The page body</text></page>, one line of text per page. The (human-typed) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
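For illustration, a simplified sketch of the kind of parsing the first step has to do on one such line (the regexes and the assumption that out-links appear as [[...]] are simplifications, not my exact code):

import re

def parse_page(line):
    # pull the title and the outgoing [[...]] wiki links out of one <page>...</page> line
    title = re.search(r'<title>(.*?)</title>', line)
    body = re.search(r'<text>(.*?)</text>', line, re.S)
    if not title or not body:
        return None
    links = re.findall(r'\[\[([^]|#]+)', body.group(1))
    return title.group(1), links

line = '<page><title>A</title><text>See [[B]] and [[C|c]].</text></page>'
print(parse_page(line))          # ('A', ['B', 'C'])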
I wrote 4 Python string-processing functions:
- init, converting the input text to <key,value> format (my particular choice of meaning for key and value)
- mapp and reduce functions, run as a pair over multiple iterations (a simplified sketch follows this list)
- finish, exporting the final list of pages ordered by PageRank.
- I allocated the smallest (least expensive) instances at EC2: ami=ami-6159bf08, instance_type=m1.small
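A stripped-down sketch of what the mapp/reduce pair does, written in Hadoop-streaming style (reading stdin, writing tab-separated key/value pairs to stdout). The record layout "title<TAB>rank<TAB>link1,link2,..." and the '!' marker are simplifications for the sketch, not my exact format:

# mapp.py -- one iteration of A*X: spread each page's current rank over its out-links
# assumed record format: "title<TAB>rank<TAB>link1,link2,...", one page per line
import sys

for line in sys.stdin:
    parts = line.rstrip('\n').split('\t')
    if len(parts) < 3:
        continue
    title, rank = parts[0], float(parts[1])
    links = [l for l in parts[2].split(',') if l]
    print('%s\t!%s' % (title, parts[2]))            # re-emit the link structure for the reducer
    for target in links:
        print('%s\t%f' % (target, rank / len(links)))

# reduce.py -- sum the incoming rank shares per page (streaming input arrives sorted by key)
import sys

current, rank, links = None, 0.0, ''

def emit(title, rank, links):
    print('%s\t%f\t%s' % (title, rank, links))

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current:
        if current is not None:
            emit(current, rank, links)
        current, rank, links = key, 0.0, ''
    if value.startswith('!'):                       # link-structure record
        links = value[1:]
    else:                                           # rank contribution from another page
        rank += float(value)

if current is not None:
    emit(current, rank, links)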
The goal was to perform all ini + N_iter + fin steps using 10 nodes & the Hadoop framework.
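For reference, a minimal driver sketch of how such a chain can be submitted via Hadoop streaming; the jar path, HDFS paths and script names here are placeholders, not my actual setup:

import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'   # path varies by install

def run_step(mapper, reducer, inp, out, reducers=10):
    # submit one streaming step and wait for it; each step reads 'inp' and writes 'out' on HDFS
    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-D', 'mapred.reduce.tasks=%d' % reducers,
        '-input', inp, '-output', out,
        '-mapper', mapper, '-reducer', reducer,
        '-file', mapper, '-file', reducer,
    ])

run_step('init.py', 'reduce.py', 'wiki/raw', 'wiki/iter0')
for i in range(2):                                   # N_iter = 2 in Test 1 below
    run_step('mapp.py', 'reduce.py', 'wiki/iter%d' % i, 'wiki/iter%d' % (i + 1))
# ... followed by the finish step that exports the final ranked page list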
Test 1: execution of the full chain, ini + 2 iter + fin, using a ~10% sub-set of Wikipedia pages (enwiki-20090929-one-page-per-line-part3)
- the unzipped file had a size of 2.2 GB ASCII and contained 1.1M lines (original pages) which pointed to 14M pages (outgoing links, including self references, non-unique). After the 1st iteration the # of lines (pages pointed to by any of the originals) grew to 5M pages and stabilized.
- I brought the part3.gz file to the master node & unzipped it on the /mnt disk, which has enough space (took a few minutes)
- I stuck to the default choice of 20 mappers and 10 reducers for every step (for the 10-node cluster)
Timing results
- copy local file to HDFS : ~2 minutes
- init : 410 sec
- mapp/reduce iter 0 : 300 sec
- mapp/reduce iter 1 : 180 sec
- finish : 190 sec
Total time was 20 minutes; 11 CPUs were involved.
Test 2: execution of a single map/reduce step on 27M linked pages, using the full set of Wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.
- the unzipped file had a size of 21 GB ASCII and contained 9M lines (original pages) which pointed to 142M pages (outgoing links, including self references, non-unique). After the 1st iteration (which I ran serially on a different machine) the # of lines (pages pointed to by any of the originals) grew to 27M pages.
- I brought the 1 GB (zipped) output of iteration 1 to the master node & unzipped it on the /mnt disk (took ~5 minutes for scp and ~5 for unzip)
- I ran 20 mappers and 10 reducers for this step (10-node cluster)
Timing results
- copy local file to HDFS: ~10 minutes. Hadoop decided to divide the data into 40 splits (and issued 40 mapp jobs)
- 3 mapp jobs finished after 8 minutes.
- 5 mapp jobs finished after 16 minutes.
- 16 mapp jobs finished after 29 minutes.
- all 40 mapp jobs finished after 42 minutes (one of the map jobs was restarted during this time)
- the reduce step failed for all 10 jobs after ~5 minutes, all 10 ~simultaneously
- Hadoop tried twice to reissue the 10 sort + 10 reduce jobs, and they failed again after another ~5 minutes
At this stage I killed the cluster. It was consuming 11 CPU-hours per hour and I had no clue how to debug it. I suspect some internal memory (HDFS?) limit was not sufficient to hold the sort results after the mapp tasks. My estimate is that the 3 GB of unzipped input could grow by a factor of a few - maybe there is a 10 GB limit I should change (or pay extra for?)
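If I retry, the first knobs I would look at are the map-output sort buffer and the per-task heap; these are guesses rather than a verified diagnosis, and the values below are placeholders. With a streaming driver like the one sketched above they could be passed as extra -D options:

# hypothetical overrides, inserted right after STREAMING_JAR in run_step() above
extra_opts = [
    '-D', 'io.sort.mb=200',                    # buffer (MB) for sorting map output before it spills
    '-D', 'mapred.child.java.opts=-Xmx512m',   # heap size of each map/reduce child JVM
]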