...
I exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to iteratively find a solution of the matrix equation:
A*X=X
where A is a square matrix of dimension N, equal to the # of Wikipedia pages pointed to by any Wikipedia page, and X is the vector of the same dimension holding the final weight of each page (the PageRank value). For my problem N was 1e6..1e7.
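This is just power iteration; below is a minimal serial sketch of the math (plain NumPy on a toy 4-page link matrix of my own making, not the Hadoop code):

import numpy as np

# toy example: 4 pages; column j holds the out-link weights of page j, so columns sum to 1
# (the real N of 1e6..1e7 is far too large for a dense in-memory matrix, hence map/reduce)
A = np.array([[0.0, 0.5, 0.0, 0.0],
              [0.5, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

N = A.shape[0]
X = np.full(N, 1.0 / N)          # start from a uniform weight vector

for _ in range(100):             # iterate until X stops changing, i.e. A*X = X
    X_new = A.dot(X)
    if abs(X_new - X).sum() < 1e-12:
        break
    X = X_new

print(X)                         # approximate PageRank weights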
I was given a dump of all Wikipedia pages HM5,6 in the format:
<page><title>The Title</title><text>The page body</text></page>, one line of text per page. The (human-typed) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
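For illustration, a simplified sketch of the kind of parsing the first step has to do on one such line (the regexes and the assumption that out-links appear as [[...]] are simplifications, not my exact code):

import re

def parse_page(line):
    # pull the title and the outgoing [[...]] wiki links out of one <page>...</page> line
    title = re.search(r'<title>(.*?)</title>', line)
    body = re.search(r'<text>(.*?)</text>', line, re.S)
    if not title or not body:
        return None
    links = re.findall(r'\[\[([^]|#]+)', body.group(1))
    return title.group(1), links

line = '<page><title>A</title><text>See [[B]] and [[C|c]].</text></page>'
print(parse_page(line))          # ('A', ['B', 'C'])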
I wrote 4 Python string-processing functions:
- init, converting the input text to <key,value> format (my particular choice of meaning for key and value)
- mapp and reduce functions, run as a pair over multiple iterations (a simplified sketch follows this list)
- finish, exporting the final list of pages ordered by PageRank.
- I allocated the smallest (least expensive) instances at EC2: ami=ami-6159bf08, instance_type=m1.small
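A stripped-down sketch of what the mapp/reduce pair does, written in Hadoop-streaming style (reading stdin, writing tab-separated key/value pairs to stdout). The record layout "title<TAB>rank<TAB>link1,link2,..." and the '!' marker are simplifications for the sketch, not my exact format:

# mapp.py -- one iteration of A*X: spread each page's current rank over its out-links
# assumed record format: "title<TAB>rank<TAB>link1,link2,...", one page per line
import sys

for line in sys.stdin:
    parts = line.rstrip('\n').split('\t')
    if len(parts) < 3:
        continue
    title, rank = parts[0], float(parts[1])
    links = [l for l in parts[2].split(',') if l]
    print('%s\t!%s' % (title, parts[2]))            # re-emit the link structure for the reducer
    for target in links:
        print('%s\t%f' % (target, rank / len(links)))

# reduce.py -- sum the incoming rank shares per page (streaming input arrives sorted by key)
import sys

current, rank, links = None, 0.0, ''

def emit(title, rank, links):
    print('%s\t%f\t%s' % (title, rank, links))

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current:
        if current is not None:
            emit(current, rank, links)
        current, rank, links = key, 0.0, ''
    if value.startswith('!'):                       # link-structure record
        links = value[1:]
    else:                                           # rank contribution from another page
        rank += float(value)

if current is not None:
    emit(current, rank, links)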
The goal was to perform all ini + N_iter + fin steps using 10 nodes & the Hadoop framework.
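For reference, a minimal driver sketch of how such a chain can be submitted via Hadoop streaming; the jar path, HDFS paths and script names here are placeholders, not my actual setup:

import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'   # path varies by install

def run_step(mapper, reducer, inp, out, reducers=10):
    # submit one streaming step and wait for it; each step reads 'inp' and writes 'out' on HDFS
    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-D', 'mapred.reduce.tasks=%d' % reducers,
        '-input', inp, '-output', out,
        '-mapper', mapper, '-reducer', reducer,
        '-file', mapper, '-file', reducer,
    ])

run_step('init.py', 'reduce.py', 'wiki/raw', 'wiki/iter0')
for i in range(2):                                   # N_iter = 2 in Test 1 below
    run_step('mapp.py', 'reduce.py', 'wiki/iter%d' % i, 'wiki/iter%d' % (i + 1))
# ... followed by the finish step that exports the final ranked page list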
Test 1: execution of the full chain, ini + 2 iter + fin, using a ~10% sub-set of Wikipedia pages (enwiki-20090929-one-page-per-line-part3)
- the unzipped file had a size of 2.2 GB ASCII and contained 1.1M lines (original pages) which pointed to 14M pages (outgoing links, including self references, non-unique). After the 1st iteration the # of lines (pages pointed to by any of the originals) grew to 5M pages and stabilized.
- I brought the part3.gz file to the master node & unzipped it on the /mnt disk, which has enough space (took a few minutes)
- I stuck to the default choice of 20 mappers and 10 reducers for every step (for the 10-node cluster)
Timing results
- copy local file to HDFS : ~2 minutes
- init : 410 sec
- mapp/reduce iter 0 : 300 sec
- mapp/reduce iter 1 : 180 sec
- finish : 190 sec
Total time was 20 minutes; 11 CPUs were involved.
Test 2: execution of a single map/reduce step on 27M linked pages, using the full set of Wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.
- the unzipped file had a size of 21 GB ASCII and contained 9M lines (original pages) which pointed to 142M pages (outgoing links, including self references, non-unique). After the 1st iteration (which I ran serially on a different machine) the # of lines (pages pointed to by any of the originals) grew to 27M pages.
- I brought the 1 GB (zipped) output of iteration 1 to the master node & unzipped it on the /mnt disk (took ~5 minutes for scp and ~5 for unzip)
- I ran 20 mappers and 10 reducers for this step (10-node cluster)
Timing results
- copy local file to HDFS: ~10 minutes. Hadoop decided to divide the data into 40 splits (and issued 40 mapp jobs)
- 3 mapp jobs finished after 8 minutes.
- 5 mapp jobs finished after 16 minutes.
- 16 mapp jobs finished after 29 minutes.
- all 40 mapp jobs finished after 42 minutes (one of the map jobs was restarted during this time)
- the reduce step failed for all 10 jobs after ~5 minutes, all 10 ~simultaneously
- Hadoop tried twice to reissue the 10 sort + 10 reduce jobs, and they failed again after another ~5 minutes
At this stage I killed the cluster. It was consuming 11 CPU-hours per hour and I had no clue how to debug it. I suspect some internal memory (HDFS?) limit was not sufficient to hold the sort results after the mapp tasks. My estimate is that the 3 GB of unzipped input could grow by a factor of a few - maybe there is a 10 GB limit I should change (or pay extra for?)
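If I retry, the first knobs I would look at are the map-output sort buffer and the per-task heap; these are guesses rather than a verified diagnosis, and the values below are placeholders. With a streaming driver like the one sketched above they could be passed as extra -D options:

# hypothetical overrides, inserted right after STREAMING_JAR in run_step() above
extra_opts = [
    '-D', 'io.sort.mb=200',                    # buffer (MB) for sorting map output before it spills
    '-D', 'mapred.child.java.opts=-Xmx512m',   # heap size of each map/reduce child JVM
]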