...
- init converting input text to <key,value> format (my particular choice of the meaning )
- mapp and reduce functions, run in pair, multiple iterations
- finish function exporting final list of pages ordered by page rank.
- I allocated the smallest (least expensive) CPUs at EC2 : ami=ami-6159bf08, instance_type=m1.small
The goal was to perform all ini + N_iter + fin steps using 10 nodes & hadoop framework.
Test 1: execution of the full chain for
...
ini +2 iter +
...
fin
using a ~10% sub-set of wikipedia pages (enwiki-20090929-one-page-per-line-part3)
...
- copy local file to HDFS : ~2 minutes
- init : 410 sec
- mapp/reduce iter 0 : 300 sec
- mapp/reduce iter 1 : 180 sec
- finish : 190 sec
Total time was 20 minutes , 11 CPUs were involved.
Test 2: execution of a single map/reduce step on 27M linked pages,
using full set of wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I did minor modification of map/reduce code which could slow it down by ~20%-30%.
...