EBS permanent disk (not reliable for me)

Nov 14 : EBS disk loses information after it is disconnected and reconnected. I used the following commands:

mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls 

The directory structure seems to be fine, but when I try to read the files I get this error for some of them.

[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2

Note: the EBS disk was partitioned & formatted with exactly the same operating system in the previous session:

fdisk /dev/sdf
mkfs.ext3 /dev/sdf1

File transfer

Nov 13 :
*scp RCF --> Amazon, 3 MB/sec, ~GB files
*scp Amazon --> Amazon, 5-8 MB/sec, ~GB files

Nov 14 : transfer of 1GB from RCF <--> Amazon takes ~5 minutes.
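
A quick sanity check of those numbers (a sketch in Python; the file size and rates are just the rounded values quoted above):

# rough consistency check of the quoted scp rates (numbers are the rounded values from above)
size_mb = 1024.0                      # ~1 GB file expressed in MB
for rate_mb_s in (3.0, 5.0, 8.0):     # MB/sec, as measured
    minutes = size_mb / rate_mb_s / 60.0
    print("%.0f MB/sec -> %.1f minutes per GB" % (rate_mb_s, minutes))
# 3 MB/sec gives ~5.7 minutes per GB, consistent with the ~5 minute RCF <--> Amazon transfer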

Launching nodes

Nov 13 :
*Matt's customized Ubuntu w/o STAR software - 4-6 minutes, the smallest machine $0.10
*default public Fedora from EC2 : ~2 minutes
*launching a Cloudera cluster of 1+4 or 1+10 seems to take a similar time of ~5 minutes

Nov 14 :

*there is a limit of 20 on the # of EC2 machines I could launch at once with the command: hadoop-ec2 launch-cluster my-hadoop-cluster 19
'20' would not work. This is my configuration:

> cat ~/.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no

Make sure to assign the proper availability zone if you use an EBS disk: the volume and the instances must be in the same zone.

Computing speed

Task description

I have exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to iteratively find the solution of the matrix equation:
A*X = X
where A is a square matrix of dimension N equal to the # of Wikipedia pages pointed to by any Wikipedia page, and X is the vector of the same dimension describing the ultimate weight of a given page (the PageRank value). The N of my problem was 1e6..1e7.
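
For reference, a minimal single-machine sketch of this iteration on a toy link graph (this is not the Hadoop code described below; the damping factor and the variable names are my own choices for illustration):

# Toy power-iteration sketch of A*X = X (illustrative only)
links = {1: [2, 3], 2: [3], 3: [1]}          # page -> pages it points to
N = len(links)
d = 0.85                                      # damping factor, a common PageRank choice (assumption)
x = dict((p, 1.0 / N) for p in links)         # initial uniform weights
for _ in range(20):                           # fixed number of iterations for the sketch
    nxt = dict((p, (1.0 - d) / N) for p in links)
    for p, outs in links.items():
        share = d * x[p] / len(outs)          # this page's rank is split among its outlinks
        for q in outs:
            nxt[q] += share
    x = nxt
print(sorted(x.items(), key=lambda kv: -kv[1]))   # pages ordered by rank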

I was given a dump of all Wikipedia pages HM5,6 in the format:
<page><title>The Title</title><text>The page body</text></page>, one line of text per page; the (human typed in) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
I wrote 4 Python string-processing functions:

  1. init : converts the input text to <key,value> format (my particular choice of the meaning)
  2. mapp and reduce functions, run as a pair over multiple iterations (see the sketch after this list)
  3. finish : exports the final list of pages ordered by PageRank
  4. I allocated the smallest (least expensive) CPUs at EC2 : ami=ami-6159bf08, instance_type=m1.small
    The goal was to perform all init + N_iter + finish steps using 10 nodes & the Hadoop framework.
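
As a concrete (hypothetical) illustration of item 2, one mapp/reduce iteration could be written as a pair of Hadoop streaming scripts; the record layout page<TAB>rank<TAB>comma-separated-outlinks is my assumption here, not necessarily the format I actually used:

#!/usr/bin/env python
# mapp.py (illustrative sketch): distribute each page's rank among its outlinks
import sys
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        continue
    page, rank = parts[0], float(parts[1])
    outlinks = parts[2].split(",") if len(parts) > 2 and parts[2] else []
    print("%s\t!%s" % (page, ",".join(outlinks)))        # re-emit the link structure
    for target in outlinks:
        print("%s\t%.10f" % (target, rank / len(outlinks)))

#!/usr/bin/env python
# reduce.py (illustrative sketch): sum the contributions arriving for each page
import sys
def flush(page, total, outlinks):
    if page is not None:
        print("%s\t%.10f\t%s" % (page, total, outlinks))
cur, total, outs = None, 0.0, ""
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t", 1)
    if page != cur:
        flush(cur, total, outs)
        cur, total, outs = page, 0.0, ""
    if value.startswith("!"):
        outs = value[1:]              # link structure record
    else:
        total += float(value)         # rank contribution
flush(cur, total, outs)

Such a pair would be submitted through the Hadoop streaming jar (-mapper mapp.py -reducer reduce.py); I omit the exact hadoop command line here.
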
Test 1: execution of the full chain: init + 2 iterations + finish

using a ~10% subset of the Wikipedia pages (enwiki-20090929-one-page-per-line-part3)

  1. copy local file to HDFS : ~2 minutes
  2. init : 410 sec
  3. mapp/reduce iter 0 : 300 sec
  4. mapp/reduce iter 1 : 180 sec
  5. finish : 190 sec
    Total time was 20 minutes; 11 CPUs were involved.
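
The quoted step times do add up to the total (a trivial check, taking the HDFS copy as 2 minutes):

# sum of the Test 1 step times listed above (HDFS copy taken as 120 sec)
steps_sec = [120, 410, 300, 180, 190]
total = sum(steps_sec)
print("%d sec = %.0f minutes" % (total, total / 60.0))   # 1200 sec = 20 minutes
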
Test 2: execution of a single map/reduce step on 27M linked pages,

using the full set of Wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.

  1. copy local file to HDFS : ~10 minutes. Hadoop decided to divide the data into 40 splits (and issued 40 mapp jobs)
  2. 3 mapp jobs finished after 8 minutes.
  3. 5 mapp jobs finished after 16 minutes.
  4. 16 mapp jobs finished after 29 minutes.
  5. all 40 mapp jobs finished after 42 minutes (one of the map jobs was restarted during this time)
  6. reduce failed for all 10 jobs after ~5 minutes, all 10 nearly simultaneously
  7. Hadoop tried twice to reissue the 10 sort + 10 reduce jobs, and they failed again after another ~5 minutes
    At this stage I killed the cluster. It was consuming 11 CPU-hours per hour and I had no clue how to debug it. I suspect some internal storage (HDFS?) limit was not sufficient to hold the sort results after the mapp tasks. My estimate is that the 3GB of unzipped input could grow by a factor of a few - maybe there is a 10GB limit I should change (or pay extra for?)
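
A back-of-the-envelope version of that guess (a sketch; the growth factors and the 10GB figure are assumptions taken from the sentence above):

# rough estimate of intermediate (post-mapp) data volume vs. a hypothetical 10GB limit
input_gb = 3.0                     # unzipped input size, from the estimate above
limit_gb = 10.0                    # the suspected limit
for growth in (2, 3, 4):           # "factor of a few" growth of mapp output relative to input
    out_gb = input_gb * growth
    print("growth x%d -> %.0f GB (%s the %.0f GB limit)"
          % (growth, out_gb, "over" if out_gb > limit_gb else "under", limit_gb))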