EBS permanent disk (not reliable for me)

Nov 14 : EBS disk loses information after it is disconnected and reconnected. I used the following commands:

mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls 

The directory structure seems to be fine, but when I try to read the files I get this error for some of them.

[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2

Note: the EBS disk was partitioned & formatted with exactly the same operating system in the previous session:

fdisk /dev/sdf
mkfs.ext3 /dev/sdf1

File transfer

Nov 13 :
*scp RCF --> Amazon, 3 MB/sec, ~GB files
*scp Amazon --> Amazon, 5-8 MB/sec, ~GB files

Nov 14 : transfer of 1GB from RCF <--> Amazon takes ~5 minutes.
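
A quick sanity check of those numbers (a sketch in Python; the file size and rates are just the rounded values quoted above):

# rough consistency check of the quoted scp rates (numbers are the rounded values from above)
size_mb = 1024.0                      # ~1 GB file expressed in MB
for rate_mb_s in (3.0, 5.0, 8.0):     # MB/sec, as measured
    minutes = size_mb / rate_mb_s / 60.0
    print("%.0f MB/sec -> %.1f minutes per GB" % (rate_mb_s, minutes))
# 3 MB/sec gives ~5.7 minutes per GB, consistent with the ~5 minute RCF <--> Amazon transfer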

Launching nodes

Nov 13 :
*Matt's customized Ubuntu w/o STAR software - 4-6 minutes, the smallest machine $0.10
*default public Fedora from EC2 : ~2 minutes
*launching a Cloudera cluster of 1+4 or 1+10 seems to take a similar time of ~5 minutes

Nov 14 :

*there is a limit of 20 on the # of EC2 machines I could launch at once with the command: hadoop-ec2 launch-cluster my-hadoop-cluster 19
'20' would not work. This is my configuration:

> cat ~/.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no

Make sure to assign the proper availability zone if you use an EBS disk: the volume and the instances must be in the same zone.

Computing speed

Task description

I have exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to iteratively find the solution of the matrix equation:
A*X = X
where A is a square matrix of dimension N equal to the # of Wikipedia pages pointed to by any Wikipedia page, and X is the vector of the same dimension describing the ultimate weight of a given page (the PageRank value). The N of my problem was 1e6..1e7.
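
For reference, a minimal single-machine sketch of this iteration on a toy link graph (this is not the Hadoop code described below; the damping factor and the variable names are my own choices for illustration):

# Toy power-iteration sketch of A*X = X (illustrative only)
links = {1: [2, 3], 2: [3], 3: [1]}          # page -> pages it points to
N = len(links)
d = 0.85                                      # damping factor, a common PageRank choice (assumption)
x = dict((p, 1.0 / N) for p in links)         # initial uniform weights
for _ in range(20):                           # fixed number of iterations for the sketch
    nxt = dict((p, (1.0 - d) / N) for p in links)
    for p, outs in links.items():
        share = d * x[p] / len(outs)          # this page's rank is split among its outlinks
        for q in outs:
            nxt[q] += share
    x = nxt
print(sorted(x.items(), key=lambda kv: -kv[1]))   # pages ordered by rank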

I was given a dump of all Wikipedia pages HM5,6 in the format:
<page><title>The Title</title><text>The page body</text></page>, one line of text per page; the (human typed in) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
I wrote 4 Python string-processing functions:

  1. init : converts the input text to <key,value> format (my particular choice of the meaning)
  2. mapp and reduce functions, run as a pair over multiple iterations (see the sketch after this list)
  3. finish : exports the final list of pages ordered by PageRank
  4. I allocated the smallest (least expensive) CPUs at EC2 : ami=ami-6159bf08, instance_type=m1.small
    The goal was to perform all init + N_iter + finish steps using 10 nodes & the Hadoop framework.
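
As a concrete (hypothetical) illustration of item 2, one mapp/reduce iteration could be written as a pair of Hadoop streaming scripts; the record layout page<TAB>rank<TAB>comma-separated-outlinks is my assumption here, not necessarily the format I actually used:

#!/usr/bin/env python
# mapp.py (illustrative sketch): distribute each page's rank among its outlinks
import sys
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        continue
    page, rank = parts[0], float(parts[1])
    outlinks = parts[2].split(",") if len(parts) > 2 and parts[2] else []
    print("%s\t!%s" % (page, ",".join(outlinks)))        # re-emit the link structure
    for target in outlinks:
        print("%s\t%.10f" % (target, rank / len(outlinks)))

#!/usr/bin/env python
# reduce.py (illustrative sketch): sum the contributions arriving for each page
import sys
def flush(page, total, outlinks):
    if page is not None:
        print("%s\t%.10f\t%s" % (page, total, outlinks))
cur, total, outs = None, 0.0, ""
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t", 1)
    if page != cur:
        flush(cur, total, outs)
        cur, total, outs = page, 0.0, ""
    if value.startswith("!"):
        outs = value[1:]              # link structure record
    else:
        total += float(value)         # rank contribution
flush(cur, total, outs)

Such a pair would be submitted through the Hadoop streaming jar (-mapper mapp.py -reducer reduce.py); I omit the exact hadoop command line here.
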
Test 1: execution of the full chain: init + 2 iterations + finish

using a ~10% subset of the Wikipedia pages (enwiki-20090929-one-page-per-line-part3)

  1. copy local file to HDFS : ~2 minutes
  2. init : 410 sec
  3. mapp/reduce iter 0 : 300 sec
  4. mapp/reduce iter 1 : 180 sec
  5. finish : 190 sec
    Total time was 20 minutes; 11 CPUs were involved.
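
The quoted step times do add up to the total (a trivial check, taking the HDFS copy as 2 minutes):

# sum of the Test 1 step times listed above (HDFS copy taken as 120 sec)
steps_sec = [120, 410, 300, 180, 190]
total = sum(steps_sec)
print("%d sec = %.0f minutes" % (total, total / 60.0))   # 1200 sec = 20 minutes
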
Test 2: execution of a single map/reduce step on 27M linked pages,

using the full set of Wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.

  1. copy local file to HDFS : ~10 minutes. Hadoop decided to divide the data into 40 splits (and issued 40 mapp jobs)
  2. 3 mapp jobs finished after 8 minutes.
  3. 5 mapp jobs finished after 16 minutes.
  4. 16 mapp jobs finished after 29 minutes.
  5. all 40 mapp jobs finished after 42 minutes (one of the map jobs was restarted during this time)
  6. reduce failed for all 10 jobs after ~5 minutes, all 10 nearly simultaneously
  7. Hadoop tried twice to reissue the 10 sort + 10 reduce jobs, and they failed again after another ~5 minutes
    At this stage I killed the cluster. It was consuming 11 CPU-hours per hour and I had no clue how to debug it. I suspect some internal storage (HDFS?) limit was not sufficient to hold the sort results after the mapp tasks. My estimate is that the 3GB of unzipped input could grow by a factor of a few - maybe there is a 10GB limit I should change (or pay extra for?)
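
A back-of-the-envelope version of that guess (a sketch; the growth factors and the 10GB figure are assumptions taken from the sentence above):

# rough estimate of intermediate (post-mapp) data volume vs. a hypothetical 10GB limit
input_gb = 3.0                     # unzipped input size, from the estimate above
limit_gb = 10.0                    # the suspected limit
for growth in (2, 3, 4):           # "factor of a few" growth of mapp output relative to input
    out_gb = input_gb * growth
    print("growth x%d -> %.0f GB (%s the %.0f GB limit)"
          % (growth, out_gb, "over" if out_gb > limit_gb else "under", limit_gb))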