Notes on EC2 performance

EBS permanent disk (not reliable for me)

Nov 14 : EBS disk loses information after it is disconnected and reconnected. I used the following commands to mount it again:

mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls

The file structure seems to be fine, but reading some of the files fails with an input/output error:

[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2
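
The `cat * >/dev/null` check above can be extended to a whole directory tree. This is a minimal sketch, not what was run in the session: the function name `find_unreadable_files` is hypothetical, and the `/storage` mount point is taken from the commands above.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Read every file under root; return (path, error) for each failure."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large files need not fit in memory.
                    while handle.read(chunk_size):
                        pass
            except OSError as err:
                # An "Input/output error" from the disk surfaces here.
                failures.append((path, str(err)))
    return failures


if __name__ == "__main__":
    for path, err in find_unreadable_files("/storage"):
        print(f"{path}: {err}")
```

An empty result only means every byte could be read; it does not prove that the content is unchanged.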

Note: the EBS disk was partitioned and formatted on exactly the same operating system in the previous session:

fdisk /dev/sdf
mkfs.ext3 /dev/sdf1

File transfer

Nov 13
*scp RCF --> Amazon, 3MB/sec, ~GB files;
*scp Amazon-->Amazon, 5-8 MB/sec, ~GB files

Problem with large (~0.5+ GB) file transfers. There are 2 types of disks:
- local volatile /mnt of size ~140GB
- permanent EBS storage (size ~$$$)
  scp of a binary file (xxx.gz) to the EBS disk resulted in corruption (gunzip would complain). Once the file size was off by 1 bit (of 0.4GB). The corruption was random; repeated transfers would succeed after several trials. If multiple scp transfers ran simultaneously, it got worse.
  Once I changed the destination to the /mnt disk and did one transfer at a time, all problems were gone: I copied 3 files of 1GB with scp without a glitch. Later I copied the files from /mnt to the EBS disk (took ~5 minutes per GB).
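
Corruption like this can be caught right after each transfer instead of at gunzip time. The sketch below is an illustration, not part of the original session; the helper names `sha256_of_file` and `gzip_is_intact` are hypothetical. It computes a checksum to compare between source and destination, and decompresses the whole .gz file so that the CRC and length stored in the gzip trailer get checked.

```python
import gzip
import hashlib
import zlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file; return False if any integrity check fails."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC mismatch, truncated stream, or corrupt deflate data.
        return False
    return True
```

Running `sha256_of_file` on both ends and comparing the two digests detects any difference, including a single flipped bit. `gzip_is_intact` needs only the destination copy.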

Nov 14: transfer of 1GB between RCF and Amazon (either direction) takes ~5 minutes.
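
As a cross-check, 1GB in ~5 minutes can be converted to a rate and compared with the scp rates measured on Nov 13. The small calculation below assumes 1GB = 1024MB; the function name is hypothetical.

```python
def throughput_mb_per_s(size_gb, minutes):
    """Average transfer rate in MB/s, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / (minutes * 60)


print(round(throughput_mb_per_s(1, 5), 1))  # 3.4
```

About 3.4 MB/sec is consistent with the ~3MB/sec seen for RCF --> Amazon.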

Launching nodes

Nov 13 :
*Matt's customized Ubuntu w/o STAR software - 4-6 minutes, the smallest machine $0.10
*default public Fedora from EC2 : ~2 minutes
*launching a Cloudera cluster of 1+4 or 1+10 nodes takes a similar time, ~5 minutes

Nov 14 :

*there is a limit of 20 on the number of EC2 machines I could launch at once with the command: hadoop-ec2 launch-cluster my-hadoop-cluster19
'20' would not work. This is my configuration:

> cat */.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no

Make sure to assign the proper availability zone if you use an EBS disk: a volume can only be attached to an instance running in the same zone.
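
The zone check can be done before launching. This sketch makes some assumptions: the listing above shows no [section] header, so a placeholder one is added for parsing, and the helper names `cluster_zone` and `zones_match` are hypothetical. The zone of the EBS volume has to be looked up separately and passed in.

```python
import configparser

# Settings copied from the listing above.
CONFIG_TEXT = """\
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
"""


def cluster_zone(config_text, section="cluster"):
    """Return the availability_zone setting from the cluster configuration."""
    parser = configparser.ConfigParser()
    # Assumption: prepend a placeholder section header so the text parses.
    parser.read_string(f"[{section}]\n{config_text}")
    return parser[section]["availability_zone"]


def zones_match(config_text, volume_zone):
    """True if the cluster would start in the same zone as the EBS volume."""
    return cluster_zone(config_text) == volume_zone


print(zones_match(CONFIG_TEXT, "us-east-1a"))  # True
```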

Computing speed

Task description

I exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages.
I was given a dump of all Wikipedia pages HM5,6 in the format:
<page><title>The Title</title><text>The page body</text></page>, one line of text per page. The (human-typed) content was extremely inhomogeneous and multilingual, with many random characters and typos.
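
To illustrate the input format, one line can be split into title and body with a regular expression. This is a sketch only, not the code used in the test. It assumes the body contains wiki markup where links are written as [[Target]] or [[Target|label]], which the notes do not state; `parse_page` and `extract_links` are hypothetical names.

```python
import re

PAGE_RE = re.compile(r"<page><title>(.*?)</title><text>(.*?)</text></page>")
# Assumption: links look like [[Target]] or [[Target|label]] in the body.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def parse_page(line):
    """Return (title, body) for one input line, or None if it is malformed."""
    match = PAGE_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def extract_links(body):
    """Return outgoing link targets in order, duplicates included."""
    return [target.strip() for target in LINK_RE.findall(body) if target.strip()]


line = "<page><title>A</title><text>see [[B]] and [[C|label]]</text></page>"
title, body = parse_page(line)
print(title, extract_links(body))  # A ['B', 'C']
```

Returning None for malformed lines lets the caller skip them, which matters given the many random characters and typos in the dump.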
I wrote 4 Python string-processing functions:

*init: converts the input text to <key,value> format (the meaning of key and value was my particular choice)
*mapp and reduce: run as a pair, over multiple iterations
*finish: exports the final list of pages ordered by page rank

I allocated the smallest (least expensive) CPUs at EC2: ami=ami-6159bf08, instance_type=m1.small
The goal was to perform all init + N_iter + finish steps using 10 nodes and the Hadoop framework.
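
One map/reduce iteration of PageRank can be sketched in plain Python, without Hadoop. This is not the code used in the test: the <key,value> layout, the damping factor of 0.85 and the names `map_page`, `reduce_page` and `iterate` are all assumptions made for illustration.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional value; the notes do not state the one used


def map_page(title, rank, links):
    """Emit the page's link list plus an equal rank share per outgoing link."""
    yield title, ("links", links)
    for target in links:
        yield target, ("rank", rank / len(links))


def reduce_page(title, values):
    """Sum the incoming rank shares and reattach the page's own link list."""
    links, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            incoming += payload
    return title, (1.0 - DAMPING) + DAMPING * incoming, links


def iterate(pages):
    """Run one map/reduce pass over {title: (rank, links)} in memory."""
    grouped = defaultdict(list)
    for title, (rank, links) in pages.items():
        for key, value in map_page(title, rank, links):
            grouped[key].append(value)  # stands in for Hadoop's shuffle/sort
    result = {}
    for key, values in grouped.items():
        name, rank, links = reduce_page(key, values)
        result[name] = (rank, links)
    return result


pages = {"A": (1.0, ["B", "C"])}
print(sorted(iterate(pages)))  # ['A', 'B', 'C']
```

In this sketch a page that is only pointed to gets its own record after the first pass, so the number of records grows from 1 to 3 in the example.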

Test 1: execution of the full chain (1 init + 2 iterations + 1 finish job), using a ~10% subset of Wikipedia pages (enwiki-20090929-one-page-per-line-part3)

The unzipped file was 2.2GB of ASCII and contained 1.1M lines (original pages), which pointed to 14M pages (outgoing links, including self-references, non-unique). After the 1st iteration the number of lines (pages pointed to by any of the original pages) grew to 5M and stabilized.
I brought the part3.gz file to the master node and unzipped it on the /mnt disk, which has enough space (took a few minutes).
I stuck to the default choice of 20 mappers and 10 reducers for every step (for the 10-node cluster).
Timing results

copy local file to HDFS : ~2 minutes
init : 410 sec
mapp/reduce iter 0 : 300 sec
mapp/reduce iter 1 : 180 sec
finish : 190 sec
Total time was 20 minutes; 11 CPUs were involved.
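
The total can be checked by adding up the steps listed above. The "~2 minutes" for the HDFS copy is taken as 120 sec, which is an assumption; the node-minutes figure is derived here and was not measured.

```python
STEP_SECONDS = {
    "copy to HDFS": 2 * 60,  # "~2 minutes" taken as 120 sec
    "init": 410,
    "mapp/reduce iter 0": 300,
    "mapp/reduce iter 1": 180,
    "finish": 190,
}
NODES = 11  # 1 master + 10 nodes

total_seconds = sum(STEP_SECONDS.values())
total_minutes = total_seconds / 60
node_minutes = NODES * total_minutes

print(total_seconds, total_minutes, node_minutes)  # 1200 20.0 220.0
```

The init step alone accounts for about a third of the wall-clock time (410 of 1200 sec).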
