h4. EBS permanent disk (not reliable for me)

Nov 14: The EBS disk loses information after it is disconnected and reconnected. I used the following commands:

{code}
mkdir /storage
mount /dev/sdf1 /storage
cd /storage
ls
{code}
The file structure seems to be fine, but when I try to read the files I get this error for *some* of them:

{code}
[root@ip-10-251-198-4 wL-pages-iter2]# cat * >/dev/null
cat: wL1part-00000: Input/output error
cat: wL1part-00004: Input/output error
cat: wL1part-00005: Input/output error
cat: wL1part-00009: Input/output error
cat: wL2part-00000: Input/output error
cat: wL2part-00004: Input/output error
[root@ip-10-251-198-4 wL-pages-iter2]# pwd
/storage/iter/bad/wL-pages-iter2
{code}
Note: the EBS disk was partitioned & formatted on exactly the same operating system in the previous session:

{code}
fdisk /dev/sdf
mkfs.ext3 /dev/sdf1
{code}


h4. File transfer

Nov 13:
* scp RCF --> Amazon, 3 MB/sec, ~GB files
* scp Amazon --> Amazon, 5-8 MB/sec, ~GB files
* Problem w/ large (~0.5+ GB) file transfers: there are 2 types of disks:
** local volatile /mnt of size ~140 GB
** permanent EBS storage (size ~$$$)

scp of a binary (xxx.gz) to the EBS disk resulted in corruption (gunzip would complain). Once the file size was off by 1 bit (out of 0.4 GB). It was random; multiple transfers would succeed after several trials. If multiple scp's were run simultaneously it got worse.

Once I changed the destination to the /mnt disk and did one transfer at a time, all problems were gone - I scp'd 3 files of 1 GB w/o a glitch. Later I copied the files from /mnt to the EBS disk (it took ~5 minutes per GB).
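A cheap way to catch this kind of silent corruption is to checksum the files on both ends of the transfer and compare. A minimal sketch (the file names are just whatever you pass on the command line; this is not part of the original workflow):

{code}
# md5check.py - sketch for verifying transfers: run on the source host and on the
# destination, then compare the printed checksums (any mismatch = corrupted copy).
import hashlib
import sys

def md5_of(path, chunk=1024 * 1024):
    """MD5 of a file, read in 1 MB chunks so GB-sized files do not fill memory."""
    h = hashlib.md5()
    f = open(path, 'rb')
    while True:
        block = f.read(chunk)
        if not block:
            break
        h.update(block)
    f.close()
    return h.hexdigest()

if __name__ == '__main__':
    for path in sys.argv[1:]:            # e.g. python md5check.py /mnt/*.gz
        print("%s  %s" % (md5_of(path), path))
{code}

The command-line md5sum tool does the same thing; the point is simply to compare source vs. destination before gunzip ever sees the file.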

Nov 14: transfer of 1 GB rcf <--> Amazon takes ~5 minutes.

h4. Launching nodes

Nov 13:
* Matt's customized Ubuntu w/o STAR software - 4-6 minutes, the smallest machine $0.10
* default public Fedora from EC2: ~2 minutes
* launching a Cloudera cluster of 1+4 or 1+10 seems to take a similar time of ~5 minutes

Nov 14:
* there is a limit of 20 on the # of EC2 machines I could launch at once with the command *hadoop-ec2 launch-cluster my-hadoop-cluster 19*; '20' would not work. This is my configuration:

{code}
> cat ~/.hadoop-ec2/ec2-clusters.cfg
ami=ami-6159bf08
instance_type=m1.small
key_name=janAmazonKey2
availability_zone=us-east-1a
private_key=/home/training/.ec2/id_rsa-janAmazonKey2
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
{code}
Make sure to assign the proper zone if you use an EBS disk - the volume can only be attached to instances in the same availability zone.
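For reference, a tiny sketch that parses the cfg shown above and compares its availability_zone against the zone of the EBS volume (looked up by hand, e.g. in the AWS console). The volume-zone constant is an assumption to fill in; this is not part of the hadoop-ec2 tools:

{code}
# check_zone.py - sketch: warn if the cluster cfg zone differs from the EBS volume zone.
import sys

EBS_VOLUME_ZONE = 'us-east-1a'   # assumption: fill in the zone reported for your volume

def read_cfg(path):
    """Parse simple key=value lines, ignoring blanks and comments."""
    cfg = {}
    for line in open(path):
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        cfg[key.strip()] = value.strip()
    return cfg

if __name__ == '__main__':
    cfg = read_cfg(sys.argv[1])          # e.g. ~/.hadoop-ec2/ec2-clusters.cfg
    zone = cfg.get('availability_zone')
    if zone != EBS_VOLUME_ZONE:
        print("WARNING: cluster zone %s != EBS volume zone %s" % (zone, EBS_VOLUME_ZONE))
    else:
        print("OK: cluster and EBS volume are both in %s" % zone)
{code}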

h4. Computing speed

h5. Task description

I exercised the Cloudera AMI package, requesting 1 master + 10 nodes. The task was to compute PageRank for a large set of interlinked pages. The abstract definition of the task is to iteratively find the solution of the matrix equation:

A*X=X

where A is a square matrix of dimension N equal to the # of wikipedia pages pointed to by any wikipedia page. X is the vector of the same dimension describing the ultimate weight of a given page (the PageRank value). The N of my problem was 1e6..1e7.
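For intuition only: the equation can be solved by power iteration - start from a uniform X and keep applying A until X stops changing. A toy serial sketch with a made-up 3-page link set (no damping factor, nothing Wikipedia-specific):

{code}
# pagerank_toy.py - toy power iteration for A*X = X (illustration only, no damping).
# links[i] = pages that page i points to; a made-up 3-page example.
links = {0: [1, 2], 1: [2], 2: [0]}
N = len(links)
X = [1.0 / N] * N                        # start from a uniform weight vector

for it in range(50):                     # iterate until X stops changing
    newX = [0.0] * N
    for page, outgoing in links.items():
        share = X[page] / len(outgoing)  # a page splits its weight over its links
        for target in outgoing:
            newX[target] += share
    delta = max(abs(a - b) for a, b in zip(newX, X))
    X = newX
    if delta < 1e-9:
        break

print("after %d iterations X = %s" % (it + 1, X))
{code}

The hadoop runs described below do the same multiply-and-sum, just split over mappers and reducers because the full link table does not fit comfortably on one machine.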

I was given a dump of all Wikipedia pages [HM5,6 |http://www.cs264.org/homework.html] in the format:

*<page><title>The Title</title><text>The page body</text></page>*, one line of text per page; the (human-typed) content was extremely non-homogeneous, multi-lingual, with many random characters and typos.
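A rough sketch of the kind of string processing the *init* step needs for each such line (the regexes and the [[target|label]] link convention are assumptions for illustration, not the actual code):

{code}
# parse_page.py - sketch of parsing one <page>...</page> line (illustration only;
# the regex and the [[target]] link convention are assumptions, not the code used).
import re

TITLE_RE = re.compile(r'<title>(.*?)</title>')
LINK_RE = re.compile(r'\[\[([^]|#]+)')      # take the target part of [[target|label]]

def parse_line(line):
    """Return (title, list of outgoing link targets) for one page line."""
    m = TITLE_RE.search(line)
    if not m:
        return None
    title = m.group(1).strip()
    targets = [t.strip() for t in LINK_RE.findall(line)]
    return title, targets

if __name__ == '__main__':
    sample = '<page><title>The Title</title><text>See [[Other Page]] and [[Foo|bar]].</text></page>'
    print(parse_line(sample))
{code}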

I wrote 4 python string-processing functions:
# *init* converting the input text to <key,value> format (my particular choice of the meaning)
# *mapp and reduce* functions, run as a pair over multiple iterations (a streaming-style sketch is below)
# *finish* function exporting the final list of pages ordered by page rank.
# I allocated the smallest (least expensive) CPUs at EC2: *ami=ami-6159bf08, instance_type=m1.small*

The goal was to perform all *ini + N_iter + fin* steps using 10 nodes & the hadoop framework.
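The iteration step maps naturally onto Hadoop Streaming, where the mapper and reducer are plain scripts reading tab-separated <key,value> lines on stdin. A hedged sketch of that pattern only - the record layout (page, rank, comma-separated outlinks) is an assumption, not the actual format used here:

{code}
# streaming_iter.py - sketch of one PageRank iteration as Hadoop Streaming mapper/reducer.
# Assumed record layout (not the actual format): page \t rank \t link1,link2,...
# usage: streaming_iter.py map   or   streaming_iter.py reduce
import sys

def mapper():
    for line in sys.stdin:
        page, rank, links = line.rstrip('\n').split('\t')
        targets = [t for t in links.split(',') if t]
        # re-emit the link structure so the reducer can rebuild the record
        print('%s\tLINKS\t%s' % (page, links))
        # give each outgoing link an equal share of this page's rank
        for t in targets:
            print('%s\tRANK\t%s' % (t, float(rank) / len(targets)))

def reducer():
    current, rank, links = None, 0.0, ''
    def flush():
        if current is not None:
            print('%s\t%s\t%s' % (current, rank, links))
    for line in sys.stdin:                 # streaming delivers keys already sorted
        key, tag, value = line.rstrip('\n').split('\t')
        if key != current:
            flush()
            current, rank, links = key, 0.0, ''
        if tag == 'LINKS':
            links = value
        else:
            rank += float(value)
    flush()

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()
{code}

With the streaming jar the two modes go in as the -mapper and -reducer commands of the job.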

h5. Test 1: execution of the full chain for *ini + 2 iter + fin*

using a ~10% sub-set of wikipedia pages (enwiki-20090929-one-page-per-line-part3)

* the unzipped file had a size of 2.2 GB ASCII, contained 1.1M lines (original pages) which pointed to 14M pages (outgoing links, including self-references, non-unique). After the 1st iteration the # of lines (pages which are pointed to by any of the originals) grew to 5M pages and stabilized.
* I brought the part3.gz file to the master node & unzipped it on the /mnt disk, which has enough space (took a few minutes)
* I stuck to the default choice of running 20 mappers and 10 reducers for every step (for a 10-node cluster)

*Timing results*
# copy local file to HDFS: ~2 minutes
# init: 410 sec
# mapp/reduce iter 0: 300 sec
# mapp/reduce iter 1: 180 sec
# finish: 190 sec

Total time was 20 minutes (120 + 410 + 300 + 180 + 190 sec = 1200 sec); 11 CPUs were involved.

h5. Test 2: execution of a single map/reduce step on 27M linked pages

using the full set of wikipedia pages (enwiki-20090929-one-page-per-line-part1+2+3). I made a minor modification to the map/reduce code which could slow it down by ~20%-30%.

...


* the unzipped file had a size of 21 GB ASCII, contained 9M lines (original pages) which pointed to 142M pages (outgoing links, including self-references, non-unique). After the 1st iteration (which I ran serially on a different machine) the # of lines (pages which are pointed to by any of the originals) grew to 27M pages.
* I brought the 1 GB output of iteration 1 to the master node & unzipped it on the /mnt disk (took 5 minutes for scp and 5 for unzip)
* I ran 20 mappers and 10 reducers for every step (for a 10-node cluster)

*Timing results*
# copy local file to HDFS: ~10 minutes. Hadoop decided to divide the data into 40 sets (and issued 40 mapp jobs)
# 3 mapp jobs finished after 8 minutes.
# 5 mapp jobs finished after 16 minutes.
# 16 mapp jobs finished after 29 minutes.
# all 40 mapp jobs finished after 42 minutes (one of the map jobs was restarted during this time)
# reduce failed for all 10 jobs after ~5 minutes, all 10 ~simultaneously
# hadoop tried twice to reissue the 10 sort + 10 reduce jobs and they failed again after another ~5 minutes

At this stage I killed the cluster. It was consuming 11 CPU/hour and I had no clue how to debug it. I suspect some internal memory (HDFS?) limit was not sufficient to hold the sort results after the mapp tasks. My estimate is that the 3 GB of unzipped input could grow by a factor of a few - maybe there is a 10 GB limit I should change (or pay extra for?).
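If the failure really is sort/spill space, one place to check is the disk behind Hadoop's mapred.local.dir setting (often on the local /mnt disk in EC2 setups). A minimal sketch for checking the remaining room there - the path below is a guess, the real value lives in mapred-site.xml:

{code}
# disk_check.py - sketch: see how much room is left where Hadoop spills map output.
# The path is an assumption; take the real one from mapred.local.dir in mapred-site.xml.
import os

SPILL_DIR = '/mnt/hadoop/mapred/local'   # hypothetical location of mapred.local.dir

st = os.statvfs(SPILL_DIR)
free_gb = st.f_bavail * st.f_frsize / 1e9
total_gb = st.f_blocks * st.f_frsize / 1e9
print("%s: %.1f GB free of %.1f GB" % (SPILL_DIR, free_gb, total_gb))
{code}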