Blog Archive

Friday, April 20, 2012

Information retrieval exercise

In order to understand this exercise you should know basics of indexing and MapReduce architecture  

Information retrieval exercise 4.3 from Introduction to Information Retrieval:

For n = 15 splits, r = 10 segments and j = 3 term partitions, how long would distributed index creation take for Reuters-RCV1 in a MapReduce architecture? Base your assumptions about cluster machines on Table 4.1.
Machine parameters

RCV1 collection

Solution:

We will be splitting by documents, so each split is roughly:
split_documents = 800000/15 = 53333 documents

Each split size is about:
Split_size =  53333documents x 200 token/document x 6 bytes/token = 63999600 bytes  ≈ 61 MB

MAP phase:

Time spent by a machine to read a split:
              Read_per_split = Split_size x (2 x 10-8 sec/byte) = 1.28 secs
Time spent to sort this split (algorithm complexity is O(nlog2n):
              Sort_time = split_documents  x 200 token/document x log2 (split_documents  x 200 token/document) x (10-8 sec/byte) = 2.49 secs
Time spent by a machine to write a split:
              Write_per_split = split_documents x 200 token/document x 4.5 x (2 x 10-8 sec/byte) = 0.96 secs -> note: here tokens are without spaces/punct. already

MAP phase is read+sort+write = 1.28 secs + 2.49 secs + 0.96 secs = 4.73 secs
There are 10 parser machines only since we have 10 segments. So to parse 15 splits we will need to do 2 passes of MAP.
Total MAP phase = 4.73 x 2 = 9.46 secs

REDUCE phase

Index is split into 3 term partitions, so each term partition will hold about 100000000/3 tokens(this is a rough assumption, in reality term partition size could vary). Each Inverter will need to read sort and write this amount of tokens.
             Term_ partition_size = 100000000/3 tokens x 4.5 bytes/tokens = 150000000 bytes ≈ 143 
             Time_reading = 150000000 bytes x (2 x 10-8 sec/byte) = 3 secs
             Time sorting = 100000000/3 tokens x log2 (100000000/3 tokens) x (10-8 sec/byte) = 8.33 secs
             Time_writing = 150000000 bytes x (2 x 10-8 sec/byte) = 3 secs

Total REDUCE phase =  3 + 8.33 + 3 = 14.33 secs

Total time of Distributed Index creation = 9.46 secs + 14.33 secs = 23.79 secs


No comments:

Post a Comment