In order to understand this exercise you should know basics of indexing and MapReduce architecture
Information retrieval exercise 4.3 from Introduction to Information Retrieval:
For n = 15 splits, r = 10 segments and j = 3 term partitions, how long would distributed index creation take for Reuters-RCV1 in a MapReduce architecture? Base your assumptions about cluster machines on Table 4.1.
![]() | ||
RCV1 collection |
Solution:
We will be splitting by documents, so each split is roughly:
split_documents = 800000/15 = 53333 documents
split_documents = 800000/15 = 53333 documents
Each split size is about:
Split_size = 53333documents x 200 token/document x 6 bytes/token = 63999600 bytes ≈ 61 MB
Read_per_split = Split_size x (2 x 10-8 sec/byte) = 1.28 secs
Time spent to sort this split (algorithm complexity is O(nlog2n):
Sort_time = split_documents x 200 token/document x log2 (split_documents x 200 token/document) x (10-8 sec/byte) = 2.49 secs
Time spent by a machine to write a split:
Write_per_split = split_documents x 200 token/document x 4.5 x (2 x 10-8 sec/byte) = 0.96 secs -> note: here tokens are without spaces/punct. already
MAP phase is read+sort+write = 1.28 secs + 2.49 secs + 0.96 secs = 4.73 secs
There are 10 parser machines only since we have 10 segments. So to parse 15 splits we will need to do 2 passes of MAP.
Total MAP phase = 4.73 x 2 = 9.46 secs
MAP phase:
Time spent by a machine to read a split:Read_per_split = Split_size x (2 x 10-8 sec/byte) = 1.28 secs
Time spent to sort this split (algorithm complexity is O(nlog2n):
Sort_time = split_documents x 200 token/document x log2 (split_documents x 200 token/document) x (10-8 sec/byte) = 2.49 secs
Time spent by a machine to write a split:
Write_per_split = split_documents x 200 token/document x 4.5 x (2 x 10-8 sec/byte) = 0.96 secs -> note: here tokens are without spaces/punct. already
MAP phase is read+sort+write = 1.28 secs + 2.49 secs + 0.96 secs = 4.73 secs
There are 10 parser machines only since we have 10 segments. So to parse 15 splits we will need to do 2 passes of MAP.
Total MAP phase = 4.73 x 2 = 9.46 secs
REDUCE phase
Index is split into 3 term partitions, so each term partition will hold about 100000000/3 tokens(this is a rough assumption, in reality term partition size could vary). Each Inverter will need to read sort and write this amount of tokens.
Term_
partition_size = 100000000/3 tokens x 4.5 bytes/tokens = 150000000 bytes ≈ 143
Time_reading = 150000000 bytes x (2 x 10-8 sec/byte) = 3 secs
Time sorting = 100000000/3 tokens x log2 (100000000/3 tokens) x (10-8 sec/byte) = 8.33 secs
Time_writing = 150000000 bytes x (2 x 10-8 sec/byte) = 3 secs
Total REDUCE phase = 3 + 8.33 + 3 = 14.33 secs
No comments:
Post a Comment