java - Bad Performance for Dedupe of 2 Million Records Using MapReduce on App Engine


I have 2 million records, each with 4 string fields that need to be checked for duplicates. More specifically, the fields are name, phone, address, and fathername; I must check for dupes using these fields against the rest of the data. The resulting unique records need to be written to the DB.

I have been able to implement MapReduce and iterate over all records. The task rate is set to 100/s and the bucket size to 100. Billing is enabled.

Currently it works, but the performance is very slow. I have only been able to complete dedupe processing of 1,000 records from a test dataset of 10,000 records in 6 hours.

The current design in Java is:

  1. In every map iteration, compare the current record with the previous record.
  2. The previous record is a single record in the DB that acts as a global variable; it is overwritten with the current record in each map iteration.
  3. The comparison is done using an algorithm, and the result is written as a new entity to the DB.
  4. At the end of one MapReduce job, programmatically create the next job.
  5. The previous-record variable lets the next job compare the next candidate record with the rest of the data.
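A side note on the design above: comparing each record only with the single previous record finds a duplicate only when the two copies happen to be adjacent in iteration order, which is why repeated jobs are needed. A key-based check finds duplicates anywhere in the dataset in a single pass. Here is a minimal self-contained sketch in plain Java (no App Engine APIs; the `Person` record, the field names, and the `|` separator are illustrative assumptions):

```java
import java.util.*;

public class DedupeSketch {
    // Illustrative record holding the four match fields from the question.
    record Person(String name, String phone, String address, String fatherName) {}

    // Build a composite key from the four fields; '|' as a separator is an assumption.
    static String dedupeKey(Person p) {
        return String.join("|", p.name(), p.phone(), p.address(), p.fatherName());
    }

    // Keep the first occurrence of each key; later duplicates are dropped.
    static List<Person> dedupe(List<Person> input) {
        Set<String> seen = new HashSet<>();
        List<Person> unique = new ArrayList<>();
        for (Person p : input) {
            if (seen.add(dedupeKey(p))) {  // add() returns false if the key was already seen
                unique.add(p);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Person> data = List.of(
            new Person("Ann", "555-0100", "1 Main St", "Bob"),
            new Person("Cy",  "555-0101", "2 Oak Ave", "Dan"),
            new Person("Ann", "555-0100", "1 Main St", "Bob"));  // duplicate of the first
        System.out.println(dedupe(data).size());  // prints 2
    }
}
```

An in-memory `HashSet` obviously won't hold 20 million keys on a small App Engine instance, but the same keyed lookup can be backed by the datastore, which is essentially what the answer below proposes.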

I am ready to increase the amount of GAE resources to achieve this in the shortest time.

My questions are:

  1. Will the accuracy of the dedupe (checking for duplicates) be affected by parallel jobs/tasks?
  2. How can this design be improved?
  3. Will it scale to 20 million records?
  4. What's the fastest way to read/write variables (not counters) during a map iteration so they can be used across one MapReduce job?

Freelancers are welcome to assist with this.

Thanks for the help.

I see 2 ways to approach the problem:

  1. (If you only need this once) App Engine creates a property index for every property in an entity (unless you ask it not to). Create a backend, run a query like "select * order by …" in batches using cursors, determine the duplicated properties, and fix/delete those records. You might be able to parallelize this, but it's tricky on shard boundaries and you'd have to write the code yourself.
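The shard-boundary issue in option 1 is that a duplicate pair can straddle two batches, so the last record of one batch must be carried into the next. Real code would use the datastore `Query`/`Cursor` API; this plain-Java sketch just simulates that carry-over on a sorted list (the `batchSize` and the integer standing in for a cursor are illustrative assumptions):

```java
import java.util.*;

public class CursorBatchSketch {
    // Count duplicates in a sorted list by scanning it in fixed-size batches,
    // carrying the last record of each batch over to the next one
    // (the tricky boundary case mentioned in option 1).
    static int countDuplicatesInBatches(List<String> sorted, int batchSize) {
        int duplicates = 0;
        String previous = null;  // stands in for the "previous record" state
        int cursor = 0;          // stands in for a datastore query cursor
        while (cursor < sorted.size()) {
            int end = Math.min(cursor + batchSize, sorted.size());
            for (String current : sorted.subList(cursor, end)) {
                if (current.equals(previous)) duplicates++;
                previous = current;  // carried across batch boundaries
            }
            cursor = end;  // "resume from cursor" for the next batch
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("a", "a", "b", "c", "c", "c");
        // The second "a" and both repeated "c"s count as duplicates.
        System.out.println(countDuplicatesInBatches(sorted, 2));  // prints 3
    }
}
```

Note that this only works because the ordered query brings equal values next to each other; without the `ORDER BY` the adjacent comparison misses duplicates, which is exactly the weakness of the original previous-record design.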

  2. You can use the mapper framework. It's slower, but it runs in parallel, and this approach also lets you efficiently dedupe data on insert. Introduce a new entity kind to hold each unique property value, e.g. "UniquePhoneNumber". This entity should hold the phone number as its key and a reference to the entity with that phone number. Run a map and look up the UniquePhoneNumber. If it's found and the reference is valid, delete the duplicate. If not, create a new one with the correct reference. This way it's also possible to repoint the reference to the other entity if you need to. Make sure you read the UniquePhoneNumber and create/update the new one inside a single transaction; otherwise duplicates won't be detected.
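The essential point in option 2 is that the lookup and the creation of the UniquePhoneNumber entity must be atomic, which is what the datastore transaction provides; with parallel mappers, a non-atomic check-then-create lets two workers both see "not found" and both insert. The same claim-the-key semantics can be sketched in plain Java with `ConcurrentMap.putIfAbsent` (the class name, field names, and return convention here are illustrative, not App Engine API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class UniqueKeySketch {
    // Stand-in for the "UniquePhoneNumber" entity kind: phone number -> record id.
    // putIfAbsent plays the role of the datastore transaction: only one caller
    // can claim a given phone number, even when mappers run in parallel.
    private final ConcurrentMap<String, String> uniquePhones = new ConcurrentHashMap<>();

    // Returns null if this record claimed the phone number (it is unique),
    // or the id of the existing record if the number was already claimed
    // (i.e. this record is a duplicate).
    String claim(String phone, String recordId) {
        return uniquePhones.putIfAbsent(phone, recordId);
    }

    public static void main(String[] args) {
        UniqueKeySketch sketch = new UniqueKeySketch();
        System.out.println(sketch.claim("555-0100", "rec-1"));  // prints null (unique)
        System.out.println(sketch.claim("555-0100", "rec-2"));  // prints rec-1 (duplicate)
    }
}
```

This also answers question 1 above: parallel tasks do not hurt accuracy as long as each check-and-create is atomic; it is the shared "previous record" global variable in the original design that is unsafe under parallelism.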

