java - Bad Performance for Dedupe of 2 Million Records Using MapReduce on App Engine


I have 2 million records, each with 4 string fields that need to be checked for duplicates. More specifically, the fields are name, phone, address, and fathername; I must check for dupes using these fields against the rest of the data. The resulting unique records need to be written to the DB.

I have been able to implement MapReduce and iterate over all records. The task rate is set to 100/s and the bucket size to 100. Billing is enabled.

Currently it works, but the performance is very slow. I have only been able to complete dedupe processing of 1,000 records from a test dataset of 10,000 records in 6 hours.

The current design in Java is:

  1. In every map iteration, compare the current record with the previous record.
  2. The previous record is a single record in the DB that acts as a global variable; it is overwritten with the current record in each map iteration.
  3. The comparison is done using an algorithm, and the result is written as a new entity to the DB.
  4. At the end of one MapReduce job, programmatically create the next job.
  5. The previous-record variable lets the next job compare the next candidate record with the rest of the data.
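A side note on the design above: comparing each record only with the single previous record finds a duplicate only when the two copies happen to be adjacent in iteration order, which is why repeated jobs are needed. A key-based check finds duplicates anywhere in the dataset in a single pass. Here is a minimal self-contained sketch in plain Java (no App Engine APIs; the `Person` record, the field names, and the `|` separator are illustrative assumptions):

```java
import java.util.*;

public class DedupeSketch {
    // Illustrative record holding the four match fields from the question.
    record Person(String name, String phone, String address, String fatherName) {}

    // Build a composite key from the four fields; '|' as a separator is an assumption.
    static String dedupeKey(Person p) {
        return String.join("|", p.name(), p.phone(), p.address(), p.fatherName());
    }

    // Keep the first occurrence of each key; later duplicates are dropped.
    static List<Person> dedupe(List<Person> input) {
        Set<String> seen = new HashSet<>();
        List<Person> unique = new ArrayList<>();
        for (Person p : input) {
            if (seen.add(dedupeKey(p))) {  // add() returns false if the key was already seen
                unique.add(p);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Person> data = List.of(
            new Person("Ann", "555-0100", "1 Main St", "Bob"),
            new Person("Cy",  "555-0101", "2 Oak Ave", "Dan"),
            new Person("Ann", "555-0100", "1 Main St", "Bob"));  // duplicate of the first
        System.out.println(dedupe(data).size());  // prints 2
    }
}
```

An in-memory `HashSet` obviously won't hold 20 million keys on a small App Engine instance, but the same keyed lookup can be backed by the datastore, which is essentially what the answer below proposes.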

I am ready to increase the amount of GAE resources to achieve this in the shortest time.

My questions are:

  1. Will the accuracy of the dedupe (checking for duplicates) be affected by parallel jobs/tasks?
  2. How can this design be improved?
  3. Will it scale to 20 million records?
  4. What's the fastest way to read/write variables (not counters) during a map iteration so they can be used across one MapReduce job?

Freelancers are welcome to assist with this.

Thanks for the help.

I see 2 ways to approach the problem:

  1. (If you only need this once) App Engine creates a property index for every property in an entity (unless you ask it not to). Create a backend, run a query like "select * order by …" in batches using cursors, determine the duplicated properties, and fix/delete those records. You might be able to parallelize this, but it's tricky on shard boundaries and you'd have to write the code yourself.
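The shard-boundary issue in option 1 is that a duplicate pair can straddle two batches, so the last record of one batch must be carried into the next. Real code would use the datastore `Query`/`Cursor` API; this plain-Java sketch just simulates that carry-over on a sorted list (the `batchSize` and the integer standing in for a cursor are illustrative assumptions):

```java
import java.util.*;

public class CursorBatchSketch {
    // Count duplicates in a sorted list by scanning it in fixed-size batches,
    // carrying the last record of each batch over to the next one
    // (the tricky boundary case mentioned in option 1).
    static int countDuplicatesInBatches(List<String> sorted, int batchSize) {
        int duplicates = 0;
        String previous = null;  // stands in for the "previous record" state
        int cursor = 0;          // stands in for a datastore query cursor
        while (cursor < sorted.size()) {
            int end = Math.min(cursor + batchSize, sorted.size());
            for (String current : sorted.subList(cursor, end)) {
                if (current.equals(previous)) duplicates++;
                previous = current;  // carried across batch boundaries
            }
            cursor = end;  // "resume from cursor" for the next batch
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("a", "a", "b", "c", "c", "c");
        // The second "a" and both repeated "c"s count as duplicates.
        System.out.println(countDuplicatesInBatches(sorted, 2));  // prints 3
    }
}
```

Note that this only works because the ordered query brings equal values next to each other; without the `ORDER BY` the adjacent comparison misses duplicates, which is exactly the weakness of the original previous-record design.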

  2. You can use the mapper framework. It's slower, but it runs in parallel, and this approach also lets you efficiently dedupe data on insert. Introduce a new entity kind to hold each unique property value, e.g. "UniquePhoneNumber". This entity should hold the phone number as its key and a reference to the entity with that phone number. Run a map and look up the UniquePhoneNumber. If it's found and the reference is valid, delete the duplicate. If not, create a new one with the correct reference. This way it's also possible to repoint the reference to the other entity if you need to. Make sure you read the UniquePhoneNumber and create/update the new one inside a single transaction; otherwise duplicates won't be detected.
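The essential point in option 2 is that the lookup and the creation of the UniquePhoneNumber entity must be atomic, which is what the datastore transaction provides; with parallel mappers, a non-atomic check-then-create lets two workers both see "not found" and both insert. The same claim-the-key semantics can be sketched in plain Java with `ConcurrentMap.putIfAbsent` (the class name, field names, and return convention here are illustrative, not App Engine API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class UniqueKeySketch {
    // Stand-in for the "UniquePhoneNumber" entity kind: phone number -> record id.
    // putIfAbsent plays the role of the datastore transaction: only one caller
    // can claim a given phone number, even when mappers run in parallel.
    private final ConcurrentMap<String, String> uniquePhones = new ConcurrentHashMap<>();

    // Returns null if this record claimed the phone number (it is unique),
    // or the id of the existing record if the number was already claimed
    // (i.e. this record is a duplicate).
    String claim(String phone, String recordId) {
        return uniquePhones.putIfAbsent(phone, recordId);
    }

    public static void main(String[] args) {
        UniqueKeySketch sketch = new UniqueKeySketch();
        System.out.println(sketch.claim("555-0100", "rec-1"));  // prints null (unique)
        System.out.println(sketch.claim("555-0100", "rec-2"));  // prints rec-1 (duplicate)
    }
}
```

This also answers question 1 above: parallel tasks do not hurt accuracy as long as each check-and-create is atomic; it is the shared "previous record" global variable in the original design that is unsafe under parallelism.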

