python - NLTK - when to normalize the text?


I've finished gathering my data and plan to use it as a corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag and chunk the corpus in the future. Some of NLTK's corpora are lowercase and others aren't.

Can anyone shed some light on the subject, please?

By "normalize", do you mean making everything lowercase?

The decision about whether to lowercase depends on what you plan to do. For some purposes, lowercasing is better because it lowers the sparsity of the data: capitalized forms are rarer and might confuse the system unless you have a corpus massive enough that the statistics on capitalized words are decent. In other tasks, case information might be valuable.
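To make the sparsity point concrete, here is a minimal sketch that counts distinct word types before and after lowercasing. The token list is a made-up toy example, not data from any NLTK corpus.

```python
from collections import Counter

# Toy token list, invented for illustration only.
tokens = ["The", "cat", "saw", "the", "dog", ".", "The", "dog", "ran", "."]

# Frequency table with case preserved: "The" and "the" are separate types.
cased = Counter(tokens)

# Frequency table after lowercasing: both forms collapse into "the".
lowered = Counter(t.lower() for t in tokens)

print("distinct types, case kept:", len(cased))
print("distinct types, lowercased:", len(lowered))
print("count of 'the' after lowercasing:", lowered["the"])
```

With case kept, the counts for "The" and "the" are split across two entries; after lowercasing they are pooled, which is the effect described above. The price is that a model can no longer see the capitalization.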

Additionally, there are other, similar considerations you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about "7-year-old"? Is it one long word, or three words that should be separated?
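The three treatments of "can't" can be reproduced with a few lines of standard-library Python. This is only a rough sketch: the regular expressions are my own illustrations of each convention, not the rules used by any particular corpus or by NLTK's tokenizers.

```python
import re

text = "can't"

# Option 1: leave the contraction as a single token.
whole = text.split()

# Option 2: split at the apostrophe, keeping it with the suffix.
at_apostrophe = re.findall(r"'\w+|\w+", text)

# Option 3: split off "n't" as its own token.
before_nt = re.sub(r"(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE).split()

print(whole, at_apostrophe, before_nt)

# The same kind of choice for a hyphenated compound.
compound = "7-year-old"
one_word = [compound]
three_words = compound.split("-")

print(one_word, three_words)
```

Whichever convention you pick, the important thing is to apply the same one to your corpus and to any pre-trained tagger or chunker you use on it, since those tools expect the tokenization they were trained on.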

That said, there's no reason to reformat the corpus itself. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
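One way to normalize on the fly is a small generator that wraps whatever yields your tokens. The sketch below is a minimal illustration; `normalized` and its `lowercase` parameter are hypothetical names, and the token list stands in for your real corpus (with NLTK you could pass in the result of a corpus reader's `words()` method instead).

```python
def normalized(tokens, lowercase=True):
    """Yield tokens with normalization applied, without modifying the source."""
    for tok in tokens:
        yield tok.lower() if lowercase else tok


# Stand-in for the stored corpus; the original casing is kept here.
corpus_tokens = ["NLTK", "is", "a", "Python", "library", "."]

# Normalized view, computed only when needed.
lowered = list(normalized(corpus_tokens))

# Original view, for tasks where case information matters.
original = list(normalized(corpus_tokens, lowercase=False))

print(lowered)
print(original)
```

Because the stored tokens are never changed, you can use the lowercased view for one experiment and the cased view for another without keeping two copies of the corpus.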

