python - NLTK - when to normalize the text?
I've finished gathering the data I plan to use for my corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag and chunk the corpus in the future. Some of NLTK's corpora are all lower case and others aren't. Can anyone shed some light on this subject, please?

By "normalize", do you just mean making everything lowercase? The decision about whether to lowercase depends on what you plan to do. For some purposes, lowercasing is better because it lowers the sparsity of the data (uppercase words are rarer and might confuse the system unless you have a massive corpus, such that the statistics on capitalized words are decent). In other tasks, case information might be valuable.

Additionally, there are other, similar considerations you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about 7-year-old? Is it one long word, or three words that should be separated?

That said, there's no reason to reformat the corpus. You can have your code make these changes on the fly....
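To make the "on the fly" idea concrete, here is a minimal sketch using only the standard library. The function names (`normalize`, `split_contraction`, `split_hyphenated`) and the flags are hypothetical, made up for illustration, and are not part of NLTK. The contraction split follows the ["ca", "n't"] convention mentioned above, and that is just one of the three options. The point is only that the stored tokens stay untouched and each task picks its own view of them.

```python
import re
from typing import Iterable, Iterator, List


def split_contraction(token: str) -> List[str]:
    """Split "can't" into ["ca", "n't"]; return other tokens unchanged.

    Hypothetical helper: this is only one of several conventions.
    """
    match = re.fullmatch(r"(\w+)(n't)", token, flags=re.IGNORECASE)
    return [match.group(1), match.group(2)] if match else [token]


def split_hyphenated(token: str) -> List[str]:
    """Split "7-year-old" into ["7", "year", "old"]."""
    parts = [part for part in token.split("-") if part]
    return parts or [token]  # keep a bare "-" as it is


def normalize(
    tokens: Iterable[str],
    lower: bool = True,
    split_contractions: bool = False,
    split_hyphens: bool = False,
) -> Iterator[str]:
    """Yield a normalized view of tokens without modifying the input."""
    for token in tokens:
        pieces = [token]
        if split_hyphens:
            pieces = [p for piece in pieces for p in split_hyphenated(piece)]
        if split_contractions:
            pieces = [p for piece in pieces for p in split_contraction(piece)]
        for piece in pieces:
            yield piece.lower() if lower else piece


# The stored corpus keeps its original case and tokenization.
raw = ["The", "7-year-old", "Can't", "read", "NLTK", "docs"]

# A lowercased view, e.g. for counting word frequencies.
lowered = list(normalize(raw))

# A case-preserving view with finer tokens, e.g. as input to a tagger.
fine = list(normalize(raw, lower=False, split_contractions=True, split_hyphens=True))

print(lowered)
print(fine)
```

With an NLTK corpus reader, the same generator could presumably be fed the result of the reader's `words()` method instead of the hand-written `raw` list. Either way the files on disk never change, so you can revisit the lowercasing and tokenization choices later.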