Huffman encoding: how to write binary data in Python
I have tried methods using the struct module, as shown by the lines commented out in my code, but it didn't work out. I have two options: I can either write the binary data code by code (my codes are sequences of bits with lengths varying from 3 to 13 bits), or convert the whole string of n characters (n=25000+ in my case) to binary data. I don't know how to implement either method. Code:
import heapq
import binascii
import struct

def createfrequencytuplelist(inputfile):
    frequencydic = {}
    intputfile = open(inputfile, 'r')
    for line in intputfile:
        for char in line:
            if char in frequencydic.keys():
                frequencydic[char] += 1
            else:
                frequencydic[char] = 1
    intputfile.close()
    tuplelist = []
    for mykey in frequencydic:
        tuplelist.append((frequencydic[mykey], mykey))
    return tuplelist

def createhuffmantree(frequencylist):
    heapq.heapify(frequencylist)
    n = len(frequencylist)
    for i in range(1, n):
        left = heapq.heappop(frequencylist)
        right = heapq.heappop(frequencylist)
        newnode = (left[0] + right[0], left, right)
        heapq.heappush(frequencylist, newnode)
    return frequencylist[0]

def printhuffmantree(mytree, somecode, prefix=''):
    if len(mytree) == 2:
        somecode.append((mytree[1] + "@" + prefix))
    else:
        printhuffmantree(mytree[1], somecode, prefix + '0')
        printhuffmantree(mytree[2], somecode, prefix + '1')

def parsecode(char, mycode):
    for k in mycode:
        if char == k[0]:
            return k[2:]

if __name__ == '__main__':
    mylist = createfrequencytuplelist('input')
    myhtree = createhuffmantree(mylist)
    mycode = []
    printhuffmantree(myhtree, mycode)
    inputfile = open('input', 'r')
    outputfile = open('encoded_file2', "w+b")
    asciistring = ''
    n = 0
    for line in inputfile:
        for char in line:
            #outputfile.write(parsecode(char, mycode))
            asciistring += parsecode(char, mycode)
            n += len(parsecode(char, mycode))
    #values = asciistring
    #print n
    #s = struct.Struct('25216s')
    #packed_data = s.pack(values)
    #print packed_data
    inputfile.close()
    #outputfile.write(packed_data)
    outputfile.close()
You're looking for this:
packed_data = ''.join(chr(int(asciistring[i:i+8], 2)) for i in range(0, len(asciistring), 8))
It will take 8 bits at a time from asciistring, interpret them as an integer, and output the corresponding byte.
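For readers on Python 3, the same idea can be sketched with bytes() instead of chr(). This is only an illustration under that assumption: pack_bits is a hypothetical helper name, not part of the original answer, and the one-liner above targets Python 2 strings.

```python
def pack_bits(bitstring):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte.

    Assumes Python 3, where bytes() accepts an iterable of ints in 0..255.
    The final chunk may be shorter than 8 characters.
    """
    return bytes(
        int(bitstring[i:i + 8], 2)            # parse up to 8 characters as base 2
        for i in range(0, len(bitstring), 8)  # step through the string 8 bits at a time
    )


packed = pack_bits("0100100001101001")  # two full bytes: 0x48 and 0x69
print(packed)
```

The result can be written directly to a file opened in binary mode, for example with outputfile.write(packed).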
Your problem here is that this requires the length of asciistring to be a multiple of 8 bits to work correctly. If it is not, you'll insert zero bits before the last few real bits.
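A tiny worked example may make the padding issue concrete. The variable names here are made up for illustration; the point is only that a short final chunk and its zero-padded form produce the same byte, so the reader cannot tell them apart without extra information.

```python
tail = "101"                      # a final chunk holding only 3 real bits
value = int(tail, 2)              # parsed as the integer 5
as_byte = format(value, "08b")    # the byte that ends up on disk, as a bit string

# Three different tails that all collapse onto the same stored byte.
collisions = [int(bits, 2) for bits in ("101", "00101", "00000101")]

print(as_byte)
print(collisions)
```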
So you need to store the number of bits in the last byte somewhere, so that you know to ignore the padding bits when you read them back, instead of interpreting them as zeros. You could try:
packed_data = chr(len(asciistring) % 8) + packed_data
Then, when you read it back:
from itertools import chain

packed_input = coded_file.read()
# The first byte is the bit count of the last byte; ord() turns it back into an int.
last_byte_length, packed_input, last_byte = (ord(packed_input[0]), packed_input[1:-1], packed_input[-1])
if not last_byte_length:
    last_byte_length = 8
ascii_input = ''.join(chain((bin(ord(byte))[2:].zfill(8) for byte in packed_input),
                            (bin(ord(last_byte))[2:].zfill(last_byte_length),)))
# or
# ascii_input = ''.join(chain(('{0:0=8b}'.format(ord(byte)) for byte in packed_input),
#                             (('{0:0=' + str(last_byte_length) + 'b}').format(ord(last_byte)),)))
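Putting the write and read sides together, a round trip might look roughly like the sketch below. It assumes Python 3 (indexing bytes yields ints, so no ord() is needed), and pack_with_header and unpack_with_header are hypothetical names chosen for this illustration, not functions from the answer.

```python
def pack_with_header(bitstring):
    """Return one header byte (bit count of the last byte, 0 meaning 8) plus the packed bits."""
    header = len(bitstring) % 8
    body = bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))
    return bytes([header]) + body


def unpack_with_header(data):
    """Rebuild the original '0'/'1' string from the output of pack_with_header."""
    last_len = data[0] or 8          # a header of 0 means the last byte is full
    body = data[1:]
    if not body:                     # nothing was encoded
        return ""
    full = "".join(format(b, "08b") for b in body[:-1])     # every byte except the last
    return full + format(body[-1], "0{}b".format(last_len))  # last byte, real bits only


original = "0000000111"              # 10 bits: one full byte plus 2 trailing bits
restored = unpack_with_header(pack_with_header(original))
print(restored == original)
```

Storing the count in a leading byte costs one extra byte per file, which is usually negligible next to the encoded data.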
Edit: You either need to strip the '0b' prefix from the strings returned by bin() or, on 2.6 or newer, preferably use the alternate version I added, which uses string formatting instead of bin(), slicing, and zfill().
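To compare the two approaches side by side, here is a small sketch. The values are arbitrary, and the nested `{1}` width field is just one way to pass a variable width to str.format.

```python
value = 5

# bin() returns '0b101', so the prefix is sliced off and the result padded by hand.
via_bin = bin(value)[2:].zfill(8)

# String formatting does the base-2 conversion and zero padding in one step.
via_format = "{0:08b}".format(value)

# A variable width (such as the bit count of the last byte) can be nested in the spec.
width = 3
via_format_width = "{0:0{1}b}".format(value, width)

print(via_bin, via_format, via_format_width)
```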
Edit: Thanks to eryksun, it's good to use chain to avoid making a copy of the ASCII string. Also, you need to call ord(byte) in the bin() version.
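As a rough illustration of the chain suggestion, the sketch below joins a generator over the full bytes with a one-element tuple holding the last byte's bits, so no intermediate concatenated string is built. It assumes Python 3 and uses made-up sample values; note the trailing comma that makes the second argument a tuple rather than a bare string.

```python
from itertools import chain

full_bytes = [72, 105]           # sample values: every byte except the last
last_byte, last_len = 5, 3       # the last byte carries only 3 real bits

bits = "".join(chain(
    (format(b, "08b") for b in full_bytes),           # lazily yields 8-bit strings
    (format(last_byte, "0{}b".format(last_len)),),    # one-element tuple for the tail
))

print(bits)
```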