java - InputStream handled by different objects depending on the content
I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.
Unfortunately, checking the content type using the method provided in URLConnection doesn't always work. For example,
    String contentType = url.openConnection().getContentType();
doesn't always provide the correct content type (e.g. "text/html" where it should be RSS), or doesn't allow me to distinguish between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem now is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design I have written a separate class that first determines the correct content type. Next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when the method 'parse()' is called, downloads the entire InputStream a second time.
    public Parser createParser() {
        InputStream inputStream = null;
        String contentType = null;
        String contentEncoding = null;
        ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
        Parser parser = null;
        try {
            // First download: only used to detect the content type and encoding.
            inputStream = new BufferedInputStream(this.url.openStream());
            contentTypeParser.parse(inputStream);
            contentType = contentTypeParser.getContentType();
            contentEncoding = contentTypeParser.getContentEncoding();
            assert (contentType != null);
            // Second download: this is the call I would like to get rid of.
            inputStream = new BufferedInputStream(this.url.openStream());
            if (contentType.equals(ContentTypes.RSS)) {
                logger.info("RSS feed detected");
                parser = new RssParser(this.url);
            } else if (contentType.equals(ContentTypes.ATOM)) {
                logger.info("Atom feed detected");
                parser = new AtomParser(this.url);
            } else if (contentType.equals(ContentTypes.HTML)) {
                logger.info("HTML detected");
                parser = new HtmlParser(this.url);
                parser.setContentEncoding(contentEncoding);
            } else if (contentType.equals(ContentTypes.UNKNOWN)) {
                logger.debug("Unable to recognize content type");
            }
            // Parse exactly once, whichever parser was selected.
            if (parser != null) {
                parser.parse(inputStream);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // openStream() may have failed, leaving the stream null.
                if (inputStream != null) {
                    inputStream.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return parser;
    }
Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".
Any help is appreciated!
Side note 1: For the sake of completeness, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but it returns null way too often.
Side note 2: The XML parsers (Atom and RSS) are based on SAXParser, the HTML parser on Jsoup.
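To illustrate side note 1, here is a minimal sketch of how guessContentTypeFromStream can be probed in isolation. The class name GuessSketch, the helper guess() and the two sample documents are made up for this example and are not part of the crawler. The method only inspects the first few bytes of the stream, so two feeds that start with the same XML prolog necessarily get the same answer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class GuessSketch {

    // Hypothetical helper: wraps a document in a mark-capable stream and asks the JDK for its guess.
    public static String guess(String document) {
        byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(bytes))) {
            // The stream must support mark/reset; BufferedInputStream does.
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample documents that share the same XML prolog.
        String prolog = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
        String rss = prolog + "<rss version=\"2.0\"><channel></channel></rss>";
        String atom = prolog + "<feed xmlns=\"http://www.w3.org/2005/Atom\"></feed>";

        // Both guesses are based on the identical leading bytes.
        System.out.println("rss  -> " + guess(rss));
        System.out.println("atom -> " + guess(atom));
    }
}
```

Since the guess never reaches the root element, it cannot separate RSS from Atom, which is why the root element itself has to be inspected.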
Can you call mark and reset?
    inputStream = new BufferedInputStream(this.url.openStream());
    inputStream.mark(2048); // or some other sensible number
    contentTypeParser.parse(inputStream);
    contentType = contentTypeParser.getContentType();
    contentEncoding = contentTypeParser.getContentEncoding();
    inputStream.reset();
    // let the parser have a crack at it
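For anyone who wants to try the idea without a network connection, here is a self-contained sketch of the same mark/reset pattern on an in-memory stream. The class MarkResetSketch, the SNIFF_LIMIT constant and the crude substring checks are assumptions made for illustration; they stand in for the asker's ContentTypeParser, whose internals are not shown.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MarkResetSketch {

    // Assumed upper bound on how many bytes the detection step may consume.
    static final int SNIFF_LIMIT = 2048;

    // Peeks at the head of the stream, then rewinds it so the real parser sees every byte.
    public static String sniff(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(SNIFF_LIMIT);
            byte[] head = new byte[SNIFF_LIMIT];
            int filled = 0;
            int n;
            // read() may return fewer bytes than requested, so loop until full or end of stream.
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
            in.reset(); // rewind to the marked position

            // Crude stand-in heuristic: look for a telltale root element near the start.
            String text = new String(head, 0, filled, StandardCharsets.ISO_8859_1)
                    .toLowerCase(Locale.ROOT);
            if (text.contains("<rss")) return "rss";
            if (text.contains("<feed")) return "atom";
            if (text.contains("<html")) return "html";
            return "unknown";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the stream; plays the role of the real parser in this sketch.
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
        String kind = sniff(in);   // consumes, then rewinds
        String body = readAll(in); // still starts at the first byte
        System.out.println(kind + ", body intact: " + doc.equals(body));
    }
}
```

One caveat: the mark is only guaranteed to survive as long as the detection step reads no more than the limit passed to mark(). If the detector reads further (a SAX parser typically reads in large chunks) or closes the stream, reset() may throw an IOException, so the limit needs to be chosen with the detector in mind.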