java - InputStream handled by different objects depending on the content

- August 15, 2013

I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.

Unfortunately, checking the content type with the method provided by URLConnection doesn't always work. For example,

    String contentType = url.openConnection().getContentType();

either doesn't provide the correct content type (e.g. "text/html" where it should be RSS) or doesn't let me distinguish between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The trouble is that I can't come up with an elegant class design in which the InputStream is downloaded only once. In my current design I have written a separate class that first determines the correct content type. Next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when its parse() method is called, downloads the entire InputStream a second time.

    public Parser createParser() {

        InputStream inputStream = null;
        String contentType = null;
        String contentEncoding = null;

        ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
        Parser parser = null;

        try {
            // First download: only used to detect content type and encoding.
            inputStream = new BufferedInputStream(this.url.openStream());
            contentTypeParser.parse(inputStream);
            contentType = contentTypeParser.getContentType();
            contentEncoding = contentTypeParser.getContentEncoding();

            assert (contentType != null);

            // Second download: this is the one I would like to get rid of.
            inputStream.close();
            inputStream = new BufferedInputStream(this.url.openStream());

            if (contentType.equals(ContentTypes.RSS))
            {
                logger.info("rss feed detected");
                parser = new RssParser(this.url);
            }
            else if (contentType.equals(ContentTypes.ATOM))
            {
                logger.info("atom feed detected");
                parser = new AtomParser(this.url);
            }
            else if (contentType.equals(ContentTypes.HTML))
            {
                logger.info("html detected");
                parser = new HtmlParser(this.url);
                parser.setContentEncoding(contentEncoding);
            }
            else if (contentType.equals(ContentTypes.UNKNOWN))
                logger.debug("unable to recognize content type");

            // Parse exactly once, whichever parser was selected.
            if (parser != null)
                parser.parse(inputStream);

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // openStream() may have failed, leaving the stream null.
                if (inputStream != null)
                    inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        return parser;
    }

Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".

Any help is appreciated!

Side note 1: For the sake of completeness, I also tried the URLConnection.guessContentTypeFromStream(inputStream) method, but it returns null way too often.

Side note 2: The XML parsers (Atom and RSS) are based on SAXParser, the HTML parser on Jsoup.
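
The "clues in the InputStream" idea from the question could be sketched roughly as follows. This is only an illustration under assumptions, not the ContentTypeParser from the question: the class FeedSniffer, the enum Kind, and the method classify are hypothetical names, and the marker strings are a heuristic (RSS 2.0 documents have an <rss> root element, RSS 1.0 uses <rdf:RDF>, Atom uses <feed>). It works on the first chunk of the document decoded as text.

```java
import java.util.Locale;

// Hypothetical helper: guesses the document kind from its first characters.
class FeedSniffer {

    enum Kind { RSS, ATOM, HTML, UNKNOWN }

    // 'head' is the beginning of the document, e.g. the first 2 KB as text.
    static Kind classify(String head) {
        String s = head.toLowerCase(Locale.ROOT);
        // Feed markers are checked first: a heuristic, not a real XML parse.
        if (s.contains("<rss") || s.contains("<rdf:rdf")) {
            return Kind.RSS;
        }
        if (s.contains("<feed")) {
            return Kind.ATOM;
        }
        if (s.contains("<html") || s.contains("<!doctype html")) {
            return Kind.HTML;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("<?xml version=\"1.0\"?><rss version=\"2.0\">"));
        System.out.println(classify("<?xml version=\"1.0\"?><feed xmlns=\"http://www.w3.org/2005/Atom\">"));
        System.out.println(classify("<!DOCTYPE html><html><head>"));
        System.out.println(classify("plain text"));
    }
}
```

A check like this only needs the start of the document, which is what makes it possible to sniff first and still hand the same stream to the real parser afterwards.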

Can you just call mark and reset?

    inputStream = new BufferedInputStream(this.url.openStream());
    inputStream.mark(2048); // or some other sensible number

    contentTypeParser.parse(inputStream);
    contentType = contentTypeParser.getContentType();
    contentEncoding = contentTypeParser.getContentEncoding();

    // reset() may fail if more than 2048 bytes were read since mark()
    inputStream.reset(); // let the parser have a crack at it
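
To make the mark/reset idea concrete, here is a small self-contained sketch. It uses an in-memory ByteArrayInputStream in place of url.openStream(), and the names MarkResetDemo, peek, drain, and SNIFF_LIMIT are made up for the example. The point is only that after reset() the next reader sees the document from its first byte again, as long as the sniffing step reads no more than the mark limit. InputStream.readAllBytes() requires Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the question's code.
class MarkResetDemo {

    static final int SNIFF_LIMIT = 2048; // assumed limit, tune as needed

    // Reads at most SNIFF_LIMIT bytes, then rewinds the stream to the mark.
    static String peek(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(SNIFF_LIMIT);
            byte[] buf = new byte[SNIFF_LIMIT];
            int total = 0;
            int n;
            while (total < SNIFF_LIMIT
                    && (n = in.read(buf, total, SNIFF_LIMIT - total)) != -1) {
                total += n;
            }
            in.reset(); // back to the marked position
            // ISO-8859-1 maps every byte to a char, which is enough for sniffing.
            return new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stands in for the real parser: consumes everything that is left.
    static String drain(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));

        String head = peek(in);  // content detection step
        String full = drain(in); // "parser" step on the same stream

        System.out.println("sniffed rss marker: " + head.contains("<rss"));
        System.out.println("parser saw whole document: " + full.equals(doc));
    }
}
```

With a real URL the same pattern would wrap url.openStream() once in a BufferedInputStream, so the document is downloaded a single time.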
