I started with the cspider application, which can be found here: http://sourceforge.net/project/showfiles.php?group_id=64424&package_id=106757
I added some code to build a basic Lucene search index. Currently this code also stores a copy of each page's text in the Lucene files, so the highlighter can use it on the search results page. The application seems to work well for sites with up to about 1000 pages; beyond that, the Lucene indexes grow too large to support practical in-site searches, and additional optimization would be required.
Additionally, I added the ability for the threads to use Windows authentication. I have been thinking of a way to let them use forms-based authentication, but have not hacked that feature together yet.
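As a rough illustration of what Windows authentication on a request can look like in .NET (this is only a sketch, not the spider's actual code; the AuthSketch class and CreateAuthenticatedRequest method are names made up for this example), the request can be handed the credentials of the account the process runs under:

```csharp
using System;
using System.Net;

public static class AuthSketch
{
    // Hypothetical helper: builds a request that authenticates as the
    // Windows account running the process. No network traffic happens
    // until GetResponse() is called on the returned request.
    public static HttpWebRequest CreateAuthenticatedRequest(Uri uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        // DefaultCredentials applies to NTLM, Negotiate and Kerberos
        // authentication, which is what an intranet site typically uses.
        request.Credentials = CredentialCache.DefaultCredentials;
        return request;
    }
}
```

To log in as a different account, a NetworkCredential built from a user name, password and domain could be assigned to Credentials instead.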
So build this project and run it. The output directory should already exist on your drive (in this case, a folder called db on my C: drive).
Set the URL to spider, and use the “begin index” button. The “begin Rip” button enables the site download functionality from the original spider author.
It will create a Lucene index (a collection of files that you should copy to a folder in your web project). For this site, it appears that the links use a redirector on the local domain, so in essence all of the pages that I link to were added to the index; their own links were spidered but rejected because they did not fall in the host domain (www.danbartels.com). The index for the 2369 URLs is 7.5 MB total, and this includes the page excerpts.
The spider creates a Lucene document for each URL on the host domain and populates the following fields (an excerpt from about line 191 of the DocumentWorker class):
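The host-domain check described above can be sketched roughly like this. This is an assumption about how such a filter might look, not the spider's real code; LinkFilter and IsOnHostDomain are hypothetical names.

```csharp
using System;

public static class LinkFilter
{
    // Hypothetical helper: returns true when a link found on a page
    // resolves to an http(s) URL on the same host as the start URL.
    public static bool IsOnHostDomain(Uri start, string link)
    {
        Uri candidate;
        // Resolves relative links such as "/Default.aspx" against the start URL.
        if (!Uri.TryCreate(start, link, out candidate))
            return false;

        // Skip mailto:, javascript: and other non-web schemes.
        if (candidate.Scheme != Uri.UriSchemeHttp &&
            candidate.Scheme != Uri.UriSchemeHttps)
            return false;

        return string.Equals(candidate.Host, start.Host,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

With a check like this, a redirector URL on www.danbartels.com passes because its host matches, while the links found on the external page it leads to are rejected.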
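// false = append to the existing index instead of creating a new one
iw = new Lucene.Net.Index.IndexWriter(m_spider.OutputPath, new Lucene.Net.Analysis.SimpleAnalyzer(), false);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
// Field.Text: tokenized, indexed and stored (this is the searchable field)
doc.Add(Lucene.Net.Documents.Field.Text("rawcontent", HTML));
// Field.UnIndexed: stored only, kept for the highlighter excerpts
doc.Add(Lucene.Net.Documents.Field.UnIndexed("content", StripTags(HTML)));
// Field.Keyword: indexed and stored as a single untokenized term
doc.Add(Lucene.Net.Documents.Field.Keyword("url", m_uri.AbsoluteUri));
doc.Add(Lucene.Net.Documents.Field.Keyword("title", GetTitleTag(HTML)));
doc.Add(Lucene.Net.Documents.Field.Keyword("meta", GetDescTag(HTML)));
iw.AddDocument(doc);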
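The excerpt calls StripTags and GetTitleTag without showing them. The sketch below is a stand-in for what helpers like these might do, written with regular expressions; it is an assumption for illustration, not the code from the download, and the HtmlSketch class name is made up. Regular expressions are a blunt tool for HTML, so treat this as good enough for excerpts rather than a real parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlSketch
{
    // Hypothetical stand-in for StripTags: reduces a page to plain text.
    public static string StripTags(string html)
    {
        // Drop script and style blocks together with their contents.
        string text = Regex.Replace(html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Replace every remaining tag with a space.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Hypothetical stand-in for GetTitleTag: returns the <title> text,
    // or an empty string when the page has no title.
    public static string GetTitleTag(string html)
    {
        Match m = Regex.Match(html, @"<title\b[^>]*>(.*?)</title\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success)
            return string.Empty;
        string title = WebUtility.HtmlDecode(m.Groups[1].Value);
        return Regex.Replace(title, @"\s+", " ").Trim();
    }
}
```

Note that this version of StripTags leaves the title text in its output, since only the tags themselves are removed.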
Next, copy the index (the files in your output directory) to a directory on your web server; I usually make one called /lucene.
Now we can set about building a simple interface to test the index.
Download the spider here: http://www.danbartels.com/Portals/0/code/csspider.20040929.zip
-Dan
If you do find some updates, please feel obliged to send them back to me so I can incorporate them =)