Dan Bartels online engineering logbook... Items that I find useful organized for my use, and indexed by google for yours. Here you can find information about Microsoft C# programming, SQL SERVER 2005, Telligent Systems CommunityServer, Dot Net Nuke, Windows Vista / Longhorn, and just about anything else that would compel me to type.

Search Engine Spider for Lucene

I started with the cspider application, which can be found here: http://sourceforge.net/project/showfiles.php?group_id=64424&package_id=106757

 

I added some code to build a basic Lucene search index. Currently this code also stores a copy of each page's text in the Lucene files so the highlighter application can use it on the search results page. This application seems to work well for sites with up to about 1000 pages; beyond that, the Lucene indexes grow too large to support practical in-site searches, and additional optimization would be required.

 

Additionally, I added the ability for the spider threads to use Windows authentication. I have been thinking of a way to let them use forms-based authentication, but have not hacked that feature in as of yet.
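As a rough illustration of the Windows authentication idea, here is a minimal sketch that attaches the current Windows logon credentials to a request. The class and method names (`AuthenticatedFetch`, `CreateRequest`) are made up for this example and are not taken from the spider source; the actual implementation in the download may differ.

```csharp
using System;
using System.Net;

public static class AuthenticatedFetch
{
    // Hypothetical helper, not from the spider source.
    // Builds an HTTP request that offers the credentials of the account
    // running the spider when the server asks for Windows authentication.
    public static HttpWebRequest CreateRequest(Uri uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Credentials = CredentialCache.DefaultCredentials;
        return request;
    }
}
```

Each worker thread would then call `GetResponse()` on the returned request as usual. Forms-based authentication would need a different approach, since it relies on a login post and cookies rather than on credentials attached to the request.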

 

So build this project and run it. The output directory should already exist on your drive (in this case, a folder called db on my C drive).

 

[Image: C# Spider]

 

Set the URL to spider, and use the “begin index” button. The “begin Rip” button enables the site download functionality from the original spider author.

 

It will create a Lucene index (a collection of files that you should copy to a folder in your web project). For this site, it appears that the links use a redirector on the local domain, so in essence all of the pages that I link to were added to the index. Their own links were spidered but rejected because they did not fall in the host domain (www.danbartels.com). The index for the 2369 URLs is 7.5 MB total, and this includes the page excerpts.
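The host-domain rejection described above can be sketched roughly like this. The `HostFilter` class and `ShouldIndex` method are hypothetical names for illustration only; the spider's real check may be written differently.

```csharp
using System;

public static class HostFilter
{
    // Hypothetical helper, not from the spider source.
    // Resolves a link found on a page against that page's URL and reports
    // whether the result is an http/https URL on the host being spidered.
    public static bool ShouldIndex(Uri page, string href, string host)
    {
        Uri resolved;
        if (!Uri.TryCreate(page, href, out resolved))
        {
            return false; // malformed link
        }

        bool isWeb = resolved.Scheme == Uri.UriSchemeHttp ||
                     resolved.Scheme == Uri.UriSchemeHttps;

        return isWeb &&
               string.Equals(resolved.Host, host, StringComparison.OrdinalIgnoreCase);
    }
}
```

Note that a check like this only looks at the link itself. A redirector URL on the local domain passes the test even when it ends up serving an off-site page, which matches the behavior seen on this site.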

 

The spider creates a Lucene document for each URL on the host domain and populates the following fields (an excerpt from about line 191 of the DocumentWorker class):

 

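// Open the existing index in the output directory (create = false, so documents are appended).
iw = new Lucene.Net.Index.IndexWriter(m_spider.OutputPath, new Lucene.Net.Analysis.SimpleAnalyzer(), false);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
// Raw HTML: tokenized, indexed, and stored. This is the only tokenized field.
doc.Add(Lucene.Net.Documents.Field.Text("rawcontent", HTML));
// Tag-stripped text: stored but not indexed, kept for the highlighter.
doc.Add(Lucene.Net.Documents.Field.UnIndexed("content", StripTags(HTML)));
// Keyword fields are indexed and stored as single, untokenized values.
doc.Add(Lucene.Net.Documents.Field.Keyword("url", m_uri.AbsoluteUri));
doc.Add(Lucene.Net.Documents.Field.Keyword("title", GetTitleTag(HTML)));
doc.Add(Lucene.Net.Documents.Field.Keyword("meta", GetDescTag(HTML)));
iw.AddDocument(doc);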
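The excerpt calls `StripTags` and `GetTitleTag` without showing them. The real versions are in the download; what follows is only a guess at what such helpers could look like, using regular expressions. The `HtmlHelpers` class name is an assumption, and regex-based stripping is a simplification that will not handle every malformed page.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlHelpers
{
    // Illustrative only: the actual StripTags in the spider may differ.
    // Drops script/style blocks, replaces remaining tags with spaces,
    // decodes entities, and collapses runs of whitespace.
    public static string StripTags(string html)
    {
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", " ");
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Illustrative only: returns the text inside the first <title> element,
    // or an empty string when the page has none.
    public static string GetTitleTag(string html)
    {
        Match m = Regex.Match(html, @"<title\b[^>]*>(.*?)</title\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success
            ? Regex.Replace(m.Groups[1].Value, @"\s+", " ").Trim()
            : string.Empty;
    }
}
```

Because the stripped text goes into a stored-only field, it adds to the index size but not to the searchable terms, which is the trade-off made for the highlighter.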

 

Next, copy the index (the files in your output directory) to a directory on your web server. I usually make one called /lucene.
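If you would rather script the copy step than do it by hand, something like the sketch below would work. The `IndexDeployer` name is invented for this example, and the paths you pass in are up to you. It assumes the spider has finished and closed the index before the copy starts.

```csharp
using System.IO;

public static class IndexDeployer
{
    // Hypothetical helper, not part of the spider download.
    // Copies every file from the spider's output directory into the target
    // directory, overwriting older copies, and returns the number of files copied.
    // A Lucene index on disk is a flat set of files, so no recursion is needed.
    public static int CopyIndex(string sourceDir, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        int copied = 0;
        foreach (string file in Directory.GetFiles(sourceDir))
        {
            string destination = Path.Combine(targetDir, Path.GetFileName(file));
            File.Copy(file, destination, true);
            copied++;
        }
        return copied;
    }
}
```

For example, `IndexDeployer.CopyIndex(@"c:\db", webRoot + @"\lucene")`, where `webRoot` stands in for wherever your site lives.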

 

Now we can set about building a simple interface to test the index.

 

Download the spider at (http://www.danbartels.com/Portals/0/code/csspider.20040929.zip)

 

-Dan

 

If you do find some updates, please feel obliged to send them back to me so I can incorporate them =)

Published Wednesday, September 29, 2004 8:42 PM by DanB


Comments

# re: Search Engine Spider for Lucene @ Thursday, February 17, 2005 11:09 AM

Is your C# Spider still available for download?

Thanks ...

Matt Pasiew

# re: Search Engine Spider for Lucene @ Friday, February 18, 2005 8:57 AM

The source location has been updated....

Dan

Dan Bartels

# re: Search Engine Spider for Lucene @ Wednesday, August 29, 2007 3:07 AM

What modifications would be required to spider sites with, let's say, 10000 pages?

Steve
