I have been very interested in Lucene and Solr over the past few weeks and see them as a great alternative way to store and query data over the web. Think of Solr as a free, self-hosted take on Google-style search for your own data…
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
See the complete feature list for more details.
Solr can ingest its example data set out of the box, and that example comes with predefined schema.xml and solrconfig.xml files, which are the core Solr configuration files. I tested the version from the Apache website and also the distribution from the Lucid Imagination website, which seems to be a little more comprehensive in what it includes.
Lucid actually provides a cross-platform installer, so you do not have to run the Ant build. I like that a lot better.
Anyway, I was looking for an easy way to ingest spatial data, and I found that you can update a Solr index with little to no effort from a properly formed CSV file. I downloaded one of the Geonames databases and converted it to CSV and voila…in a matter of minutes I had the whole thing ingested and ready to be queried through the Solr web interface.
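The conversion step can be scripted. Here is a minimal sketch, assuming the Geonames download is a tab-delimited, unquoted UTF-8 text file (which is how the dumps I have seen are laid out). The function name geonames_tsv_to_csv is my own and not part of any library.

```python
import csv


def geonames_tsv_to_csv(src, dst):
    """Copy tab-delimited rows from src to dst as comma-delimited CSV.

    src and dst are open text file objects. Returns the number of rows written.
    """
    # Assumption: the input is tab-delimited with no quoting, so quote
    # characters inside values are treated as ordinary text on read.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The writer quotes any value that itself contains a comma
    # (the alternatenames column usually does).
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count
```

To run it against real files, open both with encoding="utf-8" and newline="" and pass the file objects in; the file names are whatever you downloaded and whatever you want Solr to read.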
There are a couple of caveats that I would like to share…
- The CSV must be comma-delimited. Solr has provisions for other delimiters, but in my testing none of them worked unless the delimiter was an actual printable character in the file (e.g., a comma or backslash).
- You must modify the file called schema.xml to declare one field per CSV column. The field definitions I used are listed further down, after the example query.
- You can then import the CSV using the following commands. The key is to define the column names using the parameter “fieldnames”.
curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=geonameid,name,asciiname,alternatenames,latitude,longitude,featureclass,featurecode,countrycode,cc2,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=/Users/Adam/LucidWorks/cities5000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8'
Next, run the “optimize” command.
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "<optimize />"
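If you would rather build the import URL in code than type it by hand, something like the following should work. It is only a sketch: build_csv_update_url and GEONAMES_FIELDS are names I made up, and the base URL assumes the same local Solr instance as the curl command above. Note that urlencode percent-encodes the commas, slashes and semicolon, which Solr should decode back to the same values.

```python
from urllib.parse import urlencode

# Column names in the same order as the columns of the CSV file.
GEONAMES_FIELDS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "featureclass", "featurecode", "countrycode", "cc2",
    "admin1code", "admin2code", "admin3code", "admin4code", "population",
    "elevation", "gtopo30", "timezone", "modificationdate",
]


def build_csv_update_url(csv_path, base_url="http://localhost:8983/solr"):
    """Return the /update/csv URL that asks Solr to load csv_path from its local disk."""
    params = {
        "commit": "true",
        "separator": ",",
        # The CSV has no header row, so the column names are passed explicitly.
        "fieldnames": ",".join(GEONAMES_FIELDS),
        "stream.file": csv_path,
        "overwrite": "true",
        "stream.contentType": "text/plain;charset=utf-8",
    }
    return base_url + "/update/csv?" + urlencode(params)
```

Calling build_csv_update_url("/Users/Adam/LucidWorks/cities5000.csv") gives a URL you can hand to curl or to urllib.request.urlopen against a running server.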
- Now you have an index that can be queried using the built-in Solr query page (http://localhost:8983/solr/admin). Please read the Solr documentation on which params to use. The following example query shows how to use the “mlt” (MoreLikeThis) params.
http://localhost:8983/solr/select/?q=alternatenames%3A%22el+Paso%22&version=2.2&start=0&rows=10&indent=on&mlt=true&mlt.fl=name&mlt.mindf=1&mlt.mintf=1&wt=json
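The same query can be assembled and fetched from a script. This is a rough sketch using only the Python standard library; build_mlt_query_url and fetch_json are hypothetical helper names of mine, and the parameters simply mirror the example URL above. It does not escape quotes inside the phrase, so treat it as a starting point.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_mlt_query_url(field, phrase, base_url="http://localhost:8983/solr", rows=10):
    """Return a /select URL for a phrase query on one field with MoreLikeThis enabled."""
    params = {
        "q": '%s:"%s"' % (field, phrase),  # e.g. alternatenames:"el Paso"
        "version": "2.2",
        "start": 0,
        "rows": rows,
        "indent": "on",
        "mlt": "true",       # switch MoreLikeThis on
        "mlt.fl": "name",    # field used to find similar documents
        "mlt.mindf": 1,      # minimum document frequency
        "mlt.mintf": 1,      # minimum term frequency
        "wt": "json",
    }
    return base_url + "/select/?" + urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode the body as JSON (needs a running Solr server)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, fetch_json(build_mlt_query_url("alternatenames", "el Paso")) should return the parsed response as a dictionary.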
The schema.xml field definitions:
<field name="geonameid" type="string" indexed="true" stored="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="asciiname" type="text" indexed="true" stored="true" />
<field name="alternatenames" type="text" indexed="true" stored="true" />
<field name="latitude" type="string" indexed="true" stored="true" />
<field name="longitude" type="string" indexed="true" stored="true" />
<field name="featureclass" type="text" indexed="true" stored="true" />
<field name="featurecode" type="string" indexed="true" stored="true" />
<field name="countrycode" type="string" indexed="true" stored="true" />
<field name="cc2" type="string" indexed="true" stored="true" />
<field name="admin1code" type="string" indexed="true" stored="true" />
<field name="admin2code" type="string" indexed="true" stored="true" />
<field name="admin3code" type="string" indexed="true" stored="true" />
<field name="admin4code" type="string" indexed="true" stored="true" />
<field name="population" type="string" indexed="true" stored="true" />
<field name="elevation" type="string" indexed="true" stored="true" />
<field name="gtopo30" type="string" indexed="true" stored="true" />
<field name="timezone" type="string" indexed="true" stored="true" />
<field name="modificationdate" type="string" indexed="true" stored="true" />
<uniqueKey>geonameid</uniqueKey>
There are quite a few things you can do once you get the data in the index. You can perform many different kinds of standard searches, faceted searches, MoreLikeThis queries, etc. For geospatial data, we will have to extend the query response writer classes to return KML rather than standard XML, JSON or CSV. More on that later…
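In the meantime, a client-side stopgap is possible. The sketch below is not the Solr extension described above. It just takes documents from a JSON response (assuming each has the name, latitude and longitude fields from my schema) and wraps them in KML placemarks. The function name docs_to_kml is made up for this example.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def docs_to_kml(docs):
    """Turn a list of Solr result dicts into a KML document string."""
    # Serialize with KML as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    document = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for doc in docs:
        placemark = ET.SubElement(document, "{%s}Placemark" % KML_NS)
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = doc["name"]
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        # KML coordinates are written longitude first, then latitude.
        coords = "%s,%s" % (doc["longitude"], doc["latitude"])
        ET.SubElement(point, "{%s}coordinates" % KML_NS).text = coords
    return ET.tostring(kml, encoding="unicode")
```

You would feed it the docs list found under the response key of a wt=json result and save the returned string with a .kml extension.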
Enjoy!