Solr and Geonames

I have been very interested in Lucene and Solr over the past few weeks and look at them as a great alternative way to store and query data over the web. Think of Solr as a free version of Google…

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

See the complete feature list for more details.

Solr ships with an example data set that you can ingest out of the box, and with it come predefined schema.xml and solrconfig.xml files, which are the core Solr configuration files. I tested the version from the Apache website and also the distribution found at the Lucid Imagination website, which seems to be a little more comprehensive as far as what is included.

Lucid actually provides a cross platform installer so you do not have to run the ant build mechanism. I like that a lot better ;-) .

Anyway, I looked for an easy way to ingest spatial data and found that you can update a Solr index with very little effort from a properly formed CSV file. I downloaded one of the Geonames databases, converted it to CSV and voila… in a matter of minutes I had the whole thing ingested and ready to be queried through the Solr web interface.
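The conversion step can be sketched in a few lines of Python. This is only a minimal sketch, assuming the Geonames download is the usual tab-delimited, UTF-8 text file; the function name and the file names in the usage note are placeholders rather than anything from the original workflow.

```python
import csv


def geonames_to_csv(src_path, dst_path):
    """Rewrite a tab-delimited Geonames dump as a comma-delimited CSV file."""
    rows = 0
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Assumption: rows are plain tab-separated text with no quoting.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # The writer quotes any value that itself contains a comma,
        # which is common in the alternatenames column.
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Calling something like geonames_to_csv("cities5000.txt", "cities5000.csv") would return the number of rows written; the input file name here is an assumption about what the download contains.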

There are a couple of caveats that I would like to share…

  1. The CSV must be comma delimited. Solr does have provisions (the “separator” parameter) that allow you to use different delimiters, but none of them work unless the delimiter is an actual printable character in the file (e.g. a comma or backslash).
  2. You must modify the file called schema.xml.
  3. Declare one field for each Geonames column and make geonameid the unique key, as follows.
    
       <field name="geonameid" type="string" indexed="true" stored="true" />
       <field name="name" type="text" indexed="true" stored="true" />
       <field name="asciiname" type="text" indexed="true" stored="true" />
       <field name="alternatenames" type="text" indexed="true" stored="true" />
       <field name="latitude" type="string" indexed="true" stored="true" />
       <field name="longitude" type="string" indexed="true" stored="true" />
       <field name="featureclass" type="text" indexed="true" stored="true" />
       <field name="featurecode" type="string" indexed="true" stored="true" />
       <field name="countrycode" type="string" indexed="true" stored="true" />
       <field name="cc2" type="string" indexed="true" stored="true" />
       <field name="admin1code" type="string" indexed="true" stored="true" />
       <field name="admin2code" type="string" indexed="true" stored="true" />
       <field name="admin3code" type="string" indexed="true" stored="true" />
       <field name="admin4code" type="string" indexed="true" stored="true" />
       <field name="population" type="string" indexed="true" stored="true" />
       <field name="elevation" type="string" indexed="true" stored="true" />
       <field name="gtopo30" type="string" indexed="true" stored="true" />
       <field name="timezone" type="string" indexed="true" stored="true" />
       <field name="modificationdate" type="string" indexed="true" stored="true" />
     
     <uniqueKey>geonameid</uniqueKey>
  4. You can then import the CSV using the following commands. The key is to define the column names using the parameter “fieldnames”.

    curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=geonameid,name,asciiname,alternatenames,latitude,longitude,featureclass,featurecode,countrycode,cc2,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=/Users/Adam/LucidWorks/cities5000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8'

    Next, run the “optimize” command.

    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "<optimize />"

  5. Now you have an index that can be queried using the built-in Solr query page (http://localhost:8983/solr/admin). Please read the Solr page on which params to use. A good example query follows; it shows how to use the “mlt” or MoreLikeThis params.

    http://localhost:8983/solr/select/?q=alternatenames%3A%22el+Paso%22&version=2.2&start=0&rows=10&indent=on&mlt=true&mlt.fl=name&mlt.mindf=1&mlt.mintf=1&wt=json
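If you would rather build that query from a script than type it by hand, something like the following should work. It is a sketch only: the function names are made up for illustration, and it assumes the same local Solr instance and field names used above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the same local Solr instance used in the steps above.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def mlt_query_url(field, phrase, mlt_field="name", rows=10):
    """Build a phrase query on one field with MoreLikeThis switched on."""
    params = [
        ("q", '%s:"%s"' % (field, phrase)),
        ("start", 0),
        ("rows", rows),
        ("indent", "on"),
        ("mlt", "true"),
        ("mlt.fl", mlt_field),
        ("mlt.mindf", 1),
        ("mlt.mintf", 1),
        ("wt", "json"),
    ]
    # urlencode takes care of escaping the colon, quotes and spaces.
    return SOLR_SELECT + "?" + urlencode(params)


def run_query(url):
    """Fetch the URL and decode the JSON body that wt=json asks for."""
    with urlopen(url) as response:
        return json.load(response)


url = mlt_query_url("alternatenames", "el Paso")
```

Passing the resulting url to run_query() needs a running Solr instance, so that call is left out of the snippet.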

There are quite a few things you can do once you get the data into the index. You can perform many different kinds of standard searches, faceted searches, MoreLikeThis, etc… For geospatial data, we will have to extend Solr’s query response writer classes to return KML rather than standard XML, JSON or CSV. More on that later…
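Until a KML response writer exists, one stopgap is to convert the JSON results on the client side. The sketch below is not the Solr extension described above; it is a hypothetical helper that assumes each result document carries the name, latitude and longitude fields from the schema in this post.

```python
from xml.sax.saxutils import escape


def docs_to_kml(docs):
    """Turn Solr result documents into a minimal KML document of point placemarks."""
    placemarks = []
    for doc in docs:
        placemarks.append(
            "  <Placemark>\n"
            "    <name>%s</name>\n"
            # KML expects longitude first, then latitude.
            "    <Point><coordinates>%s,%s</coordinates></Point>\n"
            "  </Placemark>" % (
                escape(str(doc["name"])),
                escape(str(doc["longitude"])),
                escape(str(doc["latitude"])),
            )
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n%s\n</Document>\n</kml>\n" % "\n".join(placemarks)
    )
```

The list of documents would come from the parsed JSON response of a query such as the one shown earlier.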

Enjoy!

Posted in Apache Project, GeoSpatial

FOSS4Geo Software Lists

I am a “moderator” at the GIS Forum, and when we were setting up the site I figured there was no better way to get the message out about free and open source software for the geospatial industry than to add the list at http://opensourcegis.org/ to the Forum wiki. No worries, I got full permission from the author before posting it as a live wiki page… Anyway, the page can be found here, and we really encourage everyone to post the projects they come across so that other people become aware of them. After all, free and open software projects thrive on the communities that use them. I for one have become an absolute nut when it comes to FOSS technologies. Chances are that if you have a problem, someone else has solved it and posted their source code to the web in some shape or form. Of course, that isn’t always true because, let’s face it, sometimes you or your client simply have to pay for software and services. I wrote a little article a while back with my thoughts on Choosing the GIS that’s Right for You. Sometimes paying for what you’re looking for is worth it, right? After all, time is money… What are your thoughts on this?

Posted in The GIS Forum

MapProxy Project for Accelerating your WMS

There are a few open source WMS caching proxies that are freely available for anyone to download and use. GeoWebCache and TileCache have been the two leaders in the map caching game for a while now. The problem with these is that they only output WMS-C, which most WMS clients can’t read. MapProxy can ingest WMS, TMS and KML and then serve it back out as a valid WMS service so that all modern GIS clients can read it.

MapProxy can perform a wide variety of other services that will help to ensure that the WMS that is produced is fully optimized to meet your needs.

MapProxy can:

* accelerate existing WMS
* reproject to other SRS (e.g. cache in EPSG:4326, requests in EPSG:31467)
* combine individual map layers from different WMS services
* hide the origin WMS servers
* fill caches dynamically, in advance or both
* add watermarks and/or attributions to all responses

Some of the features of the new release (from the Mailing List):

* There is a new seed tool that is far more advanced than existing tools.
See below for more information.

* There is a new link_single_color_images option for layers. If enabled
MapProxy will not store tiles that only contain a single color as a
separate file. With the new option MapProxy stores these tiles only
once and uses links to this file for every occurrence. This can reduce
the size of your tile cache if you have larger areas with no data
(e.g. water areas, areas with no roads, etc.). This feature is only
available on Unix since Windows has no support for symbolic links.

* We have added some performance improvements for servers with multiple
layers and layers with smaller BBOXs.

* You can configure your own proj4 definition files. For example, if you
need to tweak some projection parameters.

More information about the new seeding tool:

* Fine control of the seeding area. You are not restricted to BBOX
anymore and can now load polygons that define the area you want to
seed. You can load the geometries from text files (WKT) and any data
source that is supported by OGR (Shapefile, PostGIS, etc.). This can
reduce the number of tiles to seed dramatically.
See http://mapproxy.org/docs/0.8.3.dev-20100430/seed.html#geographical-extend

* Uses multiple processes for the whole seeding chain (request images,
split, process, encode and store tiles) for better performance on
multi-core systems. A new --concurrency option lets you define the
number of processes.

* New sophisticated seeding strategy. The existing tools (MapProxy,
TileCache, GeoWebCache) seed from top level to bottom level, from
north to south, from east to west. This simple strategy works against
the caches of your operating system and database and results in
unneeded IO load and thus slower seeding performance. If you have
large datasets your database will have to drop cached data when your
seeding area moves to the south. When one level is seeded and the
seeding tool starts in the north of the next level all data cached by
your database and operating system is gone and needs to be loaded from
hard disk again.
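
To picture the simple strategy that the last point criticizes, here is a toy sketch of that visiting order. It is not MapProxy code; the function name and the square grid of 2^level by 2^level tiles, with row 0 in the north and column 0 in the west, are assumptions made purely for illustration.

```python
def naive_seed_order(max_level):
    """Yield (level, column, row) in the simple order described above."""
    for level in range(max_level + 1):           # top level to bottom level
        size = 2 ** level                        # assumed tiles per side
        for row in range(size):                  # north to south
            for col in reversed(range(size)):    # east to west
                yield (level, col, row)


order = list(naive_seed_order(1))
```

Every level restarts in the north, which is why data cached while seeding the southern rows of one level is of little use when the next level begins.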

This is another fine example of a project that far exceeds the needs of the community who uses it! Thanks Guys…

Posted in OSGeo, Web Mapping

OpenSearching with Geocommons

If you are familiar with Geocommons you probably know that it’s very difficult to bulk download data. After chatting with Kate this past week, I came to realize that Geocommons does allow you to perform an OpenSearch-style search on its data back end. This prompted me to do a little hacking and the result is the following Python script. It’s not very sophisticated and I wish I knew more about Python, but it does get the job done. You’ll note right away that it makes the OpenSearch-style query to Finder and gets back a KML of the data within the bounds of your search. For example, if you want to find all the data with the query term ‘Haiti’ within the bounds of the island nation, the request would resemble the following:

http://finder.geocommons.com/search.kml?query=Haiti&limit=100&bbox=-72.725,17.993,-71.715,19.049

You have to parse the KML to get to the page that holds each data set. Then you have to convert the ‘.html’ extension to ‘.zip’. Wget can then be employed to do the actual downloading from the parsed KML. You can also recursively search through the directory and unzip all the ZIP files, but I have not got that far yet…

#!/usr/bin/env python3
'''
Created on May 17, 2010

@author: adamestrada
'''
import subprocess
import sys
import urllib.error
import urllib.parse
import urllib.request
from optparse import OptionParser
from xml.dom import minidom

# Example request:
# http://finder.geocommons.com/search.kml?query=Haiti&limit=100&bbox=-72.725,17.993,-71.715,19.049
FINDER_URL = 'http://finder.geocommons.com/search.kml'


# =============================================================================
# Parse Command Line Args
# =============================================================================
def parse_args(argv):
    # optparse is kept because it accepts a bbox value that starts with "-".
    parser = OptionParser(usage="%prog -q QUERY -l LIMIT -b BBOX", version="%prog 1.0")
    parser.add_option("-q", "--query", help="Enter a query term (e.g. Haiti)")
    parser.add_option("-l", "--limit", help="Record limit, greater than 1 (e.g. 100)")
    parser.add_option("-b", "--bbox", help="Enter the query extent (e.g. -74.049,17.841,-71.874,20.052)")
    options, _ = parser.parse_args(argv)
    if not (options.query and options.limit and options.bbox):
        parser.print_help()
        sys.exit(1)
    return options


# =============================================================================
# Build URL with appropriate querystring params.
# =============================================================================
def build_url(options):
    params = {'query': options.query, 'limit': options.limit, 'bbox': options.bbox}
    return FINDER_URL + '?' + urllib.parse.urlencode(params)


# =============================================================================
# Get KML from Finder, parse it, then download ZIP files...
# =============================================================================
def download_zips(full_url):
    try:
        with urllib.request.urlopen(full_url) as response:
            xml = minidom.parseString(response.read())
    except urllib.error.HTTPError as e:
        print("Cannot retrieve URL: HTTP Error Code", e.code)
        return
    except urllib.error.URLError as e:
        print("Cannot retrieve URL:", e.reason)
        return

    for node in xml.getElementsByTagName('atom:link'):
        href = node.getAttribute('href')
        # Each link points at the data set's .html page; swap the extension for .zip.
        if href.endswith('.html'):
            href = href[:-len('.html')] + '.zip'
        # Use Wget to download the files.
        # Download wget here: http://www.gnu.org/software/wget/
        subprocess.call(['wget', '-r', '-nd', '-A', 'zip', '-P', 'Files', href])


# =============================================================================
# Program mainline.
# =============================================================================
if __name__ == '__main__':
    options = parse_args(sys.argv[1:])
    full_url = build_url(options)
    print(full_url)
    download_zips(full_url)

python getfiles.py -q Haiti -l 100 -b -72.725,17.993,-71.715,19.049

This script will download all the .zip files associated with your query to a directory called “Files”. Enjoy!
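
The unzip step that I have not got to yet could look roughly like the following. This is an untested-against-Geocommons sketch using only the standard library; the function name is made up, and unpacking each archive into a folder named after it is just one possible layout.

```python
import os
import zipfile


def unzip_all(root="Files"):
    """Walk root recursively and unpack every .zip into a sibling folder."""
    extracted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(".zip"):
                continue
            archive = os.path.join(dirpath, filename)
            # data.zip is unpacked into a folder called data next to it.
            target = os.path.splitext(archive)[0]
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target)
            except zipfile.BadZipFile:
                print("Skipping unreadable archive:", archive)
                continue
            extracted.append(target)
    return extracted
```

Running unzip_all() after the download finishes would return the list of folders it created.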

Posted in GeoSpatial

Web Crawling with Nutch

I came across Nutch the other day and finally got to the point where I can actually crawl and index web pages. The following are the steps needed to get you up and running.

  1. Download Nutch and build it. For this you will need Ant installed and configured properly.
  2. adam:nutch-1.0 adamestrada$ ant

    This will automatically detect the build.xml that manages the build process.

  3. Test to see if Nutch is properly installed
  4. You may have to set your JAVA_HOME environment variable if it’s not already set.

    JAVA_HOME=/Library/Java/Home
    export JAVA_HOME

    Now JAVA_HOME should be set and you are ready to move forward…

    adam:nutch-1.0 adamestrada$ bin/nutch
    Usage: nutch [-core] COMMAND
    where COMMAND is one of:
    crawl one-step crawler for intranets
    readdb read / dump crawl db
    convdb convert crawl db from pre-0.9 format
    mergedb merge crawldb-s, with optional filtering
    readlinkdb read / dump link db
    inject inject new urls into the database
    generate generate new segments to fetch from crawl db
    freegen generate new segments to fetch from text files
    fetch fetch a segment’s pages
    parse parse a segment’s pages
    readseg read / dump segment data
    mergesegs merge several segments, with optional filtering and slicing
    updatedb update crawl db from segments after fetching
    invertlinks create a linkdb from parsed segments
    mergelinkdb merge linkdb-s, with optional filtering
    index run the indexer on parsed segments and linkdb
    solrindex run the solr indexer on parsed segments and linkdb
    merge merge several segment indexes
    dedup remove duplicates from a set of segment indexes
    solrdedup remove duplicates from solr
    plugin load a plugin and run one of its classes main()
    server run a search server
    or
    CLASSNAME run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.

    Expert: -core option is for developers only. It avoids building the job jar,
    instead it simply includes classes compiled with ant compile-core.
    NOTE: this works only for jobs executed in ‘local’ mode

  5. We now have to create a directory called “urls” and a flat file that will hold all the URLs that we are going to crawl.
  6. This is simple… Just create a file called “nutch” in that directory and add a single URL to it.

    I added http://www.thegisforum.com to mine…

  7. Now we need to modify the file located at conf/crawl-urlfilter.txt.
  8. Change the accept rule near the bottom from MY.DOMAIN.NAME to the domain you are crawling. Mine looks like this:
    
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
     
    # The url filter file used by the crawl command.
     
    # Better for intranet crawling.
    # Be sure to change MY.DOMAIN.NAME to your domain name.
     
    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'.  The first matching pattern in the file
    # determines whether a URL is included or ignored.  If no pattern
    # matches, the URL is ignored.
     
    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):
     
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
     
    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]
    -[*!]
     
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
     
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*thegisforum.com/
     
    # skip everything else
    -.
  9. Now set a few properties in conf/nutch-site.xml
  10. At a minimum, http.agent.name must not be empty. Mine looks like this:
    
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     
    <!-- Put site-specific property overrides in this file. -->
     
    <configuration>
    <property>
      <name>http.agent.name</name>
      <value>awe</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
      please set this to a single word uniquely related to your organization.
     
      NOTE: You should also check other related properties:
     
    	http.robots.agents
    	http.agent.description
    	http.agent.url
    	http.agent.email
    	http.agent.version
     
      and set their values appropriately.
     
      </description>
    </property>
     
    <property>
      <name>http.agent.description</name>
      <value>Adam's Awesome Bot!!!</value>
      <description>Further description of our bot- this text is used in
      the User-Agent header.  It appears in parenthesis after the agent name.
      </description>
    </property>
     
    <property>
      <name>http.agent.url</name>
      <value>http://www.adamestrada.com</value>
      <description>A URL to advertise in the User-Agent header.  This will 
       appear in parenthesis after the agent name. Custom dictates that this
       should be a URL of a page explaining the purpose and behavior of this
       crawler.
      </description>
    </property>
     
    <property>
      <name>http.agent.email</name>
      <value>estrada.adam@gmail.com</value>
      <description>An email address to advertise in the HTTP 'From' request
       header and User-Agent header. A good practice is to mangle this
       address (e.g. 'info at example dot com') to avoid spamming.
      </description>
    </property>
    </configuration>
  11. You’re ready to start crawling now!
  12. adam:nutch-1.0 adamestrada$ bin/nutch crawl urls -depth 5 -topN 100
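
Before kicking off a long crawl it can help to check which URLs the filter file above would let through. The following is a rough Python imitation of the rule described in the file's own comments (the first matching pattern decides, and no match means ignored). It is not Nutch code, and it assumes the patterns are applied as unanchored searches.

```python
import re

# The non-comment lines of the crawl-urlfilter.txt shown above, as (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm"
          r"|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[*!]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://([a-z0-9]*\.)*thegisforum.com/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # If no pattern matches, the URL is ignored.
    return False
```

For example, accepts("http://www.thegisforum.com/") passes, while an image link on the same host or any URL on another domain is rejected.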

There are many more options to choose from with Nutch, so I may explore them in future posts. Enjoy…

Posted in Apache Project