How to convert features of layer from POLYGON to MULTIPOLYGON?
How can I convert features of a layer from POLYGON to MULTIPOLYGON? I know how to do that using the PostGIS function ST_Multi, but how can I do the same thing for a layer in QGIS?
If you want to do it based on a field, you can do this in QGIS from the menu: Vector-->Geometry Tools-->Singleparts to Multipart (requires at least two polygons to share an attribute that you specify).
There is a more direct equivalent to ST_Multi in OGR. I didn't find a way to access this specific OGR functionality through QGIS, but it can be done with GDAL/OGR like this:
ogr2ogr -nlt MULTIPOLYGON multipolygon_output.shp polygon_input.shp
More details on the -nlt switch are available on the ogr2ogr page.
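To make concrete what promoting a POLYGON to a MULTIPOLYGON amounts to, here is a minimal Python sketch that works on 2D WKT text only. The helper name to_multipolygon_wkt is made up for illustration; this is not how OGR or PostGIS implement the conversion, just the same idea: the polygon becomes the single member of a multipolygon.

```python
def to_multipolygon_wkt(wkt: str) -> str:
    """Wrap a 2D POLYGON WKT string as a one-member MULTIPOLYGON.

    MULTIPOLYGON input is returned unchanged. Hypothetical helper, for illustration only.
    """
    text = wkt.strip()
    upper = text.upper()
    if upper.startswith("MULTIPOLYGON"):
        return text
    if not upper.startswith("POLYGON"):
        raise ValueError("expected POLYGON or MULTIPOLYGON WKT")
    body = text[len("POLYGON"):].strip()  # e.g. "((0 0, 1 0, 1 1, 0 0))"
    if body.upper() == "EMPTY":
        return "MULTIPOLYGON EMPTY"
    if not body.startswith("("):
        raise ValueError("only plain 2D POLYGON WKT is handled in this sketch")
    # One extra pair of parentheses turns the ring list into a list of polygons.
    return f"MULTIPOLYGON ({body})"


print(to_multipolygon_wkt("POLYGON ((0 0, 1 0, 1 1, 0 0))"))
```

The geometry itself does not change; only the container type does, which is why the conversion is always safe in this direction.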
While still waiting for the simple QGIS solution you can have a look at how it goes with OpenJUMP through the right-click menu:
R as GIS for Economists
Here, we learn how different types of sfg are constructed. We also learn how to create sfc and sf from sfg from scratch. [37]
2.2.1 Simple feature geometry (sfg)
The sf package uses a class of sfg (simple feature geometry) objects to represent the geometry of a single geometric feature (say, a city as a point, a river as a line, a county or school district as a polygon). There are different types of sfgs. Here are some example feature types that we commonly encounter as economists [38]:
- POINT: area-less feature that represents a point (e.g., well, city, farmland)
- LINESTRING: (e.g., a tributary of a river)
- MULTILINESTRING: (e.g., river with more than one tributary)
- POLYGON: geometry with a positive area (e.g., county, state, country)
- MULTIPOLYGON: collection of polygons to represent a single object (e.g., countries with islands: U.S., Japan)
POINT is the simplest geometry type and is represented by a vector of two [39] numeric values. An example below shows how a POINT feature can be made from scratch:
The st_point() function creates a POINT object when supplied with a vector of two numeric values. If you check the class of the newly created object,
you can see that it’s indeed a POINT object. But it’s also an sfg object. So, a_point is an sfg object of type POINT.
LINESTRING objects are represented by a sequence of points:
s1 is a matrix where each row represents a point. By applying the st_linestring() function to s1, you create a LINESTRING object. Let’s see what the line looks like.
As you can see, each pair of consecutive points in the matrix is connected by a straight line to form a line.
A POLYGON is very similar to LINESTRING in the manner it is represented.
Just like the LINESTRING object we created earlier, a POLYGON is represented by a collection of points. The biggest difference between them is that we need to have some positive area enclosed by lines connecting the points. To do that, the first and last points must be the same so that the loop is closed: here, it’s c(0,0). A POLYGON can have a hole in it. The first matrix of a list becomes the exterior ring, and all the subsequent matrices will be holes within the exterior ring.
You can create a MULTIPOLYGON object in a similar manner. The only difference is that you supply a list of lists of matrices, with each inner list representing a polygon. An example below:
Each of list(p1,p2), list(p3,p4), and list(p5) represents a polygon. You supply a list of these lists to the st_multipolygon() function to make a MULTIPOLYGON object.
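The nesting (a list of polygons, each a list of rings, each a sequence of points) can also be illustrated outside of sf. The following is a rough, language-neutral sketch in Python rather than R; the function names and coordinates are invented for illustration and are not part of the sf API. It writes the nested structure out as WKT and checks that every ring is closed.

```python
def ring_wkt(ring):
    """Format one ring, a list of (x, y) pairs, as '(x y, x y, ...)'."""
    return "(" + ", ".join(f"{x:g} {y:g}" for x, y in ring) + ")"


def multipolygon_wkt(polygons):
    """polygons: list of polygons; each polygon is a list of rings.

    The first ring of a polygon is its exterior, any further rings are holes.
    """
    for rings in polygons:
        for ring in rings:
            if ring[0] != ring[-1]:
                raise ValueError("each ring must start and end at the same point")
    parts = ["(" + ", ".join(ring_wkt(r) for r in rings) + ")" for rings in polygons]
    return "MULTIPOLYGON (" + ", ".join(parts) + ")"


# Hypothetical coordinates: one polygon with a hole, one polygon without.
exterior = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
hole = [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]
island = [(5, 5), (6, 5), (6, 6), (5, 5)]

print(multipolygon_wkt([[exterior, hole], [island]]))
```

The three levels of parentheses in the output mirror the list of lists of matrices that st_multipolygon() expects.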
2.2.2 Create simple feature geometry list-column (sfc) and simple feature (sf) from scratch
To make a simple feature geometry list-column (sfc), you can simply supply a list of sfg to the st_sfc() function as follows:
To create an sf object, you first add an sfc as a column to a data.frame.
At this point, it is not yet recognized as an sf by R.
You can register it as an sf object using st_as_sf().
As you can see, sf_ex is now also recognized as an sf object.
[37] Creating spatial objects from scratch yourself is an unnecessary skill for many of us as economists. But, it is still good to know the underlying structure of the data. Also, occasionally the need arises. For example, I had to construct spatial objects from scratch when I designed on-farm randomized nitrogen trials. In such cases, it is of course necessary to understand how different types of sfg are constructed, create sfc from a collection of sfgs, and then create an sf from an sfc.
[38] You will hardly see the other geometry types: MULTIPOINT and GEOMETRYCOLLECTION. You may see GEOMETRYCOLLECTION after intersecting two spatial objects. You can see here if you are interested in learning what they are.
Here we discuss ways to parallelize the process of extracting values from many multi-layer raster files.
We will use the following datasets:
- raster: daily PRISM data 2010 through 2019 stacked by month
- polygons: Regular polygon grids over Iowa
daily PRISM precipitation 2010 through 2019
You can download all the prism files from here. For those who are interested in learning how to generate the series of daily PRISM data files stored by month, see section 9.3 for the code.
6.2.2 Non-parallelized extraction
We have already learned in Chapter 5.3 that extracting values from stacked raster layers is faster than doing so from multiple single-layer raster datasets one at a time. Here, daily precipitation datasets are stacked by year-month and saved as multi-layer GeoTIFF files. For example, PRISM_ppt_y2009_m1.tif stores the daily precipitation data for January, 2009. This is how long it takes to extract values for US counties from a month of daily PRISM precipitation data.
Now, to process all the precipitation data from 2009-2018, we consider two approaches in this section:
- parallelize over polygons and do regular loop over year-month
- parallelize over year-month
6.2.3 Approach 1: parallelize over polygons and do regular loop over year-month
For this approach, let’s measure the time spent on processing one year-month PRISM dataset and then guess how long it would take to process 120 year-month PRISM datasets.
Okay, so this approach does not really help. If we are to process 10 years of daily PRISM data, then it would take roughly 167.39 minutes.
6.2.4 Approach 2: parallelize over the temporal dimension (year-month)
Instead of parallelizing over polygons, let’s parallelize over time (year-month). To do so, we first create a data.frame that has all the year-month combinations we will work on.
The following function extracts data from a single year-month case:
We then loop over the rows of month_year_data in parallel.
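The pattern of parallelizing over year-month can be sketched independently of the R tooling. The following Python analogue is only an illustration under stated assumptions: extract_one is a placeholder that does no raster work, the 2009-2018 range and the file-name pattern follow the PRISM_ppt_y2009_m1.tif example above, and a thread pool is used for simplicity (real CPU-bound extraction would normally use separate processes).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def prism_file(year, month):
    """File name for one year-month stack, following the naming pattern in the text."""
    return f"PRISM_ppt_y{year}_m{month}.tif"


def extract_one(year_month):
    """Placeholder for the per-file extraction; it only reports what it would process."""
    year, month = year_month
    return {"year": year, "month": month, "file": prism_file(year, month)}


# All year-month combinations for 2009 through 2018: 10 years x 12 months = 120 tasks.
year_months = list(product(range(2009, 2019), range(1, 13)))

# Each worker handles one whole year-month file at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract_one, year_months))

print(len(results), results[0]["file"])
```

Because each task is an entire file, the number of workers directly multiplies how much raster data is held in memory at once.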
It took 7.52 minutes. So, Approach 2 is the clear winner.
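As a quick check on the two timings quoted above, the arithmetic can be written out as follows. This sketch uses only the reported numbers (167.39 and 7.52 minutes, 120 year-month files); the variable names are illustrative.

```python
n_year_months = 10 * 12          # ten years of monthly files
approach_1_minutes = 167.39      # extrapolated total for Approach 1
approach_2_minutes = 7.52        # measured total for Approach 2

# Time per year-month file implied by the Approach 1 estimate.
seconds_per_file = approach_1_minutes * 60 / n_year_months

# How many times faster Approach 2 is than the Approach 1 estimate.
speedup = approach_1_minutes / approach_2_minutes

print(f"{seconds_per_file:.1f} seconds per file, speedup of about {speedup:.1f}x")
```

So the Approach 1 estimate corresponds to roughly 84 seconds per year-month file, and Approach 2 is about 22 times faster on these numbers.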
6.2.5 Memory consideration
So far, we have paid no attention to the memory footprint of the parallelized processes. But, it is crucial when parallelizing many large datasets. Approaches 1 and 2 differ substantially in their memory footprints.
Approach 1 divides the polygons into groups and parallelizes over the groups when extracting raster values. Approach 2 extracts and holds raster values for 15 of the whole U.S. polygons. So, Approach 1 clearly has a smaller memory footprint. Approach 2 used about 40 GB of the computer’s memory, almost maxing out the 64 GB of RAM on my computer (R and C++ were not the only things consuming RAM at the time). As long as you do not go over the limit, this is perfectly fine, so Approach 2 is definitely the better option for me. However, if I had 32 GB of RAM, Approach 2 would have suffered a significant loss in its performance, while Approach 1 would not have. Or, if the raster data had twice as many cells with the same spatial extent, then Approach 2 would have suffered a significant loss in its performance, while Approach 1 would not have.
It is easy to come up with a case where Approach 1 is preferable. For example, suppose you have multiple 10-GB raster layers and your computer has 16 GB of RAM. Then, Approach 2 clearly does not work, and Approach 1 is your only choice, which is better than not parallelizing at all.
In summary, while letting each core process a larger amount of data, you need to be careful not to exceed the RAM limit of your computer.
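A back-of-the-envelope version of this check is sketched below. It assumes, as a simplification, that each parallel worker holds one whole raster file in memory and nothing else; the function name and the reserved_gb parameter are made up for illustration.

```python
import math


def max_parallel_workers(ram_gb, per_worker_gb, reserved_gb=0.0):
    """Rough upper bound on workers that fit in RAM at the same time.

    Assumes each worker needs per_worker_gb and that reserved_gb is kept
    free for the operating system and other programs.
    """
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    usable = ram_gb - reserved_gb
    return max(0, math.floor(usable / per_worker_gb))


# The example from the text: 10-GB raster layers on a 16-GB machine.
print(max_parallel_workers(16, 10))   # only one file fits at a time
print(max_parallel_workers(64, 10))   # several files fit on a 64-GB machine
```

When the bound is 1, parallelizing over files gains nothing, which is the situation where splitting the polygons (Approach 1) is the only way to use more cores.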
Knowledge Discovery in Spatial Cartographic Information Retrieval.
LIBRARY CATALOGS FOR MAP COLLECTIONS are not well developed in most libraries. The cartographic information source differs from other kinds of information in that it is usually rectangular in shape and defined by the coordinates of the four map corners. This coordinate information proves difficult for many people to use, unless a certain user interface is designed and knowledge discovery algorithms are implemented. A system with such an interface and algorithms can perform powerful queries that an ordinary text-based information retrieval system cannot. This article describes a prototype system--GeoMatch--which allows users to interactively define geographic areas of interest on a background map. It also allows users to define, qualitatively or quantitatively, the relationship between the user-defined area and the map coverage. The knowledge discovery in database (KDD) factor is analyzed in the retrieval process. Three librarians were interviewed to study the feasibility of the new system. The MARC record format is also discussed to argue that conversion of cartographic material records from an existing library online catalog system to GeoMatch can be done automatically.
Knowledge discovery in databases (KDD) has become a hot topic in recent years. The KDD method has been used in various fields, including spatial database analysis (Xu et al., 1997), automatic classification (Bell, 1998), deviation detection (Schmitz, 1990), and clustering (Cheesman, 1996). This article explores the use of KDD in information retrieval by examining the nature and process of geographic information retrieval. It deals with the characteristics of Geographic Information Systems (GIS), Bibliographic Records for Cartographic Information, and a GIS-based cartographic information retrieval system--GeoMatch.
GIS AND FUNCTIONS RELATED TO THE GIS-BASED INFORMATION RETRIEVAL SYSTEM
The Environmental System Research Institute (ESRI) is the largest GIS software producer in the world. ESRI defines GIS in its manual (Environmental System Research Institute, 1991) as: "An organized collection of computer hardware, software, geographic data, and personnel designed to efficiently capture, store, update, manipulate, analyze, and display all forms of geographically referenced information." Most words in this definition can be found in definitions of many other information systems. What makes GIS special is the term geographically referenced data. GIS uses spatial location as the major link to organize and manipulate information.
A typical GIS has two major functional components--a database management system, which stores and manipulates the data, and a spatial engine, which performs special topological operations on geographic features. A common misunderstanding of GIS is to consider it merely a computerized mapmaker. GIS is a powerful analytical tool that is far more sophisticated than a mapmaker. It is true that some GIS products on the market are simplified for naive GIS users to generate, view, and print maps. These "viewer" software packages often support only limited data manipulation functions. They are not considered fully functional GIS systems. A GIS can perform network analysis, overlay, buffering, and many other operations that few other information systems can accomplish. As Burrough (1990) summarized, a GIS can answer such questions as:
* Where is 785 S. Allen Street in Albany, New York?
* In what census tract is the above address located?
* How many supermarkets are within three miles from the above address?
* A delivery truck needs to deliver items to 200 customers. What is the shortest route and sequence to make the delivery? If road traffic information is available, what is the fastest route to finish the task?
* Given the population in a county, what is the population density? (GIS can calculate the area of the county precisely).
* A new shopping mall is going to be built in the city. The mall should be built at least five miles away from the existing shopping malls next to a major street surrounded by 5,000 residents within four miles and no more than ten miles from the downtown area. Where is the best place to build the new mall?
There are many other questions that only a GIS can answer. One of the GIS functions that is highly related to the geographic information retrieval system is overlay. Some concepts need to be defined to understand the overlay process.
In a GIS, a polygon is an enclosed area bounded by lines, such as a census tract or a county. Consequently, polygons have areas and perimeters that a GIS can calculate. A layer or a theme is a concept for a single feature map in GIS. For example, a county map of Florida showing the average age of a population is a polygon layer. These single-feature layers can be integrated by GIS for analysis.
GIS has the capability of building geometric topology. It can determine which lines are crossing one another to create a node at the cross point. It can detect what lines are connected to create an enclosed polygon. GIS can then generate a polygon object with features like area and perimeter. The topology in a GIS can be expressed as the relationship of points, lines, and polygons. GIS can do sophisticated spatial analysis after the topology is established.
The process of merging multiple layers is called overlay, a unique function of GIS. For example, assume that there are two maps printed on transparencies--a map of census tracts and a map of a lake, all in the same county. If both maps are in exactly the same scale and the four corners of the two maps represent exactly the same locations, the two transparencies can be put together to make a new map--with both census tract boundaries and the lake shore. The new map is the so-called overlay. GIS is very powerful in performing this operation. It can overlay maps with different kinds of features (point, line, polygon) and develop new topologies for further analysis. Burrough (1990) lists forty-four kinds of overlay analysis capabilities that GIS may have. Figure 1 demonstrates the overlay process. The first map layer shows school district boundaries (District C and District D). The second map layer represents county boundaries (County A and County B). During the overlay process, GIS combines the features from both map layers into a third layer that contains four polygons. In the third map layer, each polygon will have attributes from both the county map layer and the school district map layer. For example, area 1 will have its area, perimeter, county name A, school district name C, and other data previously stored in the two map layers. Obviously, it would be difficult to integrate the school district data and county data like this using only database techniques because the data collected represent different areas.
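The county and school district example can be sketched in a few lines of Python under a strong simplifying assumption: every feature is an axis-aligned rectangle. The layout below (two counties side by side, two districts stacked) is hypothetical, since Figure 1 is not reproduced here, and a real GIS overlay handles arbitrary polygons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self):
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)

    @property
    def perimeter(self):
        return 2 * ((self.xmax - self.xmin) + (self.ymax - self.ymin))


def intersect(a, b):
    """Shared rectangle of a and b, or None if their interiors do not meet."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin >= xmax or ymin >= ymax:
        return None
    return Rect(xmin, ymin, xmax, ymax)


def overlay(counties, districts):
    """Combine two layers; each output piece carries attributes from both."""
    pieces = []
    for county, county_geom in counties.items():
        for district, district_geom in districts.items():
            piece = intersect(county_geom, district_geom)
            if piece is not None:
                pieces.append({"county": county, "district": district,
                               "area": piece.area, "perimeter": piece.perimeter})
    return pieces


# Hypothetical layout of the two layers.
counties = {"A": Rect(0, 0, 5, 10), "B": Rect(5, 0, 10, 10)}
districts = {"C": Rect(0, 5, 10, 10), "D": Rect(0, 0, 10, 5)}

result = overlay(counties, districts)
for piece in result:
    print(piece)
```

The output layer has four pieces, and each one knows both its county and its school district, which is the information a plain database join on names could not produce.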
[Figure 1 ILLUSTRATION OMITTED]
KNOWLEDGE DISCOVERY IN DATABASES AND INFORMATION RETRIEVAL
Due to less expensive data storage and increasing computing power, the volume of data collected by various organizations has expanded rapidly. This vast abundance of data, often stored in separate data sets, makes it more difficult to find relevant information. On the other hand, the power of computers also makes it possible to integrate the data sets, compile the facts, and develop the information into "a collection of related inferences" (Trybula, 1997). This is why KDD has received such attention from both the academic and commercial worlds. According to Tuzhilin (1997), the number of papers submitted to the Knowledge Discovery Workshop increased from 40 in 1993 to 215 in 1996.
Fayyad, Piatetsky-Shapiro, and Smyth (1996) define KDD as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns of data" (p. 2). As Trybula (1997) summarized, the methods of evaluating data include algorithms, association, change and deviation determination, visualization, and sixteen other analytical techniques. No matter which method is employed, the key point of KDD is to uncover new, useful, and understandable knowledge.
Information retrieval can be simply expressed as a matching process--matching a user's information need with the information source (School of Information Studies, 1998). In this process, a user must express his/her information need accurately so that the system can retrieve the information. On the other hand, information sources need to be organized in such a way that the most important attributes, such as title, author, subject terms, keywords, publication year, and so on, are readily available.
Text information retrieval systems have become more powerful in the last three decades. Retrieval efficiency and effectiveness have been greatly improved through Boolean operators, truncations, proximity, probability search, and many other search mechanisms. However, some attributes in bibliographic records can create difficulty for exact match in a search. Some attributes are even difficult for users to understand. For example, geographic coordinates are attributes in MARC records for cartographic data. Few users would want or be able to enter exact numbers to match those coordinates. Even fewer would know what the numbers mean. Despite these difficulties, however, could the coordinates be useful in information retrieval? Can they be processed to provide understandable and useful knowledge in selecting relevant information?
This article will demonstrate a prototype of a GIS-based cartographic information retrieval system and illustrate how such a system could indeed generate new and useful knowledge during the retrieval process.
CARTOGRAPHIC INFORMATION RETRIEVAL
Cartographic Information Retrieval in Libraries
An access point is defined as "a name, term, code, etc., under which a bibliographic record may be searched and identified" (Glossary, 1995). An ordinary information retrieval system usually has common access points such as author, title, keywords, subject headings, classification number, and information from other special fields.
In addition to its spatial coverage, a cartographic information source, such as a single sheet map, shares most of the attributes other information sources have, including title and subject terms. A cartographic information source is different from other formats in that, as an information container, it is usually in the shape of a rectangle and contains the coordinates of the four map corners. Nevertheless, most current retrieval systems do not use geographic coordinates as access points because this does not make sense in a text information retrieval system. Many libraries are still in the process of retrospective conversion from card catalogs to text-based online catalogs for their map collections. To study the feasibility of libraries adopting a GIS-based cartographic information retrieval system, long interviews with three librarians were conducted in two libraries in Tallahassee, Florida.
During each interview, a prototype of a GIS-based cartographic information retrieval system (GeoMatch) was demonstrated. The librarians were asked to answer questions concerning the library's map collection, user needs, retrieval tools, and searching procedures. The librarians were also asked to evaluate the usability of the prototype software and assess the usefulness of the system.
Most of the map collection in the Florida State Library consists of historical maps. Although the library is currently outsourcing the map cataloging to an organization associated with OCLC, the card catalog is still the major retrieval tool for the map collection. The library has added only 800 maps to its online catalog. The online catalog features keyword searching, which provides more retrieval power than the card catalog. The card catalog allows searching only from author, title, and subject terms. During the interviews, the librarians indicated that they had seen more patrons using the catalog since the online version was implemented.
The library has no plan yet to digitize (scan) the maps. Patrons usually cannot find needed maps using the card catalog. Some patrons can locate their maps using the online catalog with keyword searching. Generally speaking, patrons primarily rely on the map librarians to find and access maps.
Although the online catalog system cannot provide sufficient assistance for accessing cartographic information, every day many map users do search historic maps, railroad maps, and place names. Great reliance must be placed on the knowledge and expertise of the map librarians.
FLORIDA STATE UNIVERSITY LIBRARY
The Florida State University (FSU) library has a collection of 165,000 single sheet maps, including U. S. Geological Survey maps, road maps, city maps, thematic maps, and historical maps. Records for most of the single sheet maps are maintained in the card catalog. The librarians have started the retrospective conversion of map card catalog records to online catalog records using OCLC. According to the map librarian, most of the records can be found in the OCLC database. During the conversion process, the librarian must make minor changes before adding the OCLC records to the library's online catalog.
The librarians serve many map users every day, including faculty, students, and users referred by other libraries. The map librarians are very familiar with the map collection and usually can find the maps needed. The situation at the FSU library is similar to the one at the Florida State Library--i.e., the map librarians are the most valuable source of information, given the fact that the catalog system for the cartographic data is not very helpful.
In summary, map librarians in both libraries are the most important sources of information for users seeking cartographic data.
Both libraries are in the process of converting cartographic records in the card catalog to the online catalog. The online catalog with searching capability has led to increased map use.
Although most users can access the map information they need with the help of librarians, this situation needs to be improved, for several reasons. First, the map librarians are not certain whether or not they actually find the maps that best match users' needs. Second, none of the librarians think they can provide a complete list of maps that users might be interested in, especially in a library with more than 100,000 maps. Finally, searching for the right information in such a system relies extensively on human expertise. As one librarian said: "It is at the librarian's mercy whether the user can get a satisfactory answer." If current map librarians leave their positions, it would take new map librarians years to familiarize themselves with the library collection. There exists a great demand for a powerful searching tool for the library map collection.
STUDIES OF GEO-BASED RETRIEVAL TOOLS
A literature review indicates that more advanced cartographic information retrieval systems, designed for searching electronic maps, have been created and are still in the process of refinement. The Alexandria Project is probably the most well-known electronic library system dealing with topological relationships.
Smith (1996) described the goal of the Alexandria Project Digital Library (ADL) as "to build a distributed digital library (DL) for geographically-referenced materials. A central function of ADL is to provide users with access to a large range of digital materials, ranging from maps and images to text to multimedia, in terms of geographical reference" (http://www.dlib.org/dlib.org/dlib/march96/briefings/smith/03smith.html).
The Alexandria Atlas Subteam investigates "the design and functionality of an atlas that would support graphical/geographical access to library materials" (http://www.alexandria.ucsb.edu/public-documents/annual-report97/node28.html#SECTION00051300000000000000). As the Alexandria Web site indicates, "spatial searching has not been an available service to library clients and it is not at all clear how ADL clients will react to having actual spatial data available over the Web" (http://www.alexandria.ucsb.edu/public-documents/annual-report97/node28.html#SECTION00051300000000000000). The team is studying such issues as scale, data registration, search result presentation, and fuzzy footprints.
The Alexandria system supports geographical browsing and retrieval using a graphical map interface. An example of the interface can be found at <http://www.dlib.org/dlib/march96/briefings/smith/03smith.html>. Users can zoom in and zoom out on the current view of the map. They can select the map features they wish to see on the background map such as borders and rivers. Users can also select an area of interest and a mode of either OVERLAPS or CONTAINS. An overview of the system is available at <http://www.alexandria.ucsb.edu/adljigi/tutorials/walkthrough1/walkthrou>.
The prototype of GeoMatch has some new functions in addition to those available in the Alexandria system. The purpose of testing GeoMatch is to answer the following two questions: (1) can a GIS/Graphic-based retrieval tool like the Alexandria project be used for nonelectronic cartographic collections in libraries? and (2) what new functions can be developed to improve the GIS-based retrieval tool?
GEO-MATCH--A RETRIEVAL TOOL THAT SEARCHES
Figure 2 illustrates a query screen of the GeoMatch system. In addition to specifying ordinary information needs such as year, title, publisher, keyword, and so on, this system allows a user to interactively identify the area of interest using a mouse. It also asks the user to specify the topological relationship between the map coverage and the user-selected area. The system accepts containment and overlapping relationships as summarized by Cobb and Petry (1998). There are two possible containment relationships--the user-selected area falls entirely within a map coverage or the coverage of a map falls within the user-selected area. Users can choose between the two.
[Figure 2 ILLUSTRATION OMITTED]
If a user decides to select the overlapping relationship, more choices become available to specify quantitatively the degree of overlap. This degree includes the percentage of the overlapping area in maps and the percentage of the overlapping area in the user-selected area. If a user selects 85 percent as the overlapping criterion in the user-selected area, the user will find maps that cover most of the area of interest (Figure 3). If a user selects 85 percent as the overlapping criterion in the map coverage, the user will find maps that concentrate on the selected area (Figure 4). Users can specify how searching results should be ranked based on the degree of overlap.
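The two percentages can be made concrete with a small Python sketch. It assumes both the user-selected area and the map coverage are rectangles given as (west, south, east, north), treats coordinates as planar, and ignores complications such as rectangles crossing the 180-degree meridian; the function names and sample rectangles are invented for illustration and are not taken from the GeoMatch implementation.

```python
def rect_area(rect):
    west, south, east, north = rect
    return max(0.0, east - west) * max(0.0, north - south)


def intersection(a, b):
    """Overlapping rectangle of a and b, or None when they do not overlap."""
    west, south = max(a[0], b[0]), max(a[1], b[1])
    east, north = min(a[2], b[2]), min(a[3], b[3])
    if west >= east or south >= north:
        return None
    return (west, south, east, north)


def overlap_shares(user_area, map_coverage):
    """Return (share of the user area that overlaps, share of the map that overlaps)."""
    common = intersection(user_area, map_coverage)
    if common is None:
        return 0.0, 0.0
    shared = rect_area(common)
    return shared / rect_area(user_area), shared / rect_area(map_coverage)


user_area = (0.0, 0.0, 10.0, 10.0)
wide_map = (-20.0, -20.0, 30.0, 30.0)   # large sheet that swallows the user area
local_map = (2.0, 2.0, 9.0, 9.0)        # small sheet inside the user area

print(overlap_shares(user_area, wide_map))    # high share of user area, low share of map
print(overlap_shares(user_area, local_map))   # lower share of user area, full share of map
```

With an 85 percent criterion, the wide sheet passes only the test on the user-selected area, while the local sheet passes only the test on the map coverage, which is the distinction between maps that cover the area and maps that concentrate on it.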
[Figures 3-4 ILLUSTRATION OMITTED]
The key features of the prototype are its capabilities for the user to interactively identify the area of interest, to quantitatively specify the relationship between the user-defined area and the map coverage, and to rank the search results based on the degree of overlap.
USE OF GRAPHICS TO EXPRESS INFORMATION NEED
Cartographic information is geographically referenced--it represents locations and areas on the earth. Conventional information representation using text and symbols is not very useful in describing the information included in a map because there are too many geographic features included in an area. For example, a railroad map in Florida can be indexed using the keywords railroad and Florida. However, the map also includes all the railroads in each county in Florida. It indicates railroad construction in the Jacksonville area and demonstrates the railroad near Lake xxx. It is practically impossible to index all the place names included in an area. When a user draws a box to specify an area of interest, the information requested would require many words to describe it. A graphic interface can hide the coordinate numbers and present them in scalable graphics, which makes it much easier for users to discover the cartographic information resources of interest.
In addition to the information representation issue discussed earlier, a graphic interface also avoids trouble for users when changes in place names and county boundaries occur or when they simply do not know the exact name to begin the search.
LEVEL 1 IN KD--SPECIFYING TOPOLOGICAL RELATIONSHIPS QUALITATIVELY BETWEEN THE USER-DEFINED AREA AND THE MAP COVERAGE
As discussed earlier, the Alexandria Project can specify topological relationships qualitatively between the user-defined area and the map coverage in its electronic cartographic information retrieval system. This matching process goes beyond the exact matching in a conventional information retrieval system. The computer system will calculate the topological relationship between the user-defined area and the coverage of the maps to determine whether they overlap or one completely contains another.
Cobb and Petry (1998) presented a model for defining and representing binary topological and directional relationships between two-dimensional objects. Such relationships can be used for fuzzy querying. Cobb and Petry (1998) summarize that there are four kinds of major relationships--disjoint, tangent (next to each other), overlapping, and containment. The assumption for GeoMatch is that users would find overlapping and containment most useful when querying the system.
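For rectangular areas, the four kinds of relationships can be told apart by comparing corner coordinates. The Python sketch below is one possible reading of those categories, not Cobb and Petry's model itself: rectangles are (west, south, east, north) tuples, "tangent" means the boundaries touch without any shared interior, and exact floating-point comparison is used for brevity.

```python
def classify(a, b):
    """Return 'disjoint', 'tangent', 'containment', or 'overlapping' for two rectangles."""
    # Width and height of the would-be intersection; negative means a gap.
    dx = min(a[2], b[2]) - max(a[0], b[0])
    dy = min(a[3], b[3]) - max(a[1], b[1])
    if dx < 0 or dy < 0:
        return "disjoint"
    if dx == 0 or dy == 0:
        return "tangent"   # boundaries touch along an edge or at a corner

    def contains(outer, inner):
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and outer[2] >= inner[2] and outer[3] >= inner[3])

    if contains(a, b) or contains(b, a):
        return "containment"
    return "overlapping"


user_area = (0, 0, 10, 10)
print(classify(user_area, (20, 20, 30, 30)))   # disjoint
print(classify(user_area, (10, 0, 20, 10)))    # tangent
print(classify(user_area, (5, 5, 15, 15)))     # overlapping
print(classify(user_area, (2, 2, 8, 8)))       # containment
```

A retrieval system following the assumption above would return only the maps classified as overlapping or containment.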
The operations involved in the above include conversion from screen coordinates to the real world coordinates and comparison of the coordinates of the corners of the user-defined area and map boundaries. The new knowledge--whether two areas overlap--is generated in this process. The knowledge acquired can be utilized to lead users to the relevant information source. GeoMatch provides users with an additional choice beyond the Alexandria system with which to define the containment relationship.
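The conversion from screen coordinates to real-world coordinates can be sketched as a linear mapping. The Python example below assumes the background map is drawn so that longitude and latitude vary linearly with pixel position, which is a simplification; the function names, the window size, and the view extent are hypothetical and not taken from GeoMatch.

```python
def screen_to_world(px, py, width_px, height_px, view):
    """Map a pixel position to (longitude, latitude).

    view is the geographic extent shown on screen as (west, south, east, north).
    Screen y grows downward, so it is flipped relative to latitude.
    """
    west, south, east, north = view
    lon = west + (px / width_px) * (east - west)
    lat = north - (py / height_px) * (north - south)
    return lon, lat


def box_from_drag(start_px, end_px, width_px, height_px, view):
    """Turn two mouse positions (opposite corners of a drag) into a geographic rectangle."""
    lon1, lat1 = screen_to_world(*start_px, width_px, height_px, view)
    lon2, lat2 = screen_to_world(*end_px, width_px, height_px, view)
    return (min(lon1, lon2), min(lat1, lat2), max(lon1, lon2), max(lat1, lat2))


# Hypothetical 800 x 600 pixel window showing an extent of 8 by 7 degrees.
view = (-88.0, 24.0, -80.0, 31.0)
print(screen_to_world(400, 300, 800, 600, view))
print(box_from_drag((600, 450), (200, 150), 800, 600, view))
```

The resulting rectangle is what gets compared, corner by corner, with the coordinates stored for each map.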
LEVEL 2 IN KD--SPECIFYING A TOPOLOGICAL RELATIONSHIP QUANTITATIVELY BETWEEN THE USER-DEFINED AREA (RECTANGLE) AND THE MAP COVERAGE
Specifying a topological relationship quantitatively between the user-defined area and the map coverage is a unique feature of the GeoMatch system. In this process, not only is the topological relationship of the two areas determined, more mathematical calculation is performed to estimate how much the two areas overlap. By combining the information input by users and the data stored in the database, the computer algorithm discovers new knowledge not explicitly represented in the database. Since the user-defined area is rectangular, the calculation involved is not overwhelming and can be realized using a conventional programming language such as C++ or Visual Basic.
This feature allows the system to achieve a higher recall and precision than those systems without this function. Gluck (1995) made an analysis of the relevance and competence in evaluating the performance of information systems. He indicated that "relevance judgments by users most often assess the qualities of retrieved materials item by item at a particular point in time and within a particular user context" (p. 447). Using the qualitative topological matching technique described in Level 1 above, there could be a large gap between the relevance of the system's view and the relevance of the user's view. For example, users may find that some retrieved maps cover only a small part of the area of interest and in fact are useless, but these maps are relevant from the system's view since they overlap the user-defined area. Users may also find that some retrieved maps cover such a large area that the area of actual interest encompasses only a small portion of the whole map. These maps are relevant too from the system's view but, again, practically useless for users. The reason for such a gap between the user's view and system's view is that not enough "knowledge" is discovered and provided for users to describe their information need in more detail. The techniques employed in the quantitative topological matching can greatly reduce the gap of relevance between the two perspectives. In addition, GeoMatch can calculate the spatial relevance of the maps to the area of interest and rank the results using the quantitative overlapping factor, while many systems fail to "provide useful ordering of retrieved records" (Larson, McDonough, O'Leary, Kuntz, & Moon, 1990, p. 550). This function is particularly helpful for users when hundreds of maps are included in the result set.
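Ranking by the quantitative overlapping factor can be sketched as a filter followed by a sort. In the Python example below, the sheet titles, coordinates, and the choice to rank by the share of the user-defined area that each map covers are all assumptions made for illustration; the article does not specify GeoMatch's exact ranking formula.

```python
def share_of_user_area(user_area, coverage):
    """Fraction of the user rectangle covered by a map rectangle (west, south, east, north)."""
    width = min(user_area[2], coverage[2]) - max(user_area[0], coverage[0])
    height = min(user_area[3], coverage[3]) - max(user_area[1], coverage[1])
    if width <= 0 or height <= 0:
        return 0.0
    user_size = (user_area[2] - user_area[0]) * (user_area[3] - user_area[1])
    return (width * height) / user_size


def rank_maps(user_area, records, min_share=0.0):
    """Return (title, share) pairs with share > 0 and >= min_share, best match first."""
    scored = []
    for title, coverage in records:
        share = share_of_user_area(user_area, coverage)
        if share > 0 and share >= min_share:
            scored.append((title, share))
    # Sort by decreasing share; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


user_area = (0.0, 0.0, 10.0, 10.0)
records = [                      # hypothetical catalog entries
    ("Sheet C", (9.0, 9.0, 20.0, 20.0)),
    ("Sheet A", (0.0, 0.0, 10.0, 10.0)),
    ("Sheet D", (50.0, 50.0, 60.0, 60.0)),
    ("Sheet B", (5.0, 0.0, 15.0, 10.0)),
]

for title, share in rank_maps(user_area, records):
    print(f"{title}: {share:.0%}")
```

A purely qualitative match would return Sheets A, B, and C as equally relevant; the quantitative score pushes the sheet that barely touches the area of interest to the bottom, and a threshold such as min_share=0.85 removes it altogether.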
LEVEL 3 IN KD--SPECIFYING TOPOLOGICAL RELATIONSHIP QUANTITATIVELY BETWEEN USER-DEFINED AREA (FREE STYLE) AND MAP COVERAGE
Specifying a topological relationship quantitatively between a user-defined area and map coverage differs from Level 2 in that users are allowed to use the mouse to define an irregular area of interest rather than a rectangle. This feature can help users express their information need more precisely. For example, a user interested in the shore area of a lake can draw an irregular outline around the lake and perform a search.
This process involves complicated topological calculations that are difficult to accomplish using conventional programming languages. The GIS overlay function introduced at the beginning of this discussion needs to be used to generate new polygons and calculate the areas involved. Although the GeoMatch prototype currently does not have this feature, this function could be implemented using third-party GIS software such as the Spatial Engine from ESRI.
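To give a sense of the calculation involved, here is a minimal Python sketch that clips a free-form polygon to a rectangular map coverage and measures the shared area. It uses Sutherland-Hodgman clipping and the shoelace formula as simple stand-ins for a full GIS overlay; the function names are hypothetical, and this is not how GeoMatch or any ESRI product implements the operation. It assumes planar coordinates and a simple (non-self-intersecting) outline; concave outlines may produce degenerate connecting edges along the rectangle boundary.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Absolute area of a polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _cross_x(x: float) -> Callable[[Point, Point], Point]:
    """Intersection of segment p-q with the vertical line at x."""
    def intersect(p: Point, q: Point) -> Point:
        t = (x - p[0]) / (q[0] - p[0])
        return (x, p[1] + t * (q[1] - p[1]))
    return intersect


def _cross_y(y: float) -> Callable[[Point, Point], Point]:
    """Intersection of segment p-q with the horizontal line at y."""
    def intersect(p: Point, q: Point) -> Point:
        t = (y - p[1]) / (q[1] - p[1])
        return (p[0] + t * (q[0] - p[0]), y)
    return intersect


def clip_to_rectangle(polygon: List[Point], west: float, south: float,
                      east: float, north: float) -> List[Point]:
    """Clip a polygon against each side of the rectangle in turn."""
    sides = [
        (lambda p: p[0] >= west, _cross_x(west)),
        (lambda p: p[0] <= east, _cross_x(east)),
        (lambda p: p[1] >= south, _cross_y(south)),
        (lambda p: p[1] <= north, _cross_y(north)),
    ]
    points = list(polygon)
    for inside, intersect in sides:
        clipped = []
        for i in range(len(points)):
            current, previous = points[i], points[i - 1]
            if inside(current):
                if not inside(previous):
                    clipped.append(intersect(previous, current))
                clipped.append(current)
            elif inside(previous):
                clipped.append(intersect(previous, current))
        points = clipped
        if not points:
            break
    return points


def overlap_fraction(polygon: List[Point], west: float, south: float,
                     east: float, north: float) -> float:
    """Share of the user-drawn polygon that falls inside the map rectangle."""
    area = polygon_area(polygon)
    if area == 0:
        return 0.0
    return polygon_area(clip_to_rectangle(polygon, west, south, east, north)) / area
```

For a triangle with vertices (0, 0), (4, 0), (0, 4) and a map rectangle spanning 1 to 3 on both axes, a quarter of the triangle lies inside the rectangle.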
MARC RECORD FOR CARTOGRAPHIC INFORMATION RESOURCES
Whether an information system can be adopted depends not only on its creativity and usefulness but also on the degree of difficulty in converting the current system to the new system. MARC record format is studied to examine what new information needs to be collected to use GeoMatch.
US MARC (Machine Readable Cataloging), developed by the Library of Congress, follows the national standard (ANSI/NISO Z39.2) and the corresponding international standard. It is the basic format of bibliographic description in the United States. Most online catalogs have a MARC interface for data import and export. OCLC, the bibliographic utility, also provides records in MARC format for members to share.
The current MARC format provides sufficient geographic information to support a more powerful searching tool such as GeoMatch. The most important field is Field 034--Coded Mathematical Data Area Field (Mangan, 1984). If a single set of scales is used, the first indicator is set to "1." The subfield codes include $b (ratio linear horizontal scale), $c (ratio linear vertical scale), $d (coordinates--westernmost longitude), $e (coordinates--easternmost longitude), $f (coordinates--northernmost latitude), and $g (coordinates--southernmost latitude). The following is an example of the MARC record 034 field:
The field above illustrates that the map covers an area from West 164 [degrees] 00'00" to West 044 [degrees] 00'00" in longitude and from North 090 [degrees] 00'00" to North 040 [degrees] 00'00" in latitude. This demonstrates that MARC records are capable of defining the scope of a map, and the data are usable in systems like GeoMatch. No additional value-adding operations are necessary unless the bibliographic record of a map is not available from the OCLC database or no matching MARC record is available for the map. If a library already has its map collection in its online catalog, all the records can be imported into GeoMatch automatically.
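The conversion from 034 coordinate subfields to numbers a program can compare might look like the Python sketch below. It assumes the coordinates are stored in the hemisphere-plus-degrees-minutes-seconds form `hdddmmss` (for example `W1640000`); the function names and the sample subfield strings are constructed for illustration from the coverage described above and are not copied from an actual record.

```python
import re
from typing import Dict, Tuple

# Assumed encoding: one hemisphere letter, then ddd mm ss with no separators.
_COORDINATE = re.compile(r"^([NSEW])(\d{3})(\d{2})(\d{2})$")


def parse_034_coordinate(value: str) -> float:
    """Convert e.g. 'W1640000' to signed decimal degrees (-164.0)."""
    match = _COORDINATE.match(value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized 034 coordinate: {value!r}")
    hemisphere, degrees, minutes, seconds = match.groups()
    decimal = int(degrees) + int(minutes) / 60 + int(seconds) / 3600
    # South latitudes and west longitudes are negative by convention.
    return -decimal if hemisphere in "SW" else decimal


def bounding_box(subfields: Dict[str, str]) -> Tuple[float, float, float, float]:
    """Return (west, south, east, north) from subfields $d, $g, $e, $f."""
    return (
        parse_034_coordinate(subfields["d"]),
        parse_034_coordinate(subfields["g"]),
        parse_034_coordinate(subfields["e"]),
        parse_034_coordinate(subfields["f"]),
    )


# Hypothetical subfield values matching the coverage described in the text.
example = {"d": "W1640000", "e": "W0440000", "f": "N0900000", "g": "N0400000"}
print(bounding_box(example))
```

Once the four values are numeric, they can be compared directly against a user-defined rectangle.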
When librarians at the Florida State Library reviewed the prototype for GeoMatch, they realized that it could give answers to difficult questions. For example, towns may disappear over time, county boundaries may change, and users might not remember an exact place name. In such cases, GeoMatch could be very helpful.
Florida State University Library
The librarian showed interest in the GeoMatch system. She thought the system could be useful but should be integrated with the university library catalog system. When the librarian was asked whether the GeoMatch system could solve some difficult-to-answer questions, she provided the following example:
In summary, librarians in both libraries confirmed the need for a retrieval tool with a graphic user interface facilitating location-based searching. Such a tool is especially important when a user does not know the exact place name but knows approximately the locations of interest or when the name of a place has changed.
Nevertheless, while the librarians judged the system to be creative and potentially useful, they were not eager to implement such a system in their own libraries.
New spatial information retrieval tools are needed to improve the efficiency and effectiveness of geographically referenced searching. The GeoMatch prototype demonstrates that a graphic-based interface can mine the geographical data buried in MARC records and other geospatial sources and visualize the new knowledge discovered in these data. Combined with the text retrieval capability, this knowledge discovery tool provides users with greater flexibility in locating the information they need. Discovering knowledge in geospatial data is distinct from text information searching because it uses algorithms to convert coordinate information into user-understandable and useful knowledge.
The main contribution of GeoMatch is the quantitative analysis of the relationship in the retrieval process. Not only can it help users to more precisely define their information need and adjust the searching strategy, but it can also be used to rank the results.
The study of the MARC format shows that it supports the data requirements of GeoMatch, and no additional information is required for converting an existing online catalog to GeoMatch.
Future research in geospatial information retrieval systems will focus on the usability of the system and the theoretical framework of spatial information retrieval, including:
1. usability testing of GeoMatch to study the user friendliness and usefulness of the system
2. field testing of implementing GeoMatch in a library catalog system
3. evaluation of the efficiency and effectiveness of the quantitative overlapping function
4. design of the formula and algorithms to rank the searching result using factors from spatial comparison and factors from text information retrieval such as keywords
6. application of such a system to information sources other than paper maps, including electronic images and information that can be geographically referenced and
7. accessibility of such a system over the Web.
Results from these studies could enrich the theories in spatial information retrieval and lead to more powerful and user-friendly information retrieval tools.
Bell, D. A., & Guan, J. W. (1998). Computational methods for rough classification and discovery. Journal of the American Society for Information Science, 49(5), 403-414.
Burrough, P. A. (1990). Principles of geographical information systems for land resources assessment. Oxford: Clarendon Press.
Cheeseman, P., & Stutz, J. (1996). Bayesian classification (autoclass): Theory and results. In U. M. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 153-180). Menlo Park, CA: AAAI Press.
Cobb, M. A., & Petry, F. E. (1998). Modeling spatial relationships within a fuzzy framework. Journal of the American Society for Information Science, 49(3), 253-266.
Environmental Systems Research Institute. (1991). Understanding GIS. Redlands, CA: ESRI.
Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In U. M. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 1-34). Menlo Park, CA: AAAI Press.
Glossary. (1995). Retrieved August 18, 1999 from the World Wide Web: http://www.libraries.rutgers.edu/rulib/abtlib/alexlib/glossary-html.
Gluck, M. (1995). Understanding performance in information systems: Blending relevance and competence. Journal of the American Society for Information Science, 46(6), 446-460.
Larson, R. R., McDonough, J., O'Leary, P., Kuntz, L., & Moon, R. (1996). Cheshire II: Designing a next-generation online catalog. Journal of the American Society for Information Science, 47(7), 555-567.
Mangan, E. U. (1984). MARC conversion manual--maps: Content designation conventions and procedures for AACR2. Washington, DC: Library of Congress.
Schmitz, J. (1990). Coverstory--automated news finding in marketing. Interfaces, 20(6), 29-38.
School of Information Studies, FSU. (1999). Foundations of information studies. Retrieved May 17, 1999 from the World Wide Web: http://slis-one.lis.fsu.edu/courses/5230/.
Smith, T. R. (1996). A brief update on the Alexandria digital library project--constructing a digital library for geographically-referenced materials. Retrieved August 6, 1999 from the World Wide Web: http://alexandria.sdc.ucsb.edu.
Smith, T. R. (1998). Alexandria atlas subteam. Retrieved August 6, 1999 from the World Wide Web: http://alexandria.sdc.ucsb.edu.
Trybula, W. J. (1997). Data mining and knowledge discovery. In M. E. Williams (Ed.), Annual review of information science and technology (pp. 197-229). Medford, NJ: Information Today.
Tuzhilin, A. (1997). Editor's introduction to the special issue on knowledge discovery and its applications to business decision-making. Decision Support Systems, 21(1), 1-2.
Xu, X. W., Ester, M., Kriegel, H. P., & Sander, J. (1997). Clustering and knowledge discovery in spatial databases. Vistas in Astronomy, 41(3), 397-403.
Carter, C. L., & Hamilton, J. (1998). Efficient attribute-oriented generalization for knowledge discovery from large databases. IEEE transactions on knowledge and data engineering, 10(2), 193-208.
Chen, Z., & Zhu, Q. (1998). Query construction for user-guided knowledge discovery in databases. Journal of Information Sciences, 109(1-4), 49-64.
Connaway, L. S., Kochtanek, T. R., & Adams, D. (1994). MARC bibliographic records: Considerations and conversion procedures for microcomputer database programs. Microcomputers for Information Management, 11(2), 69-88.
Deogun, J. S., Choubey, S. K., Raghavan, V. V., & Sever, H. (1998). Feature selection and effective classifiers. Journal of the American Society for Information Science, 49(5), 423-434.
Maddouri, M., Elloumi, S., & Jaoua, A. (1998). An incremental learning system for imprecise and uncertain knowledge discovery. Journal of Information Science, 109(1-4), 149-164.
Morik, K., & Brockhausen, P. (1997). A multistrategy approach to relational knowledge discovery in databases. Machine Learning, 27(3), 287-312.
Vickery, B. (1997). Knowledge discovery from databases: An introductory review. Journal of Documentation, 53(2), 107-122.
Lixin Yu, School of Information Studies, Florida State University, Tallahassee, FL 32306-2100
LIXIN YU is an Assistant Professor at the School of Information Studies, Florida State University, where he teaches courses in database management, user interface design, and information system design and development. He worked as a Project Manager at Geosocial Resources, Inc. and has been working on Geographic Information System projects since 1990. He has published articles on GIS including "Geographic Information Systems in Library Reference Services: Development and Challenge" (Reference Librarian, February 1998) and "Assessing the Efficiency and Accuracy of Street Address Geocoding Strategies" (Proceedings of GIS '97, December 1997).
For many projects, it would be nearly impossible to gather all of the necessary data on your own. That’s where external data sources come in. Regardless of where the data comes from, GIS software can overlay all of the information into a single, layered map.
Any information tied to a specific location can be a part of GIS data collection. According to National Geographic, there are four main categories of GIS data:
- Cartographic data: cartographic data is already in a map format and describes the location of features, the location of buildings, survey information, etc.
- Photographic data: photographic data can be used to analyze and map features from print and digital photos, satellite imagery, and aerial photography.
- Digital data: Digital data includes any information that’s already in digital format, including tables, satellite findings, and any data that’s been digitized by another GIS professional.
- Spreadsheet data: This includes information in tables and spreadsheets, which typically need to be formatted as an Excel or CSV (comma-separated values) file. Spreadsheets are often the go-to source for demographic information such as age, income levels, or even spending habits.
While there’s no shortage of public data, there’s also little to no standardization, making it difficult to find data in the right format. However, just because data isn’t formatted correctly doesn’t necessarily mean it’s unusable – it just needs to be translated.
There are two main components to translating data for GIS software: syntactic and semantic translation. Syntactic translation is by far the easier of the two, as it only involves translating symbols such as letters and numbers between systems. Semantic translation, on the other hand, is more complicated. It aims to decipher the meaning behind the data, and though progress has been made, semantic translation tends not to be very accurate.
GIS Introduction by David J. Buckley
Data editing and verification is a response to the errors that arise during the encoding of spatial and non-spatial data. The editing of spatial data is a time consuming, interactive process that can take as long as, if not longer than, the data input process itself.
Several kinds of errors can occur during data input. They can be classified as:
- Incompleteness of the spatial data. This includes missing points, line segments, and/or polygons.
- Locational placement errors of spatial data. These types of errors usually are the result of careless digitizing or poor quality of the original data source.
- Distortion of the spatial data. This kind of error is usually caused by base maps that are not scale-correct over the whole image, e.g. aerial photographs, or from material stretch, e.g. paper documents.
- Incorrect linkages between spatial and attribute data. This type of error is commonly the result of incorrect unique identifiers (labels) being assigned during manual key in or digitizing. This may involve the assigning of an entirely wrong label to a feature, or more than one label being assigned to a feature.
- Attribute data is wrong or incomplete. Often the attribute data does not match exactly with the spatial data. This is because they are frequently from independent sources and often different time periods. Missing data records or too many data records are the most common problems.
The identification of errors in spatial and attribute data is often difficult. Most spatial errors become evident during the topological building process. The use of check plots to clearly determine where spatial errors exist is a common practice. Most topological building functions in GIS software clearly identify the geographic location of the error and indicate the nature of the problem. Comprehensive GIS software allows users to graphically walk through and edit the spatial errors. Others merely identify the type and coordinates of the error. Since this is often a labour intensive and time consuming process, users should consider the error correction capabilities very important during the evaluation of GIS software offerings.
Spatial Data Errors
A variety of common data problems occur in converting data into a topological structure. These stem from the original quality of the source data and the characteristics of the data capture process. Usually data is input by digitizing. Digitizing allows a user to trace spatial data from a hard copy product, e.g. a map, and have it recorded by the computer software. Most GIS software has utilities to clean the data and build a topologic structure. If the data is unclean to start with, for whatever reason, the cleaning process can be very lengthy. Interactive editing of data is a distinct reality in the data input process.
Experience indicates that in the course of any GIS project 60 to 80 % of the time required to complete the project is involved in the input, cleaning, linking, and verification of the data.
The most common problems that occur in converting data into a topological structure include:
- slivers and gaps in the line work;
- dead ends, also called dangling arcs, resulting from overshoots and undershoots in the line work; and
- bow ties or weird polygons from inappropriate closing of connecting features.
Of course, topological errors only exist with linear and areal features. They become most evident with polygonal features. Slivers are the most common problem when cleaning data. Slivers frequently occur when coincident boundaries are digitized separately, e.g. once each for adjacent forest stands, once for a lake and once for the stand boundary, or after polygon overlay. Slivers often appear when combining data from different sources, e.g. forest inventory, soils, and hydrography. It is advisable to digitize data layers with respect to an existing data layer, e.g. hydrography, rather than attempting to match data layers later. A proper plan and definition of priorities for inputting data layers will save many hours of interactive editing and cleaning.
Dead ends usually occur when data has been digitized in a spaghetti mode, or without snapping to existing nodes. Most GIS software will clean up undershoots and overshoots based on a user defined tolerance, e.g. distance. The definition of an inappropriate distance often leads to the formation of bow ties or weird polygons during topological building. Tolerances that are too large will force arcs that should not be connected to snap to one another. The result is small polygons called bow ties. The definition of a proper tolerance for cleaning requires an understanding of the scale and accuracy of the data set.
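The effect of the tolerance can be sketched in a few lines of Python. The function below moves each line endpoint onto the nearest existing node when that node lies within the tolerance; the function name and data layout are assumptions for illustration and do not reproduce the cleaning routine of any particular GIS package.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def snap_endpoints(lines: List[List[Point]], nodes: List[Point],
                   tolerance: float) -> List[List[Point]]:
    """Snap the first and last vertex of each line to the nearest node within tolerance."""
    snapped = []
    for line in lines:
        new_line = list(line)
        for index in (0, -1):
            x, y = new_line[index]
            nearest = min(
                nodes,
                key=lambda node: math.hypot(node[0] - x, node[1] - y),
                default=None,
            )
            if nearest is not None and math.hypot(nearest[0] - x, nearest[1] - y) <= tolerance:
                new_line[index] = nearest
        snapped.append(new_line)
    return snapped


nodes = [(0.0, 0.0), (10.0, 0.0)]
line = [(4.0, 5.0), (9.8, 0.1)]  # the end point undershoots the node at (10, 0)

# A small tolerance closes the undershoot and leaves the free end alone.
print(snap_endpoints([line], nodes, tolerance=0.5))

# A tolerance that is too large also drags the free end onto a node it
# was never meant to touch, which is how spurious polygons can arise.
print(snap_endpoints([line], nodes, tolerance=7.0))
```

Choosing the tolerance from the scale and accuracy of the data set, as the text recommends, keeps the first behaviour and avoids the second.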
The other problem that commonly occurs when building a topologic data structure is duplicate lines. These usually occur when data has been digitized or converted from a CAD system. The lack of topology in these types of drafting systems permits the inadvertent creation of elements that are exactly duplicate. However, most GIS packages afford automatic elimination of duplicate elements during the topological building process. Accordingly, it may not be a concern with vector based GIS software. Users should be aware of the duplicate element that retraces itself, e.g. a three-vertex line where the first point is also the last point. Some GIS packages do not identify these feature inconsistencies and will build such a feature as a valid polygon. This is because the topological definition is mathematically correct, but it is not geographically correct. Most GIS software will provide the capability to eliminate bow ties and slivers by means of a feature elimination command based on area, e.g. polygons less than 100 square metres. The ability to define custom topological error scenarios and provide for semi-automated correction is a desirable capability for GIS software.
The adjoining figure illustrates some typical errors described above. Can you spot them? They include undershoots, overshoots, bow ties, and slivers. Most bow ties occur when inappropriate tolerances are used during the automated cleaning of data that contains many overshoots. This particular set of spatial data is a prime candidate for numerous bow tie polygons.
Attribute Data Errors
The identification of attribute data errors is usually not as simple as spatial errors. This is especially true if these errors are attributed to the quality or reliability of the data. Errors as such usually do not surface until later on in the GIS processing. Solutions to these types of problems are much more complex and often do not exist entirely. It is much more difficult to spot errors in attribute data when the values are syntactically good, but incorrect.
Simple errors of linkage, e.g. missing or duplicate records, become evident during the linking operation between spatial and attribute data. Again, most GIS software contains functions that check for and clearly identify problems of linkage during attempted operations. This is also an area of consideration when evaluating GIS software.
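A linkage check of this kind can be sketched in Python as a comparison of the identifiers on the spatial side with those on the attribute side. The function and report keys below are hypothetical names chosen for this example, not the output of any specific GIS product.

```python
from collections import Counter
from typing import Dict, Iterable, List


def check_linkage(feature_ids: Iterable[str],
                  attribute_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Report missing and duplicate identifiers between features and attribute records."""
    feature_counts = Counter(feature_ids)
    attribute_counts = Counter(attribute_ids)
    return {
        # Features whose label has no matching attribute record.
        "features_without_attributes": sorted(set(feature_counts) - set(attribute_counts)),
        # Attribute records that point at no feature.
        "attributes_without_features": sorted(set(attribute_counts) - set(feature_counts)),
        # Labels assigned to more than one feature.
        "duplicate_feature_labels": sorted(k for k, n in feature_counts.items() if n > 1),
        # Identifiers that occur in more than one attribute record.
        "duplicate_attribute_records": sorted(k for k, n in attribute_counts.items() if n > 1),
    }


report = check_linkage(
    feature_ids=["A1", "A2", "A2", "A4"],
    attribute_ids=["A1", "A2", "A3", "A3"],
)
for problem, ids in report.items():
    print(problem, ids)
```

Such a check only finds structural problems; a record that links correctly but holds a wrong value still passes, which is the harder case described above.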
Six clear steps stand out in the data editing and verification process for spatial data. These are:
Visual review. This is usually by check plotting.
These data verification steps occur after the data input stage and prior to or during the linkage of the spatial data to the attributes. Data verification ensures the integrity between the spatial and attribute data. Verification should include some brief querying of attributes and cross checking against known values.
We will continue to use the COVID-19 dataset. Please see Chapter 11 for details on the data.
Using these data, you are required to address the following challenges:
Fit a varying-slope model. Let one slope vary by region. Think carefully about your choice.
Fit a varying-intercept and varying-slope model.
Compare the results for models fitted in 1 and 2. Which is better? Why?
Use the same explanatory variables used for the Chapter 7 challenge, so you can compare the model results from this chapter.
2 Answers
Inspired by @dk14 's answer, now I have a clearer mind on this question, though I don't completely agree with his answer. And I hope to post mine online for more confirmation.
In the vanilla case, where the input to the original AlexNet is still (224,224,3), a series of Conv and pooling layers brings us to the last Conv layer. At this point, the size of the image has become (7,7,512).
At the first converted Conv layer (converted from FC1), we have 4096 filters of size (7,7,512), which produce a (1,1,4096) vector. At the second converted Conv layer (converted from FC2), we have 4096 filters of size (1,1,4096), which give an output vector of (1,1,4096). It is important to remember that, in the conversion, the filter size must match the input volume size. That is why we have 1x1 filters here. Similarly, the last converted Conv layer has 1000 filters of size (1,1,4096) and gives a result for 1000 classes.
The process is summarized in the post: http://cs231n.github.io/convolutional-networks/#convert.
In FC1, the original weight matrix has size (7*7*512, 4096), meaning each of the 4096 neurons in FC1 is connected to every activation in the (7,7,512) input volume. After conversion, the weights have shape (7,7,512,4096), meaning we have 4096 filters of shape (7,7,512). It is like taking the weights of each neuron out of the original gigantic matrix and reshaping them accordingly.
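Here is a small NumPy sketch of that reshaping, to check that the converted layer gives the same numbers as the fully connected one. It uses 8 channels and 16 neurons in place of 512 and 4096 to keep the arrays small, and it assumes the FC layer flattens the volume in row-major (height, width, channel) order; the function names are made up for the example.

```python
import numpy as np


def fc_forward(volume, fc_weights):
    """Fully connected layer: flatten the (H, W, C) volume, then matrix-multiply."""
    return volume.reshape(-1) @ fc_weights  # shape (K,)


def converted_conv_forward(volume, fc_weights):
    """Same weights viewed as K filters of shape (H, W, C).

    With F equal to the input size, P = 0 and S = 1 there is exactly one
    valid filter position, so each filter yields one multiply-and-sum.
    """
    h, w, c = volume.shape
    k = fc_weights.shape[1]
    filters = fc_weights.reshape(h, w, c, k)  # one (H, W, C) filter per neuron
    out = np.einsum("hwc,hwck->k", volume, filters)
    return out.reshape(1, 1, k)


rng = np.random.default_rng(0)
volume = rng.standard_normal((7, 7, 8))            # stand-in for (7, 7, 512)
fc_weights = rng.standard_normal((7 * 7 * 8, 16))  # stand-in for (7*7*512, 4096)

fc_out = fc_forward(volume, fc_weights)
conv_out = converted_conv_forward(volume, fc_weights)
print(conv_out.shape, np.allclose(fc_out, conv_out.ravel()))
```

If a framework stores activations in a different order (for example channels first), the reshape has to follow that order instead, otherwise the weights end up paired with the wrong inputs.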
Let's start with the $F = 7$, $P = 0$, $S = 1$ notation. What does it actually mean?
$F = 7$: the receptive field size is set to its maximum value (7 for 1D, 7x7 for 2D), which implies no parameter sharing (as there is only one receptive field), which is the default for an MLP. If F were equal to 1, all connections (from the image above) would always have an identical weight.
$S = 1$: the stride equals 1, which means that no neurons on the next layer are going to be removed (see figure below). Given $F = 7$, if we had stride = 2, the number of next-layer nodes would be halved. Source: http://cs231n.github.io/convolutional-networks
$P = 0$: no zero padding, as we don't need it for a full receptive field (there are no uncovered units, as you can see from the image above).
Those three conditions basically guarantee that the connectivity architecture is exactly the same as for a canonical MLP.
Attempt to answer your question about reshaping matrices:
Example of reshaping in Python's Numpy library: numpy.reshape
My guess is that the author meant that an FCN usually has a 1D output "vector" (from each layer) instead of a 2D matrix. Let's say the first layer of the FC network returns a 1x1x4096 output matrix, as it doesn't care about the image's dimensions: it stacks all dimensions into one vector (putting each row on top of another). You can guess that the next layer's weight matrix is going to have the corresponding shape (4096x4096), which combines all possible outputs. So when you convert it to a convolutional receptive field, you'll probably have to move your activations to 2D, so you need 64x64 activations and, I guess, something like a 64x64x4096 tensor for the receptive field's weights (since $S=1$).
The quote from the article that demonstrates "reshaping":
For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image
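The arithmetic in the quote follows the usual output-size rule (N - F + 2P)/S + 1. A tiny Python helper (the function name is my own) makes it easy to check the numbers:

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size of a conv layer: (N - F + 2P) / S + 1."""
    span = n - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not tile the input with these settings")
    return span // s + 1


# 224x224 input: the conv/pool stack reduces by 32, so the volume is 7x7.
assert 224 // 32 == 7
assert conv_output_size(7, 7) == 1    # FC1 as a 7x7 conv gives a 1x1 output

# 384x384 input: 12x12 before the converted layers, 6x6 after the 7x7 conv.
assert 384 // 32 == 12
assert conv_output_size(12, 7) == 6   # (12 - 7)/1 + 1 = 6

# The following 1x1 converted layers keep the spatial size unchanged.
assert conv_output_size(6, 1) == 6
```

So the only layer that changes the spatial size is the first converted one; the 1x1 layers after it just map 4096 channels to 4096 and then to 1000 at each of the 6x6 positions.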
Example (for activations of some layer):
In order to show weights reshaping (to fit a 2D image), I'd have to draw a square-into-cube conversion. However, there are some demos on the internet:
P.S. However, I have some confusion about AlexNet example: it seems like mentioned $F=1$ just means "full" parameter sharing across non-existent dimensions (1x1). Otherwise, it won't be completely equivalent to an MLP with no parameter sharing - but maybe that's what was implied (scaling small FC-network into a large CNN).
to “slide” the original ConvNet very efficiently across many spatial positions in a larger image
Basically it allows you to scale a FC-network trained on small portions/images into a larger CNN. So in that case only small window of resulting CNN will be initially equivalent to an original FCN. This approach gives you ability to share parameters (learned from small networks) across large networks in order to save computational resources and apply some kind of regularization (by managing network's capacity).
Edit1 in response to your comment.
Example of $N = 5$ (sorry I was lazy to draw 7 neurons), $F=5$, $S=2$ :
So you can see that S = 2 can be applied even for receptive field with maximum size, so striding can be applied without parameter sharing as all it does is just removing neurons.
And parameter sharing strategies could be different. For instance, you can't tell from my last figure whether parameters are shared between neurons or not.
Predictive Ecosystem Mapping (PEM) Detailed Polygons with Short Attribute Table - 50,000 Spatial View
PEM_50K contains 1 to 50,000 PEM polygons with key and amalgamated (concatenated) attributes derived from the Resource Inventory Standards Committee (RISC) standard attributes. PEM divides the landscape into units according to a variety of ecological features including climate, physiography, surficial material, bedrock geology, soils and vegetation. PEM uses a modeling approach to ecosystem mapping, whereby existing knowledge of ecosystem attributes and relationships are used to predict ecosystem representation in the landscape. This layer is derived from the STE_TEI_ATTRIBUTE_POLYS_SP layer by filtering on the PROJECT_TYPE and PROJECT_MAP_SCALE attributes.
- bioterrain mapping
- describing terrestr.
- ecosystem mapping
- ecosystem modelling
- predictive ecosyste.
- sensitive ecosystem
- sensitive ecosystem.
- slope stability
- terrain and ecosystems
- terrain mapping
- terrain stability
- terrain stability m.
- terrestrial ecosyst.
- wildlife habitat ra.
- wildlife inventory
Data and Resources
The PEM data in geodatabase format is available in the TEI Data Distribution.
This driver supports the GDALDriver::Create() operation
This driver supports georeferencing
KML reading is only available if GDAL/OGR is built with the Expat XML Parser, otherwise only KML writing will be supported.
Supported geometry types are Point, Linestring, Polygon, MultiPoint, MultiLineString, MultiPolygon and MultiGeometry. There are limitations, for example: the nested nature of folders in a source KML file is lost; folder <description> tags will not carry through to output. Folders containing multiple geometry types, like POINT and POLYGON, are supported.
Since not all features of KML are able to be represented in the Simple Features geometry model, you will not be able to generate many KML-specific attributes from within GDAL/OGR. Please try a few test files to get a sense of what is possible.
When outputting KML, the OGR KML driver will translate each OGR Layer into a KML Folder (you may encounter unexpected behavior if you try to mix the geometry types of elements in a layer, e.g. LINESTRING and POINT data).
The KML Driver will rename some layers, or source KML folder names, into new names it considers valid, for example 'Layer #0', the default name of the first unnamed Layer, becomes 'Layer__0'.
KML is a mix of formatting and feature data. The <description> tag of a Placemark will be displayed in most geobrowsers as an HTML-filled balloon. When writing KML, Layer element attributes are added as simple schema fields. This best preserves feature type information.
Limited support is available for fills, line color and other styling attributes. Please try a few sample files to get a better sense of actual behavior.