January 9, 2013
Management of Open-Data Cycle
Published by:
yaso
Tags:
cicle, control, Data Catalog, Linked Open Data, Open Government Data, planning, visualization
The cycle that open data follows, whether governmental or not, is long, and every stage of it needs attention if publishing initiatives and the data itself are to be sustainable. In other words, for the benefits promised by advocates of the open data philosophy to become something concrete, every point involved in publishing open data on the Web has to be considered.
Below is a chart that illustrates the relationships between the main stages involved in data publishing.
(this graphic is published in an open vector format – svg – on GitHub)
According to the chart, the cycle that a given piece of data follows may begin at the COLLECTION stage.
In an information society where hardware devices are available at relatively affordable prices, there are abundant possibilities for information sources to be creative when accumulating data that may benefit people in the near future.
Collection Situations
Forms, sensors, spontaneously provided data, and other sources may be used to build interesting databases. It is important to note that monitoring should target the environment in which people live rather than the people themselves. If any information depends on a human to provide it, it should always be protected by privacy laws, and contributing should be voluntary and free of any obligation. The advantage of designing data collection for a given situation is the possibility of building in tools that facilitate all the other phases. This means that if the collection solution is planned, it can contain important elements, such as semantic web resources and accessibility features, that save labor (particularly in the use and reuse phases) and enrich the database in question in innovative ways.
A good example of this is the database used for the “Open Self Medication” application, which combines semantic databases on drugs, their constituent substances, and the symptoms of each disease, relating each case in different ways that can, for example, reduce errors when a drug is prescribed.
Extraction Situations
Extraction is needed when data has been published in closed form. Almost always this means closed and proprietary or unstructured formats, with fields that are either missing or contain erroneous or useless information. It is no wonder that the process of making such data usable and analyzing it is called mining. Likewise, the process of extracting public data from closed or limited-access databases by writing code is known not only as hacking but also as web scraping.
Several methods and tools exist for extraction. The DadosGovBr GitHub account contains a repository that documents some very useful tools for this.
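As a rough illustration of what web scraping means in practice, here is a minimal sketch that turns an HTML table into CSV using only the Python standard library. It is not taken from the DadosGovBr repository; the `TableExtractor` class, the `html_table_to_csv` function, and the sample page fragment are all hypothetical, and a real scraper would also need to download the page and cope with messier markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, captions) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop a cell left open by broken markup


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page fragment standing in for a closed-format publication.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>budget</th></tr>
  <tr><td>Recife</td><td>1200</td></tr>
  <tr><td>Manaus</td><td>950</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

The point of the sketch is the shape of the work: the data was already public, but only after this step does it exist in a format that other tools can consume.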
Storage/Publication/Distribution of Open Data
Once the data is in hand, if the intention is to publish or republish it, it must be stored in repositories structured and designed to receive and distribute such data in an open and interoperable manner. These are the catalogs.
Forming consistent catalogs requires some basic rules that were initially defined to catalog library content. With the advent of the information age, they were transposed and adapted to the data catalog formation context. Users can utilize catalogs to refine searches or help in interpreting entries. According to the Brazilian Internet Steering Committee:
The semantics of information must be agreed in advance, so that all parties have a common understanding of the meaning of the data exchanged. At the international level, this may be a complex issue, since certain legal concepts differ from one country to another. The ultimate goal is to be able to interpret data evenly between the different platforms and organizations involved in data exchange. To do this, it would be useful to publish on the Web the names and definitions of the elements used in a shareable and referenceable format, regardless of the degree of support obtained.
Semantic qualification of data can add much to the chosen database. Thesauri, taxonomies, vocabularies, and classification schemes, among others, are resources used to produce 5-star data. Selecting or constructing these tools requires a full understanding of some of the Web standards and tools for storing and publishing data catalogs.
One of the most frequently used tools for viewing catalogs is CKAN, which is also a publishing, storage, and management tool for data sets. CKAN is free software, developed and maintained by a community, which means it has no licensing cost and a gentle learning curve. Dados.gov.br currently uses this software to maintain the Brazilian government’s open-data portal.
Regardless of the software suite adopted for storage and publishing, it is important to include certain concepts and standards in the planning of the open-data cycle. They are (taken from here):
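To give an idea of how a catalog can be queried by a program, the sketch below builds a request for CKAN's Action API `package_search` call and reads dataset titles out of the JSON reply. The helper names `package_search_url` and `dataset_titles` are hypothetical, the portal address `http://dados.gov.br` is assumed from the site name above and may not expose this endpoint today, and the sample reply is hand-written to follow the general shape of a CKAN response rather than copied from a real portal.

```python
import json
from urllib.parse import urlencode, urljoin


def package_search_url(portal, query, rows=5):
    """Build the URL of a CKAN `package_search` call on `portal`."""
    base = urljoin(portal.rstrip("/") + "/", "api/3/action/package_search")
    return base + "?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response_text):
    """Extract dataset titles from the JSON body of a `package_search` reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    # Fall back to the dataset's name when its title is empty or missing.
    return [d.get("title") or d["name"] for d in payload["result"]["results"]]


# Hand-written stand-in for a reply; real replies carry many more fields.
SAMPLE_REPLY = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "escolas", "title": "Escolas"},
            {"name": "orcamento", "title": ""},
        ],
    },
})

if __name__ == "__main__":
    print(package_search_url("http://dados.gov.br", "educacao", rows=2))
    print(dataset_titles(SAMPLE_REPLY))
```

Fetching the URL (for instance with `urllib.request.urlopen`) and passing the body to `dataset_titles` would complete the round trip; that step is left out here so the sketch runs without network access.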
- URI: a resource identifier used to identify or locate something on the Web
- A URL is a URI that identifies a resource and provides a means to act on it, obtain and/or represent this resource, describing its primary access mechanism or location on the “web”.
For example, the URL http://www.w3c.br/ is a URI that identifies a resource (the W3C Brasil website), represents this resource (the HTML page, for example), and is available via HTTP from a network host (http://www.w3c.br).
Below is a diagram showing the structure of a URI (Taken from this site).
- RDF/XML: XML is a W3C standard format for creating documents with data organized in a hierarchical fashion, as often seen in formatted text documents, vector images, or databases. RDF/XML uses this syntax to write down RDF (Resource Description Framework) data.
- SPARQL: pronounced “sparkle”, this query language is also recommended by the W3C and maintained by the W3C Semantic Web groups. It is used to query data in RDF, and a query can be written independently of the format in which the results are returned.
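The parts of a URI mentioned in the list above can be inspected with Python's standard `urllib.parse` module. This is only a small sketch: the `describe_uri` helper is hypothetical, and the second address, with its query and fragment, is invented for illustration.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into its generic components."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # access mechanism, e.g. http
        "authority": parts.netloc,   # the network host
        "path": parts.path,          # where the resource sits on that host
        "query": parts.query,        # optional parameters after "?"
        "fragment": parts.fragment,  # optional part after "#"
    }


if __name__ == "__main__":
    print(describe_uri("http://www.w3c.br/"))
    # Hypothetical address, used only to show a query and a fragment.
    print(describe_uri("http://www.w3c.br/dados?formato=rdf#topo"))
```

For http://www.w3c.br/ the scheme is `http` and the authority is `www.w3c.br`, which is exactly what makes it a URL: it names the resource and also says how and where to reach it.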
There are standards for publishing data in open formats. To provide an interoperable environment across all e-gov domains, it is imperative that laws and/or governmental recommendations specify and regulate these standards. When data is dynamic, interoperable, and fed systematically, costs are reduced and publishing becomes part of administrative routines (such as completing tables in Excel). In practice, this means that planning may be slow and expensive, but it reduces the later cost of maintaining a sustainable environment.
The Importance of APIs
When it comes to large volumes of dynamic open data, the best way to plan the opening of one’s data is to include a conversation on APIs and seriously consider using them. APIs are intended to be used as an interface through which software components communicate with each other.
API stands for Application Programming Interface. An API is a set of predetermined programming rules that makes it possible to create applications that use these rules to obtain data from layers the average user never sees. APIs keep multiple systems and applications connected and interoperating whenever data is requested. They should be open and transparent so that developers may access them and suggest new features to improve their applications.
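To make the idea of an API less abstract, here is a minimal sketch of a read-only JSON interface over a tiny data set, written with Python's standard `http.server` module. Everything in it is an assumption made for illustration: the records, the `/records` paths, and the names `respond` and `OpenDataHandler`. A production API would add documentation, versioning, paging, and more.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical records standing in for a published data set.
DATASET = [
    {"id": 1, "city": "Recife", "budget": 1200},
    {"id": 2, "city": "Manaus", "budget": 950},
]


def respond(path):
    """Map a request path to an (HTTP status, JSON body) pair."""
    if path == "/records":
        return 200, json.dumps(DATASET)
    if path.startswith("/records/"):
        key = path.rsplit("/", 1)[1]
        for record in DATASET:
            if str(record["id"]) == key:
                return 200, json.dumps(record)
    return 404, json.dumps({"error": "not found"})


class OpenDataHandler(BaseHTTPRequestHandler):
    """Answer GET requests with the JSON produced by `respond`."""

    def do_GET(self):
        status, body = respond(self.path)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

Passing `OpenDataHandler` to `http.server.HTTPServer(("127.0.0.1", 8000), OpenDataHandler)` and calling `serve_forever()` would expose the records at `/records` and `/records/1` on the local machine. Keeping the routing in the plain function `respond` means the rules of the interface can be read, and tested, without starting a server.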
The presentation that accompanies this part of the course is available for download here and available to read online.
Use / reuse
Continued in the next post, on visualizations and applications.