Skip to content

Minimal CSV requirements and sanitization  #2

@ppKrauss

Description

@ppKrauss

AddressForAll recives work from municipalities and other institutions, that is, recive works from "original-data authorities". Is usual to not work with open standards, so, to avoid big barriers, the "AddressForAll reciver agent" will be accept into some minimal constraints:

  1. The file is a valid MIME text/csv as defined by RFC 7111.

  2. The file has a header line and authority define another or confirm that , is the field separator.

  3. The authority say the number of lines and number of bytes of the file. And application that generated the file (e.g. MS-Excel, Autocad, etc.)

After a basic treatment at terminal, by unzip and dos2unix file.csv, check metadata and charset:

  • wc -l file.csv to confirm number of lines
  • file -i file.csv to check supposed charset
  • grep -axv '.*' file.csv to check lines with strange characters

If no problem, jump to PostgreSQL ingestion, else the "AddressForAll reciver agent" decide if will edit and fix the problems (less tham half hour of work) or it is necessary to report problems and request a new file.

The most frequent fix procedure is to enforce UTF-8:

iconv -c -t UTF-8  < file.csv > new_file.csv

At PostgreSQL define minimal ingestion standards. For example for São Vicente (SP) used

CREATE or replace FUNCTION stable.br_housenumber(p text) RETURNS int AS $f$
  SELECT CASE 
      WHEN x~ '^\d+$' THEN x::int
      WHEN x~ '\d' THEN regexp_replace(x, '^[^\d]*(\d+).+$', '-\1')::int
      ELSE NULL
   END
  FROM (select trim(p)) t(x)
$f$ LANGUAGE SQL IMMUTABLE;
COMMENT ON FUNCTION stable.br_housenumber(text)
 IS 'Sanitize houseNumber by trim, on error changes to negative or NULL.';

CREATE FOREIGN TABLE ingest.brsp_tipo01 (
	houseNumber text,
	point_SRID29193 text
) SERVER files OPTIONS (
	filename '/tmp/pg_io/mapa01_loteamentos-pontos-Oficial.csv',
	format 'csv',
	header 'true'
);
CREATE VIEW ingest.vw_brsp_tipo01 AS 
  SELECT stable.br_housenumber(housenumber) AS  housenumber,
          ST_PointFromText(point_SRID29193, 29193 ) AS geom
  FROM ingest.brsp_tipo01
;

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions