SQL to Neo4j

Orieus edited this page Sep 2, 2020 · 1 revision

Troubleshooting

UTF-8 encoding

Every piece of software involved uses UTF-8 to handle data. If the terminal used to run the scripts is not also set to UTF-8, encoding errors may occur. To be safe, you can add

export LC_ALL=es_ES.utf8

to your .bash_profile, assuming you are using bash and the machine supports the es_ES.utf8 locale (locale -a lists the locales that are available).
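To check what encoding the scripts will actually see, a small diagnostic can help. The following is a minimal sketch, assuming the scripts are run with Python 3; the helper names is_utf8 and report_encodings are hypothetical and not part of the project.

```python
import codecs
import locale
import sys


def is_utf8(encoding_name):
    """Return True if the given encoding name is an alias of UTF-8."""
    if not encoding_name:
        return False
    try:
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


def report_encodings():
    """Collect the encodings Python sees for the locale and the standard streams."""
    return {
        "locale": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
    }


if __name__ == "__main__":
    for source, name in report_encodings().items():
        status = "ok" if is_utf8(name) else "NOT UTF-8"
        print(f"{source}: {name} ({status})")
```

If any line reports NOT UTF-8, the locale setting above is probably not in effect for that terminal session.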

Disambiguation

Huge tables

A table that does not fit in memory can be a problem when disambiguating authors and/or organizations: it must be read and written to a CSV file blockwise, and an author read at, e.g., the 1000th block may be mapped (disambiguated) to one read at the 1st block. The newly read author should be discarded (it was already written, and the disambiguated_id must be unique), but to know that you need to keep track of the authors already read.
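To make that bookkeeping concrete, here is a minimal sketch of blockwise writing with a set of already-seen ids. It is an illustration of the problem, not the project's code: the function name write_unique_blocks and the sample rows are hypothetical, and the blocks are assumed to arrive as lists of dictionaries. Note that the set of seen ids must itself fit in memory.

```python
import csv
import io


def write_unique_blocks(blocks, out_file, key="disambiguated_id"):
    """Write rows block by block, skipping any row whose key was already written."""
    seen = set()  # every disambiguated id written so far, across all blocks
    writer = None
    written = 0
    for block in blocks:
        for row in block:
            if row[key] in seen:
                continue  # already written while processing an earlier row
            seen.add(row[key])
            if writer is None:
                # The first row written defines the CSV header.
                writer = csv.DictWriter(out_file, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            written += 1
    return written


# Hypothetical sample data: the second block repeats disambiguated_id "1".
blocks = [
    [{"disambiguated_id": "1", "name": "Ana"}, {"disambiguated_id": "2", "name": "Luis"}],
    [{"disambiguated_id": "1", "name": "A."}, {"disambiguated_id": "3", "name": "Eva"}],
]
buffer = io.StringIO()
count = write_unique_blocks(blocks, buffer)
```

Here the author with id "1" in the second block is skipped, so only three rows reach the output.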

Solution

Tables on which disambiguation is to be applied are read in one go.

Merging

When merging authors/organizations into Neo4j, keep in mind that Neo4j only merges a node into an existing one if the properties that are present in both match exactly. This means that if two nodes have the same disambiguated_id but differ in the name property (e.g., Ana in one of them and A. in the other), they will not be merged; instead, Neo4j will try to create a new node for the newcomer. This raises an error, because disambiguated_id is a unique property and two nodes cannot share the same value.
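The failure mode can be reproduced with a toy model in plain Python. This is only a simplified imitation of the behaviour described above, operating on a list of dictionaries rather than on Neo4j; toy_merge and UniqueConstraintError are hypothetical names.

```python
class UniqueConstraintError(Exception):
    """Raised when a new node would duplicate a unique property value."""


def toy_merge(nodes, props, unique_key="disambiguated_id"):
    """Mimic the merge behaviour described above on a plain list of dicts."""
    # A node matches only if every property given in `props` has the same value.
    for node in nodes:
        if all(node.get(k) == v for k, v in props.items()):
            return node
    # No match: a new node is created, unless the unique value is already taken.
    if any(node.get(unique_key) == props.get(unique_key) for node in nodes):
        raise UniqueConstraintError(
            f"{unique_key}={props.get(unique_key)!r} already exists"
        )
    nodes.append(dict(props))
    return nodes[-1]


nodes = []
toy_merge(nodes, {"disambiguated_id": 7, "name": "Ana"})  # creates the node
toy_merge(nodes, {"disambiguated_id": 7, "name": "Ana"})  # matches, nothing new
try:
    toy_merge(nodes, {"disambiguated_id": 7, "name": "A."})  # same id, other name
    clash = None
except UniqueConstraintError as error:
    clash = str(error)
```

The third call finds no exact match, attempts a creation, and fails on the unique id, which mirrors the error seen during the import.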

Solution

Properties with the same meaning (e.g., name) can be given different names depending on the table (name_patents, name_projects...), so that their values never clash when merging. This can be fixed once the import process is over.
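One possible reading of this solution is sketched below, assuming rows arrive as dictionaries and that only disambiguated_id is used as the merge key. The helpers table_specific_properties and build_merge, and the Author label, are hypothetical; the query text uses standard Cypher (MERGE on the key, then SET n += to add the remaining properties) but is not taken from the project.

```python
def table_specific_properties(row, table, key="disambiguated_id"):
    """Suffix every property except the merge key with the source table name."""
    out = {}
    for prop, value in row.items():
        out[prop if prop == key else f"{prop}_{table}"] = value
    return out


def build_merge(label, row, table, key="disambiguated_id"):
    """Return a Cypher query and its parameters that merge on the key only."""
    props = table_specific_properties(row, table, key)
    merge_id = props.pop(key)
    # Only the unique key takes part in matching; other properties are set afterwards.
    query = f"MERGE (n:{label} {{{key}: $id}}) SET n += $props"
    return query, {"id": merge_id, "props": props}


query, params = build_merge(
    "Author", {"disambiguated_id": 7, "name": "A."}, "patents"
)
```

With this naming, the same author coming from the projects table would add name_projects to the existing node instead of conflicting with name_patents.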
