sde subtype hell
Getting the database structure in PostgreSQL is pretty straightforward and lets you do some really neat things like dynamic SQL with type casts and so on. For example, to get all columns of a specific table with their column order, names, formatted datatype expressions, nullability, and default values, you can do something like:
select
ns.nspname as schema_name,
c.relname as table_name,
a.attnum as ord,
a.attname as column_name,
format_type(a.atttypid, a.atttypmod) as data_type,
a.attnotnull as not_null,
pg_get_expr(def.adbin, def.adrelid) as default_value
from
pg_class c
join pg_namespace ns on
ns.oid = c.relnamespace
join pg_attribute a on
a.attrelid = c.oid
left join pg_attrdef def on
def.adrelid = a.attrelid and
def.adnum = a.attnum
where
a.attnum > 0 and
not a.attisdropped and
c.oid = 'hexgrid'::regclass::oid
order by a.attnum
which will output one row per column of hexgrid: the ordinal position, name, formatted datatype, nullability, and the default expression (if any).
Cool, so when inserting data into a PostgreSQL-backed ESRI sde database from e.g. csv files, we can use the database to tell us all of this?
No. Sorry, not everything. You can get the basic set of datatypes (varchar, int, smallint, timestamp, st_geometry) and the nullability, but not, for example, the default values, which are stored in sde system tables. This is most probably due to subtypes, which for a particular table all go into the same "physical" table but might have different defaults assigned for a particular column. So how do you go about this?
This can be achieved by parsing the xml definition from the corresponding row in the sde.gdb_items table.
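If you want a peek at the raw material we'll be parsing, the definition xml can be pulled out directly (hexgrid again being the example table from above; adjust the name pattern to your own data):
select
i.name,
i.definition
from
sde.gdb_items i
where
i.name like '%hexgrid%'
;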
But first, let's look at domains.
These are essentially codelists that you can assign to table columns (and by "table" I mean both tables and feature classes in ESRI parlance, as these are essentially the same thing) and they consist of a code and a corresponding label. The code value is stored in the table, the label is what gets shown to the user. If you input (by hand) a wrong code value, your ESRI-based frontend will not know how to label it. All's good as far as the database is concerned, but the people sucked into esri applications will still be complaining.
One way to get around this when inserting raw data from csv files is to check the available domains during the insert. And maybe you don't have the codes at all, but the labels instead. So instead of doing
insert into foo.bar (
objectid, some_domain_column
)
select
sde.next_rowid64('foo', 'bar') as objectid,
case
when rows.some_domain_column =
'The answer to life, the universe, and everything'
then 42
else -1
end as some_domain_column
from (
-- the following should be a table select but let's just pretend for a second
values (
'The answer to life, the universe, and everything'
)
) as rows (some_domain_column)
;
and then later discovering that 42 is not defined in the domain. Or maybe the codes change so quickly that hand-coding all of this into case-when statements is just hard labour. But we can prepare the domains as a database view:
create or replace view public.v_sde_domains as
select
i.name as domain_name, x.*
from
sde.gdb_items i,
sde.gdb_itemtypes t,
xmltable(
'/GPCodedValueDomain2/CodedValues/CodedValue'
passing i.definition
columns
code text path 'Code' not null,
name text path 'Name'
) x
where
t.uuid = i.type and
t.name = 'Coded Value Domain'
;
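A quick sanity check of the view (the_answer being the hypothetical domain used in the examples here):
select
domain_name, code, name
from
public.v_sde_domains
where
domain_name = 'the_answer'
order by
code
;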
And we can then use this in the previous insert query as:
select
some_domain_column.code
from (
-- the following should be a table select but let's just pretend for a second
values (
'The answer to life, the universe, and everything'
)
) as rows (some_domain_column)
left join
public.v_sde_domains some_domain_column on
some_domain_column.domain_name = 'the_answer' and
some_domain_column.name = rows.some_domain_column
;
So even if the domain codes change quickly and unpredictably, we can rest assured that we'll be getting the correct codes for existing values, just as they are in the database. We can even mark the mistakes this way. For example, we can distinguish between a missing value and a messed-up value in the input data with:
select
case
when rows.some_domain_column is not null
then coalesce(
some_domain_column.code::int,
-1
)
else -2
end as some_domain_column
from
[..]
assuming you want messed-up values to be marked with -1 and missing values with -2. My personal preference for marking missing values has always been NULL, but for whatever reasons, personal or data-management-provider based, there are people who don't like NULLs.
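For the NULL-friendly among us, a minimal variation of the snippet above: drop the else branch, and since a case expression without else evaluates to NULL, a missing input value simply stays NULL (the ::int cast keeps it in line with the later examples):
select
case
when rows.some_domain_column is not null
then coalesce(
some_domain_column.code::int,
-1
)
-- no else branch: a missing input value just stays NULL
end as some_domain_column
from
[..]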
Now, with domains it seems pretty straightforward: these are separate "objects" floating somewhere out there in the database. It is important to keep track of the "domain owner" and "domain name" values, as these might get messed up. For example, if there are multiple schemas and you pass your datamodels around via XMLWorkspace documents (export from one DB, import to another), then you should pay attention to the order in which you export and import your xmls, and make sure the domain definitions remain the same in all the xmls you've made of all the schemas.
Although there is a thing called "domain owner" (at least in sde-exported xmls), it seems a real waste of textfile bytes, as the domains list contains all the domains from the db all the time. At least I've seen no effect of it, nor any point to it.
Btw, this sounds like yet another reason not to use schemas for sde databases - it absolutely unnecessarily complicates DB admin / dev-ops activities.
But anyway, subtypes. Using subtypes in this way is a bit more complicated, but also possible. Unlike domains, subtypes are bound to the table definition; they don't exist independently of their respective tables. So in order to get them, we'll have to query and parse the definition xml from sde.gdb_items for every table. And as I said before, when I say table, I mean both table and feature class, because apparently these have differing root xpath expressions in the definition xml…
select
r[2] as nsname, r[3] as relname,
coalesce(fe.col_name, ta.col_name) as subtype_col_name,
coalesce(fe.default_subtype, ta.default_subtype) as default_subtype,
s.subtype_name, s.subtype_code,
c.attname, nullif(c.domain_name, '') as domain_name, c.default_value
from
sde.gdb_items i,
sde.gdb_itemtypes t
left join lateral
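-- plain tables have DETableInfo as the root element of their definition xml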
xmltable(
'/DETableInfo'
passing i.definition
columns
default_subtype text path 'DefaultSubtypeCode',
col_name text path 'SubtypeFieldName',
subtypes xml path 'Subtypes'
) ta on true
left join lateral
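-- ..while feature classes use DEFeatureClassInfo instead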
xmltable(
'/DEFeatureClassInfo'
passing i.definition
columns
default_subtype text path 'DefaultSubtypeCode',
col_name text path 'SubtypeFieldName',
subtypes xml path 'Subtypes'
) fe on true
join lateral
xmltable(
'/Subtypes/Subtype'
passing coalesce(fe.subtypes, ta.subtypes)
columns
subtype_name text path 'SubtypeName',
subtype_code int path 'SubtypeCode',
fieldinfos xml path 'FieldInfos'
) s on true
join lateral
xmltable(
'/FieldInfos/SubtypeFieldInfo'
passing s.fieldinfos
columns
attname text path 'FieldName',
domain_name text path 'DomainName',
default_value text path 'DefaultValue'
) c on true
join lateral
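-- i.name is the fully qualified name (database.owner.table); split it to pick out schema and table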
string_to_array(i.name,'.') r on true
where
t.uuid = i.type and
t.name in ('Table', 'Feature Class')
;
This query lists all tables in the db that have a subtype, the domains assigned to their columns, and the default values for each subtype. So we get multiple rows per table, one for every column in every subtype. That might be good for an overview, but it would be easier to use in subsequent queries if we took advantage of PostgreSQL's JSON functionality: essentially aggregating all columns and subtypes into a JSON blob, so we end up with one row per table (or to be more precise, per subtype column):
create or replace view public.v_sde_subtypes as
select
nsname, relname, subtype_col_name, default_subtype,
jsonb_object_agg(subtype_name, subtype_code) as subtype,
jsonb_object_agg(subtype_name, col) as cols_per_subtype
from (
select
r[2] as nsname, r[3] as relname,
coalesce(fe.col_name, ta.col_name) as subtype_col_name,
coalesce(fe.default_subtype, ta.default_subtype) as default_subtype,
s.subtype_name, s.subtype_code,
jsonb_object_agg(
c.attname,
jsonb_build_object(
'domain',
nullif(c.domain_name, ''),
'default',
c.default_value
)
) as col
from
sde.gdb_items i,
sde.gdb_itemtypes t
left join lateral
xmltable(
'/DETableInfo'
passing i.definition
columns
default_subtype text path 'DefaultSubtypeCode',
col_name text path 'SubtypeFieldName',
subtypes xml path 'Subtypes'
) ta on true
left join lateral
xmltable(
'/DEFeatureClassInfo'
passing i.definition
columns
default_subtype text path 'DefaultSubtypeCode',
col_name text path 'SubtypeFieldName',
subtypes xml path 'Subtypes'
) fe on true
join lateral
xmltable(
'/Subtypes/Subtype'
passing coalesce(fe.subtypes, ta.subtypes)
columns
subtype_name text path 'SubtypeName',
subtype_code int path 'SubtypeCode',
fieldinfos xml path 'FieldInfos'
) s on true
join lateral
xmltable(
'/FieldInfos/SubtypeFieldInfo'
passing s.fieldinfos
columns
attname text path 'FieldName',
domain_name text path 'DomainName',
default_value text path 'DefaultValue'
) c on true
join lateral
string_to_array(i.name,'.') r on true
where
t.uuid = i.type and
t.name in ('Table', 'Feature Class')
group by
1,2,3,4,5,6
) d
group by
nsname, relname, subtype_col_name,
default_subtype
;
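Before putting it to use, a quick look at what the view holds for a single table (foo.bar again being the hypothetical table from the insert example above):
select
nsname, relname, subtype_col_name, default_subtype,
jsonb_pretty(subtype) as subtype,
jsonb_pretty(cols_per_subtype) as cols_per_subtype
from
public.v_sde_subtypes
where
nsname = 'foo' and
relname = 'bar'
;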
We can use this in queries like the one we did before. Imagine the table we were inserting into has a subtype field, let's say a column called type, which classifies a row as either a Question or an Answer. Let's select a question again (see the comments in the SQL for some explanations):
select
case
when rows.type is not null
/* get the subtype code for the value if it exists and is correct,
otherwise encode it as -1
*/
then coalesce(
(subtype.subtype->>rows.type)::int,
-1
)
else
/* ..or if there's none in the input, use the subtype default,
otherwise encode it as -2. Everything goes red
in arcgis.
*/
coalesce(
subtype.default_subtype::int,
-2
)
end as type,
case
when rows.some_domain_column is not null
/* there's an inbound value for column foo.bar.some_domain_column.
using the "assigned domains" data we have from left joining
v_sde_domains to v_sde_subtypes, get the appropriate code
value for the domain of this column in this subtype.
-1 here means there is no appropriate code in the domain
we should be using (messed-up data)
*/
then coalesce(
some_domain_column.code::int,
-1
)
else
/* there was no input value for column. Use the default value
assigned for this column in this subtype. -2 here means that
there was no default value set for the column */
coalesce(
(
(
(
subtype.cols_per_subtype->>rows.type
)::jsonb->>'some_domain_column'
)::jsonb->>'default'
)::int,
-2
)
end as some_domain_column
from (
-- the following should be a table select but let's just pretend for a second
values (
'Question',
'The answer to life, the universe, and everything'
)
) as rows (type, some_domain_column)
left join
public.v_sde_subtypes subtype on
subtype.nsname = 'foo' and
subtype.relname = 'bar'
left join
public.v_sde_domains some_domain_column on
/* subtype.cols_per_subtype["<subtype-name>"]["<column-name>"]["domain"]
gives the domain name: expectedly `the_answer`
but we don't have to hard-code it here
*/
some_domain_column.domain_name = (
(subtype.cols_per_subtype->>rows.type)::jsonb->>'some_domain_column'
)::jsonb->>'domain' and
some_domain_column.name = rows.some_domain_column
;
This final select could be made even more elaborate with e.g. NOT NULL checking for the insert columns too - if there's no default value and the column allows NULLs, there's no apparent reason why we should insert a value into that column at all. The error code (-2) should really only go into those columns where it is actually needed.
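A rough sketch of that idea, with the subtype machinery left out for brevity: pull the nullability straight from pg_attribute (as in the very first query of this post) and only fall back to the error code when the column actually requires a value. foo.bar, the_answer and some_domain_column are still the hypothetical names from the examples above.
select
case
when rows.some_domain_column is not null
then coalesce(some_domain_column.code::int, -1)
/* no input value: use the error code only if the column actually
requires a value, otherwise just leave it NULL */
when col.not_null then -2
else null
end as some_domain_column
from (
-- the following should be a table select but let's just pretend for a second
values (
'The answer to life, the universe, and everything'
)
) as rows (some_domain_column)
left join
public.v_sde_domains some_domain_column on
some_domain_column.domain_name = 'the_answer' and
some_domain_column.name = rows.some_domain_column
left join (
-- nullability straight from the catalog
select a.attname, a.attnotnull as not_null
from pg_attribute a
where
a.attrelid = 'foo.bar'::regclass and
a.attnum > 0 and
not a.attisdropped
) col on
col.attname = 'some_domain_column'
;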
But why am I writing this? Personal reflection, most probably. And a few things to remember:
- importing data using ESRI tools will guard against ill-managed domains and subtypes, but it is woefully slow. And it doesn't allow much control over how to handle errors, smallish changes, or recodings in a simple, no-nonsense way while transferring data from one table to another. At least not for such a simple-minded user as myself.
- using psql to import data from csv files is blazing fast (compared to the ESRI tools), and SQL (especially if you have PostGIS enabled in your PostgreSQL database) gives you enormous power over what can be done or fixed automatically and repeatedly, running the whole import process over and over again until you like the results you see (there's a minimal sketch of this at the end of the post).
- And here's a couple of extra things: you want to preserve the ESRI globalid values for table rows? Sure! You want to preserve the ESRI editor tracking data? Sure! You want, for whatever reason, to retain the original ESRI objectid values? Of course you can. But only via the psql and csv approach, not through ESRI itself. At least not straight out of the box.
- Did you know that the ogr2ogr CSV driver can be configured to write all kinds of geospatial data into csv files too? :)
(sorry, was only teasing with this last point…)
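And since the psql and csv approach got a mention above, here's a minimal sketch of the very first step, assuming a hypothetical staging table raw_bar and an input file questions.csv; everything after that is just plain SQL like the queries above:
create table raw_bar (
type text,
some_domain_column text
);
-- \copy runs client-side in psql, so the csv only needs to exist on your machine
\copy raw_bar (type, some_domain_column) from 'questions.csv' csv header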