Wikibase/Indexing/Benchmarks

Titan benchmarks

Measured on einsteinium with an external Cassandra cluster.

Shorter lookups

These are short lookups that must be fast.

Checking random element without fetching property

w.measure(10000) { def a = g.V('wikibaseId','Q'+(random.nextInt(10000000) as String)).hasNext(); }

[18816, 13342, 15188, 12626, 12289]

Average: 14452.2

Time: 1.44522 ms
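
The Time figures on this page are consistent with each bracketed number being the total wall-clock time, in milliseconds, of one run of N iterations, averaged over five runs and then divided by N. The sketch below reproduces that arithmetic in plain Groovy. Both `measure` and `perIterationMs` are hypothetical stand-ins written for illustration; the real `w.measure` helper is not shown on this page and may work differently.

```groovy
// Hypothetical stand-in for w.measure (the real helper is not shown on this page):
// runs 5 rounds of `iterations` calls and returns each round's total time in ms.
def measure(int iterations, Closure body) {
    (1..5).collect {
        long start = System.currentTimeMillis()
        iterations.times { body() }
        System.currentTimeMillis() - start
    }
}

// Average of the per-round totals, divided by the iteration count.
def perIterationMs(List roundTotalsMs, int iterations) {
    (roundTotalsMs.sum() / roundTotalsMs.size()) / iterations
}

// Totals reported above for 10000 random lookups.
def totals = [18816, 13342, 15188, 12626, 12289]
println "Average: ${totals.sum() / totals.size()}"
println "Time: ${perIterationMs(totals, 10000)} ms"
```

The same division applies to every benchmark below; only the iteration count passed to the helper changes.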

Checking random element

w.benchmark { 10000.times { def a = g.V('wikibaseId','Q'+(random.nextInt(10000000) as String)).labelEn.hasNext(); } }

[39330, 28555, 30037, 27755, 35049]

Average: 32145.2

Time: 3.21452 ms

Checking fixed node

This mostly measures cache performance.

w.measure(10000) { a = g.V('wikibaseId', 'Q30').labelEn.hasNext() }

[10889, 9779, 8969, 8930, 9467]

Average: 9606.8

Time: 0.96068 ms

Checking supernode

This also mostly measures cache performance, but for a supernode that has a very large number of incoming edges.

w.measure(10000) { def a = g.V('wikibaseId', 'Q5').labelEn.next(); }

[9611, 8339, 8174, 8360, 8815]

Average: 8659.8

Time: 0.86598 ms

Checking supernode out - first human

Navigating a "wide" link out of the supernode.

w.measure(100) { def a = g.V('wikibaseId', 'Q5').in("P31")[0].next(); }

[8689, 7015, 7194, 8082, 8515]

Average: 7899

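Time: 0.7899 ms (this assumes 10000 iterations per run; with the 100 iterations shown in the snippet it would be 78.99 ms)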

Random human

This may stretch the cache a little more, but should still be cacheable.

w.measure(10000) { def a = g.V('wikibaseId', 'Q5').in("P31")[random.nextInt(10000)].next(); }

[21395, 21192, 21288, 20017, 21699]

Average: 21118.2

Time: 2.11182 ms

Random human with name, bigger spread

This is probably outside the current cache size. The [] index probably also does a linear scan, so performance degrades roughly quadratically, as expected.

w.measure(100) { def a = g.V('wikibaseId', 'Q5').in("P31")[random.nextInt(100000)].labelEn.next(); }

[27543, 24389, 24191, 23185, 26852]

Average: 25232

Time: 252.32 ms
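
To illustrate the linear-scan suspicion, here is a minimal plain-Groovy sketch of positional access on a lazy iterator. It assumes that `[n]` on a pipeline has to walk past every preceding element, which has not been verified against the Gremlin implementation; `nth` is a hypothetical helper, and a plain integer range stands in for the `in("P31")` pipe.

```groovy
// Hypothetical helper: return the element at `index` from a lazy iterator,
// together with the number of elements consumed to reach it.
def nth(Iterator source, int index) {
    int consumed = 0
    while (source.hasNext()) {
        def value = source.next()
        consumed++
        if (consumed == index + 1) {
            return [value: value, consumed: consumed]
        }
    }
    // The iterator ran out before reaching the requested position.
    [value: null, consumed: consumed]
}

// A plain range stands in for the incoming-edge pipe of the supernode.
def first = nth((0..<100000).iterator(), 0)
def last = nth((0..<100000).iterator(), 99999)
println "index 0 consumed ${first.consumed} element(s)"
println "index 99999 consumed ${last.consumed} element(s)"
```

Under that assumption, widening the random range from 10000 to 100000 raises the expected number of elements walked per lookup by about a factor of ten, before any cache misses are counted.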

Random human with name - cached

def a = g.listOf('Q5')[0].next()

Check if random entry is a human - non-cached

This uses the "out" link to Q5.

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).out("P31").has('wikibaseId', 'Q5').hasNext(); }

[6509, 3882, 4626, 4165, 3371]

Average: 4510.6

Time: 4.5106 ms

Check if random entry is a human - cached

This uses the "link" property on the vertex itself. Surprisingly, there is not much difference!

w.measure(10000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).has('P31link', CONTAINS, 'Q5').hasNext(); }

[54131, 52634, 43485, 41180, 44011]

Average: 47088.2

Time: 4.70882 ms

Check if random entry is human and not disambiguation

Simplistic approach, just going by out links:

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).as('x').out("P31").has('wikibaseId', 'Q5').back('x').filter{!it.out('P31').has('wikibaseId', 'Q4167410').hasNext()}.hasNext(); }

[9069, 7610, 5076, 4825, 6499]

Average: 6615.8

Time: 6.6158 ms

More sophisticated condition handling using the link property:

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).filter{'Q5' in it.P31link && !('Q4167410' in it.P31link);}.hasNext(); }

[4489, 3696, 3677, 3597, 3480]

Average: 3787.8

Time: 3.7878 ms
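
The link-property condition can be tried without a graph. The sketch below applies the same membership test to plain maps standing in for vertices. It assumes `P31link` holds a list of entity ID strings, and the null guard is an addition for vertices without the property; `isHumanNotDisambiguation` is a hypothetical name.

```groovy
// Same condition as the filter above, applied to a map standing in for a vertex.
// Assumption: P31link is a list of entity ID strings, or absent.
def isHumanNotDisambiguation = { vertex ->
    def links = vertex.P31link ?: []
    'Q5' in links && !('Q4167410' in links)
}

println isHumanNotDisambiguation([P31link: ['Q5']])
println isHumanNotDisambiguation([P31link: ['Q5', 'Q4167410']])
println isHumanNotDisambiguation([labelEn: 'no link property'])
```

Because the test only reads a property that is already on the vertex, it does not need the extra edge traversals that the `out("P31")` variant performs.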

Collect 1000 non-empty names

Using the link property:

w.measure(1000) {t = []; g.V('P31link', 'Q5').labelEn.filter{it != null}[0..1000].aggregate(t).iterate(); assert t.size() == 1001;}

[29682, 29685, 31022, 30879, 28966]

Average: 30046.8

Time: 30.0468 ms

Using the "in" edge. Now there is a big difference:

w.measure(100) {t = []; g.V('wikibaseId', 'Q5').in('P31').labelEn.filter{it != null}[0..1000].aggregate(t).iterate(); assert t.size() == 1001;}

[13203, 11387, 11429, 11385, 11359]

Average: 11752.6

Time: 117.526 ms

Find country

This would be heavily cached.

w.measure(1000) { def a = g.V('wikibaseId', 'Q1013639').toCountry().labelEn.next(); }

[2905, 2625, 2504, 2358, 2436]

Average: 2565.6

Time: 2.5656 ms

Find country of random neighborhood

This one may have less luck with caching.

w.measure(100) { def a = g.listOf('Q123705').shuffle()[0].toCountry().labelEn.hasNext(); }

[17432, 17212, 16752, 16681, 16310]

Average: 16877.4

Time: 168.774 ms

Check if random neighborhood is in Finland?

w.measure(100) { g.listOf('Q123705').shuffle()[0].toCountry().has('wikibaseId', 'Q33').hasNext(); }

[17707, 17807, 17310, 17461, 18288]

Average: 17714.6

Time: 177.146 ms

Longer list queries

These may generate long lists and are expected to be slower.

List of countries by population

The list is small, so it is most likely cacheable.

w.measure(100) { t= []; g.listOf('Q6256').as('c').groupBy{it}{it.claimValues('P1082').preferred().latest()}.cap.scatter.filter{it.value.size()>0}.transform{it.value = it.value.P1082value.collect{it?it as int:0}.max(); it}.order{it.b.value <=> it.a.value}.transform{[it.key.wikibaseId, it.key.labelEn, it.value]}.aggregate(t).iterate(); } 

[2885, 2838, 2811, 2803, 2776]

Average: 2822.6

Time: 28.226 ms
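
The query above packs several steps into one pipeline. As a reading aid, here is a plain-Groovy sketch of the same post-processing on in-memory data: drop entities with no population claims, keep the largest value per entity, and sort in descending order. The entity IDs and numbers are made-up placeholders, and the sketch leaves out the `preferred().latest()` claim selection, which depends on helpers not shown on this page.

```groovy
// Placeholder input: entity ID -> raw P1082 values (IDs and numbers are made up).
def populationClaims = [
    'Q_A': ['100', '250'],
    'Q_B': [],
    'Q_C': [null, '40'],
]

// Drop entities without claims, as filter{it.value.size()>0} does in the query.
def withClaims = populationClaims.findAll { id, values -> values.size() > 0 }

// Keep the largest value per entity, treating missing values as 0.
def maxima = withClaims.collect { id, values ->
    [id, values.collect { it ? it as int : 0 }.max()]
}

// Sort descending by population, like order{it.b.value <=> it.a.value}.
def ranked = maxima.sort { a, b -> b[1] <=> a[1] }
println ranked
```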

List of all occupations

This is probably cached too.

w.measure(100) { t = []; g.wd('Q28640').treeIn('P279').instances().dedup().aggregate(t).iterate(); assert t.size() == 2777}

[4647, 4530, 4593, 4549, 4479]

Average: 4559.6

Time: 45.596 ms

List of potential nationalities

WDQ produces 571815 results.

g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value != 'somevalue' && it.P569value > Date.parse('yyyy', '1750')}
   .back('humans').claimVertices('P19').toCountry().as('countries').select(['humans', 'countries']){it.labelEn}{it.labelEn}

List of humans having occupation writer but not author

This one has 36K+ entries and takes a lot of time. There may be a more efficient way to write the same query.

w.benchmark { g.V.has('P106link', 'Q36180').filter{'Q5' in it.P31link && !('Q482980' in it.P106link)}.dump("authors", "wikibaseId", "labelEn") }

w.benchmark { t = []; g.V.has('P106link', 'Q36180').as('w').has('P106link', 'Q482980').aggregate(t).optional('w').except(t).dump("authors", "wikibaseId", "labelEn") }

Time: 86.017 s

List of humans with no date of death

WDQ produces 14431 results.

w.benchmark { g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value && it.P569value < Date.parse('yyyy', '1880')}.back('humans').filter{!it.out('P570').hasNext()}.dump("undead", "wikibaseId", "labelEn"); }

Time: 4763.817 s

This is too slow; it probably needs a value index.