This release is an open dataset modeled on the Danbooru 2018 dataset.
It contains 1,227,622 thumbnails (512x512 px images) from several imageboards, together with supporting metadata.
NOTE: THIS IS AN OBSOLETE VERSION OF THE DATASET. The current version consists of the 2021, 2015, 2022 and 2023 volumes and offers:
- many more images (4.1+M) of better quality (sample size 1280/1024px, without black boxes)
- more tag metadata, better file naming, and the most valuable tags stored in EXIF
- much more computed metadata (incl. bounding boxes)
- clustering and optimization for browsing …
NEVERTHELESS, THIS RELEASE IS STILL SUPPORTED. Its main features are:
- good technical and visual quality of the original images
- original size: width>=900, height>=900, MPixels>=1.2
- most comics, primitive drawings, and text-heavy images manually excluded
- no photos, almost no characterless scenes
- several sources, but each image is uniquely identified by %website% + %id%
- most original images can be found in torrents (nyaa, rutracker)
- originals can be selectively re-downloaded if the source website is available
- careful deduplication with relative website priorities, high to low (mostly)
- image file names are mostly structured as %website% - %id% - %copyright% ~ %characters% (%artist%)
- not completely SFW (a little bit softcore ecchi here and there)
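As a rough illustration of the naming scheme and size thresholds listed above, here is a minimal Python sketch. It assumes the literal separators " - " and " ~ ", a numeric %id%, an artist name without parentheses, and that MPixels means width x height in millions of pixels. The function names and the sample file name are hypothetical and are not part of the dataset's own scripts. Because names are only mostly structured, the parser returns None when a name does not fit.

```python
import re
from pathlib import Path

# Assumed layout: "%website% - %id% - %copyright% ~ %characters% (%artist%)".
# The separators and the numeric id are assumptions read off the pattern above.
NAME_RE = re.compile(
    r"(?P<website>\S+) - (?P<id>\d+) - (?P<copyright>.*?) ~ "
    r"(?P<characters>.*) \((?P<artist>[^()]*)\)"
)


def parse_name(file_name):
    """Split a structured file name into its fields, or return None."""
    stem = Path(file_name).stem  # drop the extension, e.g. ".jpg"
    match = NAME_RE.fullmatch(stem)
    return match.groupdict() if match else None


def meets_size_criteria(width, height):
    """Check the stated thresholds: width>=900, height>=900, MPixels>=1.2."""
    return width >= 900 and height >= 900 and width * height >= 1_200_000


if __name__ == "__main__":
    # Hypothetical file name, made up for illustration only.
    fields = parse_name("danbooru - 123456 - touhou ~ hakurei reimu (someartist).jpg")
    print(fields)
    print(meets_size_criteria(1000, 1200))
```

The %characters% field is kept as one raw string because the separator between multiple characters is not stated here. The pair (website, id) from the parsed fields can serve as the unique image key.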
The image timeline covers 10.2016 - 08.2019 densely and the earlier period selectively, split into “volumes”:
V2019 - 11.2018-08.2019 taken from rip https://nyaa.iss.one/view/1202653
V2018 - period 2017-2018 from rips https://nyaa.iss.one/view/1181364
https://www.acgnx.se/show-cceb3260269b5423cbd7f8d59f2c84531750923b.html
https://nyaa.iss.one/view/771715 and https://nyaa.iss.one/view/513582
and (Russian) https://rutracker.org/forum/viewtopic.php?t=5478026
V2016 - till 10.2016 from https://nyaa.iss.one/view/891391
partially used https://nyaa.iss.one/view/750972 and https://nyaa.iss.one/view/875411
V2016W - till 05.2016, converted to wallpaper sizes
https://nyaa.iss.one/view/710893, https://nyaa.iss.one/view/745633
and https://rutracker.org/forum/viewtopic.php?t=5198985
V2018D - remainder from https://nyaa.iss.one/view/1176129 that survived cleanup and deduplication, mostly 2015 and earlier
files renamed according to metadata, white backgrounds for addon-2018 replaced with black ones
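The deduplication by relative website priority mentioned earlier can be sketched roughly as follows. This is only an illustration under assumptions: how duplicates are detected is not described here, so the sketch starts from an already known group of duplicate keys, and the site names and priority order are hypothetical placeholders.

```python
def pick_by_priority(duplicates, priority):
    """Keep one (website, id) key from a group of duplicates.

    duplicates: list of (website, id) tuples that show the same picture.
    priority:   list of website names, highest priority first (assumed order).
    Websites missing from the priority list rank last.
    """
    rank = {site: position for position, site in enumerate(priority)}
    return min(duplicates, key=lambda key: rank.get(key[0], len(rank)))


if __name__ == "__main__":
    # Hypothetical site names and ids, for illustration only.
    group = [("siteB", "5"), ("siteA", "9"), ("siteC", "1")]
    print(pick_by_priority(group, ["siteA", "siteB"]))
```

When two candidates share the same rank, the first one in the group is kept, since `min` returns the first minimal element.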
Metadata:
- copyright, character and artist tag lists based on Danbooru tags
- copyrights bundled into Franchises
- characters refer to Franchises
- copyrights and characters refer to MyAnimeList entities
- statistical image properties, read from the JPG header or calculated:
  - entropy (complexity), skewness (darkness)
  - color count and intensity by channel
  - color saturation (grayness), edge intensity
  - bounding box coordinates and more
- face detection results (Nagadomi) with 3 levels of accuracy combined
- complete copyright / characters / artist metadata for 407,424 Safebooru posts:
  - Safebooru string tags with Danbooru tag-ids
  - Franchises wherever applicable
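For readers unfamiliar with the statistical properties listed above, the following Python sketch shows one common way to compute entropy and skewness from a flat list of grayscale pixel values. It is an assumption that the dataset's values were derived in a comparable way; the exact formulas and tools used for this release are not stated here, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (bits) of the intensity histogram; higher means more complex."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def skewness(pixels):
    """Third standardized moment of the intensities.

    Positive values mean most pixels are darker than the mean, with a
    bright tail; a flat (single-value) image returns 0.0 by convention.
    """
    total = len(pixels)
    mean = sum(pixels) / total
    m2 = sum((p - mean) ** 2 for p in pixels) / total
    if m2 == 0:
        return 0.0
    m3 = sum((p - mean) ** 3 for p in pixels) / total
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    # Tiny made-up "image": three black pixels and one white pixel.
    sample = [0, 0, 0, 255]
    print(shannon_entropy(sample), skewness(sample))
```

In practice the pixel list would come from an image library of your choice; the sketch keeps to the standard library so it stays self-contained.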
Software:
- Windows BAT scripts for processing with ImageMagick
- Python scripts for some grabbing and processing
This dataset may be used for large-scale local image processing and [meta-]data mining, e.g.:
- scene scale and composition classification, training / evaluation of species recognition algorithms
- visual quality and attractiveness ranking / prediction
- any imaginable metadata query, with visualized results at your fingertips