[Booru imageboards](https://www.kaggle.com/printcraft/eshuushuu-tags-v1) hold a huge number of images and are constantly replenished by end users and batch cross-postings.
There is no strict quality control, only general (booru-dependent) moderation rules and community self-cleanup.
As a result, raw imageboard content and the datasets built on it ([such as Danbooru 2020](https://www.gwern.net/Danbooru2020)) are somewhat "dirty",
and a lot of preprocessing is required to exclude the evident "outliers".
The [BOORU CHARS 2021 torrent](https://nyaa.iss.one/view/1384820) aimed to be of comparable size but much cleaner from the aesthetic and technical points of view, with the same
amount of community metadata (tags, user ratings, etc.) plus technical metadata about the images and their content.
**THIS IS THE ADD-ON to BOORU CHARS 2021**, with no duplicates and minimum similarities, for pre-2016 art.
The practical task was an almost automatic cleanup of ~510k images from a "cluttered basement":
- to throw out the ~10% "worst outliers"
- to rank the rest of the pictures with an "attractiveness score function"
The task is generally subjective, but there are some aesthetic "rules of thumb" that can be projected onto numeric criteria
and estimated with an image processing workflow (described in the README):
- format unification, JPEG quality arrangement, deduplication
- renaming to **%booru% - %fid% - %up_to_3_copyrights% ~ %up_to_5_characters% (%up_to_2_artists%)**
- clustering by aspect ratio 7x10 +/-4% >> 3x4 +/-10% >> 1x1 +/-20% >> 3x2 +/-40% >> 2x3 +/-40%
- downsizing to samples (max size 1280px, 1024px for 1x1), more deduplication
- general image statistics calculation (complexity, colorfulness, etc.)
- DEEP CONTENT ANALYSIS (text detection, detection of heads and other torso components, scene segmentation)
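
The aspect-ratio clustering and downsizing steps above could be sketched roughly as follows. This is only an illustration, not the code shipped in the release. It assumes that `7x10` means width 7 to height 10, that `>>` means "check in this order, first match wins", that the percentage is a relative deviation of width/height from the target ratio, and that "max size" refers to the longest side. The function names are hypothetical.

```python
from typing import Optional, Tuple

# (label, width units, height units, relative tolerance), checked in priority order.
# The order and tolerances come from the workflow list; their interpretation is an assumption.
ASPECT_CLUSTERS = [
    ("7x10", 7, 10, 0.04),
    ("3x4", 3, 4, 0.10),
    ("1x1", 1, 1, 0.20),
    ("3x2", 3, 2, 0.40),
    ("2x3", 2, 3, 0.40),
]


def aspect_cluster(width: int, height: int) -> Optional[str]:
    """Return the first cluster whose target ratio is within tolerance, else None."""
    ratio = width / height
    for label, w_units, h_units, tolerance in ASPECT_CLUSTERS:
        target = w_units / h_units
        if abs(ratio - target) <= tolerance * target:
            return label
    return None


def sample_size(width: int, height: int, label: Optional[str]) -> Tuple[int, int]:
    """Scale down so the longest side fits the limit (1024 for 1x1, 1280 otherwise)."""
    limit = 1024 if label == "1x1" else 1280
    longest = max(width, height)
    if longest <= limit:
        return width, height  # never upscale
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With these assumptions a 1920x1080 image falls into the `3x2` cluster, because it misses the three tighter clusters and is within 40% of 3:2.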
Then outliers were extracted with several "quality functions", and the "attractiveness score" was applied to the remainder.
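
A minimal sketch of this two-stage idea (filter first, then rank) might look like the block below. The actual quality functions and the score formula are described in the README, not here, so the checks `too_much_text` and `too_flat`, the statistic names `text_share` and `colorfulness`, and all thresholds are purely hypothetical placeholders.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Stats = Dict[str, float]  # per-image statistics; the keys used below are hypothetical


def split_and_rank(
    images: Sequence[Stats],
    quality_checks: Sequence[Callable[[Stats], bool]],
    score: Callable[[Stats], float],
) -> Tuple[List[Stats], List[Stats]]:
    """Return (outliers, kept), where kept is sorted by score, best first."""
    outliers: List[Stats] = []
    kept: List[Stats] = []
    for img in images:
        # An image is an outlier as soon as any quality check flags it.
        if any(check(img) for check in quality_checks):
            outliers.append(img)
        else:
            kept.append(img)
    kept.sort(key=score, reverse=True)
    return outliers, kept


# Hypothetical checks and score, for illustration only.
def too_much_text(img: Stats) -> bool:
    return img["text_share"] > 0.5


def too_flat(img: Stats) -> bool:
    return img["colorfulness"] < 0.05


def demo_score(img: Stats) -> float:
    return img["colorfulness"] - img["text_share"]
```

Keeping the outlier checks separate from the score means the score only has to order images that already passed every check.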
Unlike BC2021, some characterless scenes, cosplay, uncensored boobs and other off-topic content may be left in this release (no manual review here).
**This release contains:**
- **463,873 sampled images** with metadata
* clustered by aspect ratio and also number of heads (0,1,2,3+) detected
* ordered and grouped into 1000-th zip/folders by "attractiveness score function"
* with verbose file name and EXIF info
- detailed results for the detection algorithms and also the full tag list
- a Python script and data to visualize bounding-box-like detections
- several XLS files with summary query results
- sample code (command line, Python, PL/SQL): not "ready to use", but the key building blocks
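
To illustrate the verbose file names and the head-count grouping mentioned above, here is a small sketch. It is not taken from the release: the function names are hypothetical, the `", "` separator between multiple tags is an assumption (the template does not say how tags are joined), and the file extension is left out.

```python
from typing import Sequence


def head_bucket(n_heads: int) -> str:
    """Map a detected head count to one of the groups 0, 1, 2, 3+."""
    return "3+" if n_heads >= 3 else str(n_heads)


def verbose_name(
    booru: str,
    fid: int,
    copyrights: Sequence[str],
    characters: Sequence[str],
    artists: Sequence[str],
    sep: str = ", ",  # assumed separator between tags
) -> str:
    """Build '%booru% - %fid% - copyrights ~ characters (artists)' with the stated caps."""
    return "{} - {} - {} ~ {} ({})".format(
        booru,
        fid,
        sep.join(copyrights[:3]),  # up to 3 copyrights
        sep.join(characters[:5]),  # up to 5 characters
        sep.join(artists[:2]),     # up to 2 artists
    )
```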
NOTE: a substantial share of the images came from the releases [zerochan 2014](https://nyaa.iss.one/view/1336359),
[Safebooru pages](https://nyaa.iss.one/view/845106), [Safebooru 1280](https://nyaa.iss.one/view/719463), [Sankaku 2015](https://nyaa.iss.one/view/750972) and the ancient [e-shuushuu](https://nyaa.iss.one/view/513582).