BOORU CHARS volume 2023 **completes** an attempt to consolidate and arrange the available character-centric,
mostly SFW anime/CG/game art into a local format suited to both batch processing and visual evaluation.
The whole project, as it evolved, consists of (in release order):
- [BOORU_CHARS_2021](https://nyaa.iss.one/view/1384820) 1.593.429 images 472 GB topic starter
- [BOORU_CHARS_2015](https://nyaa.iss.one/view/1468367) 463.873 images 148 GB old stuff
- [BOORU_CHARS_2022](https://nyaa.iss.one/view/1547662) 705.467 images 191 GB newcomers
- [BOORU_CHARS_2023](https://nyaa.iss.one/view/1740396) 1.153.513 images 302 GB this one
It is strongly recommended to read the READMEs there and, of course, to download and seed the releases.
Almost 4M carefully selected samples (~1.2 TB) are ready, so let's use them for something meaningful!
This release covers:
- ~98% new images from composite rips
* [volume V2022C](https://nyaa.iss.one/view/1574093) 05.2022 - 08.2022
* [volume V2022D](https://nyaa.iss.one/view/1634287) 08.2022 - 11.2022 (internal partition 2022)
* [volume V2023A](https://nyaa.iss.one/view/1720018) 12.2022 - 03.2023
* [volume V2023B](https://nyaa.iss.one/view/1727186) 03.2023 - 06.2023
* [volume V2023C](https://nyaa.iss.one/view/1733499) 06.2023 - 08.2023 (internal partition 2023)
- some old imageboard material that was missed in BC2015 (internal partition 2016)
- ~20% "the best of" [Dark Pixiv Collection project 202209](https://nyaa.iss.one/view/1626495)
* treated as the "imageboard" pixiv.sfs; the long image ID includes the artist ID, post ID and post version
* filtered by minimum size and volume
* semi-automatic NSFW cleanup done
* deduplicated with all other BCs
* included in 2016 partition
As with the whole project:
- files are uniquely identified by the (booru + fid) key, i.e. imageboard name and post ID;
verbose file naming is used: **%booru% - %fid% - %up-to-3-copyrights% ~ %up-to-5-characters% (%up-to-2-artists%)**
- images are clustered by aspect ratio, in priority order from high to low: 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40%
- (as in the composite rips) images are converted to JPG and
* resized to 1280px on the longest side (1024px for 1x1)
* re-mogrified to 94% JPEG quality from 98-100%
- imageboard tags are arranged and partially embedded in the image EXIF data
- some general image statistics are gathered with [ImageMagick](https://imagemagick.org)
- content analysis is basically the same as for BC2022, but with more advanced software and models
* [CRAFT text detector](https://github.com/fcakyon/craft-text-detector) is used to estimate the total size and number of text pieces
* torso components are detected with [custom PyTorch models](https://github.com/aperveyev/booru_yolo/tree/main/models)
built on [Ultralytics YOLOv8](https://github.com/ultralytics/ultralytics); the number of heads detected is used for folder clustering
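
The aspect-ratio rule in the list above can be sketched as a first-match lookup in priority order. This is only an illustration, not the uploader's actual tooling: the names `ASPECT_CLUSTERS` and `aspect_cluster` are hypothetical, and reading "+/-N%" as a relative deviation of width/height from the target ratio is an assumption.

```python
# Target ratios (width / height) in priority order, high to low, with the
# tolerances quoted in the list above. Interpreting each tolerance as a
# relative deviation of width/height is an assumption of this sketch.
ASPECT_CLUSTERS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1 / 1, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_cluster(width, height):
    """Return the name of the first cluster whose tolerance band contains
    the image's aspect ratio, or None if no band matches."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    for name, target, tolerance in ASPECT_CLUSTERS:
        if abs(ratio / target - 1.0) <= tolerance:
            return name
    return None
```

Because the 3x2 and 2x3 bands are wide and overlap the narrower ones, the priority order decides the outcome: under these assumptions a 750x1000 image lands in 3x4 rather than 2x3.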
This release contains BC2023 itself:
- **1.153.513** resized images clustered by aspect ratio and by number of heads detected
(0 heads = letter A, 2 = B, 3+ = C, 1 = letter E in the folder name),
ordered by the "attractiveness score function" and grouped into zips/folders of ~1000/2000 images
- tab-separated text files, zipped into one archive:
* **BC_2023.tsv** file/image related metadata
* **BC_2023_tags.tsv** tags list with Danbooru enrichment 25.250.897 rows
* **BC_2023_yolo.tsv** 3.877.682 detailed results for torso detection
- dedicated **bc_readme.txt** with detailed description and examples
and also a huge **crossBOORU catalog** of URLs, tags and other metadata (partitioned by the first letter of the MD5 hash, zipped):
- **BOORU_*.tsv** 17.733.350 items (not only images) identified by MD5 with 35.033.097 (usually redundant) URLs
- **BOORU_*_TG.tsv** correlated artist / copyright / character tag list 63.900.184 rows
- **BOORU_TG.tsv** 1.014.481 tags registry zipped
- separate **booru_readme.txt** with a detailed description and examples
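
To look up a local file in the catalog, one would first need its MD5 and the matching partition. A minimal sketch follows; the helper names are hypothetical, and it assumes that the `*` in **BOORU_*.tsv** is the first hex character of the MD5 in lower case (check **booru_readme.txt** for the actual naming and column layout).

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def partition_key(md5_hexdigest):
    """First character of the MD5 hex digest, used as the partition key."""
    if not md5_hexdigest:
        raise ValueError("empty MD5 digest")
    return md5_hexdigest[0].lower()


def catalog_partition_name(md5_hexdigest):
    """Hypothetical file name of the catalog partition for this MD5.

    ASSUMPTION: the partition letter replaces '*' in BOORU_*.tsv as-is.
    """
    return f"BOORU_{partition_key(md5_hexdigest)}.tsv"
```

For example, `catalog_partition_name(md5_hex("some_image.jpg"))` would name the partition to search under these assumptions.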
Similarly to BC2015 and BC2022:
- simple numerical ranks have been built across clusters of images for each numerical criterion,
so both outlier processing and ranking use only relative ranks or simple functions of them
- "the worst of" outliers were deleted (rank by rank, ~2% in total)
- the "attractiveness score function" finally settled on the definition "colorful, textless and least crowded (for 3+ heads)"
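
The rank-based approach described above can be illustrated as follows. The exact score function is not given here, so this is only a sketch under stated assumptions: the criteria `colorfulness` and `text_area` and the simple difference of ranks are hypothetical stand-ins for "colorful" and "textless", and the function names are invented for illustration.

```python
def relative_ranks(values):
    """Map each value to its relative rank in [0, 1] within one cluster.

    Tied values share the average of their rank positions.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 / (n - 1)
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def attractiveness(colorfulness, text_area):
    """Hypothetical score: high colorfulness rank, low text-area rank.

    ASSUMPTION: both lists describe the same images of one cluster.
    """
    color_rank = relative_ranks(colorfulness)
    text_rank = relative_ranks(text_area)
    return [c - t for c, t in zip(color_rank, text_rank)]
```

Working on ranks rather than raw values keeps criteria with very different scales comparable, and the lowest-ranked images per criterion are natural candidates for outlier removal.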
AlexPUA (uploader)