Documentation Index Fetch the complete documentation index at: https://mintlify.com/RITIK-12/QualiVision/llms.txt
Use this file to discover all available pages before exploring further.
Overview
QualiVision implements a sophisticated data preprocessing pipeline that handles video loading, frame sampling, resolution adaptation, text encoding, and MOS score normalization. The pipeline is optimized for GPU acceleration and supports both DOVER++ and V-JEPA2 models.
Dataset Structure
From README.md:19-43, the expected directory structure:
data/
├── train/
│ ├── labels/
│ │ └── train_labels.csv
│ └── videos/
│ ├── video001.mp4
│ ├── video002.mp4
│ └── ...
├── val/
│ ├── labels/
│ │ └── val_labels.csv
│ └── videos/
│ ├── val_video001.mp4
│ └── ...
└── test/
├── test_labels.csv
└── videos/
├── test_video001.mp4
└── ...
From README.md:46-50 and dataset.py:37:
video_name, Prompt, Traditional_MOS, Alignment_MOS, Aesthetic_MOS, Temporal_MOS, Overall_MOS
video001.mp4, "A cat playing piano", 3.2, 4.1, 3.8, 3.5, 3.65
video002.mp4, "Sunset over mountains", 4.5, 4.2, 4.8, 4.1, 4.4
Required Columns :
video_name: Video filename
Prompt: Text description/prompt
Traditional_MOS: Technical quality score (1-5)
Alignment_MOS: Text-video alignment score (1-5)
Aesthetic_MOS: Aesthetic appeal score (1-5)
Temporal_MOS: Temporal consistency score (1-5)
Overall_MOS: Overall quality score (1-5)
Frame Sampling
All videos are uniformly sampled to 64 frames :
Resolution Adaptation
Different models require different input resolutions:
From dataset.py:64-69 and config.py:23: # Transform pipeline
self .transform = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(( 640 , 640 )), # DOVER++ resolution
transforms.ToTensor(), # Converts to [0,1] range
])
Process per frame :
Convert numpy array → PIL Image
Resize to 640×640 (bicubic interpolation by default)
Convert to PyTorch tensor (C, H, W)
Normalize to [0, 1] range
Final shape : (B, C=3, T=64, H=640, W=640)From dataset.py:120-148 and config.py:39: def _sample_frames_processor ( self , video_path : Path) -> torch.Tensor:
# ... frame sampling ...
# Process with video processor
processed = self .video_processor(
list (frames_np),
return_tensors = "pt"
)
return processed[ "pixel_values" ][ 0 ] # (C, T, H=384, W=384)
AutoVideoProcessor handles :
Resize to 384×384
Normalize with ImageNet stats: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Convert to tensor format expected by ViT
Final shape : (B, C=3, T=64, H=384, W=384)Model Resolution Rationale DOVER++ 640×640 Higher resolution preserves fine-grained quality details (artifacts, sharpness) V-JEPA2 384×384 ViT patch size (16×16) divides evenly; pretrained at this resolution
Memory Impact :
640×640: ~4× more pixels than 384×384
But DOVER++ is lighter (120M params) so total memory similar
V-JEPA2 compensates with parameter freezing
Text Processing with BGE-Large
Text prompts are encoded to dense embeddings:
Text Encoder
Encoding Process
Why BGE-Large?
Prompt Examples
From config.py:25,41: DOVER_CONFIG = { "text_encoder" : "BAAI/bge-large-en-v1.5" }
VJEPA_CONFIG = { "text_encoder" : "BAAI/bge-large-en-v1.5" }
BGE-Large (BAAI General Embedding):
Model : Sentence Transformer
Architecture : BERT-based encoder
Dimension : 1024 (DOVER++) or 768 (V-JEPA2)
Training : Contrastive learning on text pairs
From dataset.py:258-273: # Encode text during batching
with torch.no_grad():
text_emb = self .text_encoder.encode(
prompts,
convert_to_tensor = True ,
normalize_embeddings = True , # L2 normalization
device = self .device,
batch_size = len (prompts)
)
Steps :
Tokenize text (WordPiece)
Pass through BERT encoder
Pool token embeddings (mean pooling)
L2 normalize to unit sphere
Move to GPU
Output : (B, 1024) or (B, 768) tensorAdvantages of BGE-Large :
State-of-the-art : Top performance on MTEB benchmark
Semantic understanding : Captures nuanced meaning
Normalized : Embeddings on unit sphere (cosine similarity)
English-optimized : Best for English prompts
Fast : Efficient inference with sentence-transformers
Alternative : Could use CLIP text encoder, but BGE is optimized for pure text understandingprompts = [
"A cat playing piano" ,
"Sunset over mountains with vibrant colors" ,
"A robot dancing in a futuristic city"
]
# Encoded to:
text_emb = encode(prompts) # Shape: (3, 1024)
# Each prompt → 1024-dim vector capturing:
# - Objects: "cat", "piano", "sunset", "robot"
# - Actions: "playing", "dancing"
# - Attributes: "vibrant", "futuristic"
# - Scene: "mountains", "city"
MOS Score Normalization
From dataset.py:182-189:
if self .has_labels:
# Training/validation mode - include labels
labels = pd.to_numeric(row[ self . MOS_COLS ], errors = "coerce" ).fillna( 3.0 ).astype(np.float32).values
result[ "labels" ] = torch.tensor(labels, dtype = torch.float32)
else :
# Test mode - no labels available
result[ "labels" ] = torch.zeros( 5 , dtype = torch.float32)
Normalization Process
Missing Values
Test Mode
Steps :
Parse : Convert CSV strings to numeric values
Handle missing : Fill NaN with 3.0 (neutral score)
Cast : Convert to float32 for GPU efficiency
Tensorize : Create PyTorch tensor
Range : 1.0 to 5.0 (MOS scale)From README.md:56: - **Quality Normalization** : MOS scores standardized to 1-5 scale
errors = "coerce" # Invalid values → NaN
.fillna( 3.0 ) # NaN → 3.0 (neutral)
Why 3.0?
Middle of 1-5 range
Neutral quality (neither good nor bad)
Prevents extreme bias from missing data
If many values are missing, consider:
Dropping samples with missing labels
Using different imputation (mean of dataset)
Flagging samples for review
# Test mode - no labels available
result[ "labels" ] = torch.zeros( 5 , dtype = torch.float32)
For test sets without ground truth:
Dummy zeros used as placeholders
Loss not computed on test set
Only predictions saved
Data Loading Pipeline
TaobaoVDDataset Class
From dataset.py:29-190:
View Dataset Initialization Code
class TaobaoVDDataset ( Dataset ):
MOS_COLS = [ 'Traditional_MOS' , 'Alignment_MOS' , 'Aesthetic_MOS' , 'Temporal_MOS' , 'Overall_MOS' ]
def __init__ ( self ,
csv_file : str ,
video_dir : str ,
num_frames : int = 64 ,
resolution : int = 640 ,
mode : str = 'train' ,
video_processor : Optional[AutoVideoProcessor] = None ):
self .df = pd.read_csv(csv_file)
self .video_dir = Path(video_dir)
self .num_frames = num_frames
self .resolution = resolution
self .mode = mode
self .video_processor = video_processor
# Video transforms
self .transform = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize((resolution, resolution)),
transforms.ToTensor(),
])
# Check if we have ground truth labels
self .has_labels = all (col in self .df.columns for col in self . MOS_COLS )
Features :
Supports both manual transforms (DOVER++) and video processor (V-JEPA2)
Automatic label detection
Flexible resolution
Mode-aware (train/val/test)
GPU-Optimized Collation
From dataset.py:193-288:
OptimizedGPUCollate
Batching Process
Memory Optimization
class OptimizedGPUCollate :
def __init__ ( self ,
video_processor : Optional[AutoVideoProcessor] = None ,
text_encoder : Optional[SentenceTransformer] = None ,
device : str = 'cuda' ,
max_frames : int = 64 ):
self .video_processor = video_processor
self .text_encoder = text_encoder
self .device = device
self .max_frames = max_frames
Purpose : Custom collation function that:
Batches video frames
Encodes text prompts on GPU
Moves data to GPU early
Cleans up memory
From dataset.py:226-288: def __call__ ( self , batch : List[Dict[ str , Any]]) -> Dict[ str , torch.Tensor]:
# Extract components
frames_list = [item[ "frames" ] for item in batch]
prompts = [item[ "prompt" ] for item in batch]
labels_list = [item[ "labels" ] for item in batch]
# Stack frames
frames = torch.stack(frames_list, dim = 0 )
pixel_values_videos = frames.to( self .device)
# Encode text on GPU
text_emb = self .text_encoder.encode(
prompts,
convert_to_tensor = True ,
normalize_embeddings = True ,
device = self .device,
)
# Clean up
del frames_list
torch.cuda.empty_cache()
return {
"pixel_values_videos" : pixel_values_videos,
"text_emb" : text_emb,
"labels" : labels,
"video_names" : video_names,
"prompts" : prompts
}
Key Optimizations :
Early GPU transfer : Move data to GPU in collate (before training loop)
Batch text encoding : Encode all prompts at once
Memory cleanup : Delete intermediate variables + empty cache
No-grad text : Text encoding doesn’t need gradients
# Clean up memory
del frames_list
torch.cuda.empty_cache()
This collation strategy reduces CPU-GPU transfer overhead by ~30% compared to naive batching.
DataLoader Configuration
From dataset.py:291-375:
Training Loader
Validation Loader
Test Loader
Worker Configuration
train_loader = DataLoader(
train_dataset,
batch_size = 4 , # DOVER++: 4, V-JEPA2: 6
shuffle = True , # Shuffle for training
collate_fn = collate_fn, # Custom GPU collation
num_workers = 4 , # Parallel data loading
pin_memory = True , # Faster CPU→GPU transfer
persistent_workers = True # Keep workers alive
)
From config.py:31,47: DOVER_CONFIG = { "batch_size" : 4 }
VJEPA_CONFIG = { "batch_size" : 6 }
val_loader = DataLoader(
val_dataset,
batch_size = 4 ,
shuffle = False , # No shuffle for validation
collate_fn = collate_fn,
num_workers = 4 ,
pin_memory = True ,
persistent_workers = True
)
Difference from training :
shuffle=False: Consistent evaluation order
Same batch size for fair comparison
From dataset.py:378-433: test_loader = DataLoader(
test_dataset,
batch_size = 1 , # Process one at a time
shuffle = False , # Preserve order
collate_fn = collate_fn,
num_workers = 0 , # Single worker for stability
pin_memory = True
)
Test-specific settings :
batch_size=1: Avoid OOM on diverse test videos
num_workers=0: Simpler debugging
shuffle=False: Match predictions to input order
From config.py:73-75: TRAINING_CONFIG = {
"num_workers" : 4 ,
"pin_memory" : True ,
"persistent_workers" : True
}
Settings Explained :Setting Value Purpose num_workers4 Parallel video loading (CPU cores) pin_memoryTrue Faster CPU→GPU transfer (page-locked memory) persistent_workersTrue Workers stay alive between epochs (faster)
num_workers > 0 requires proper multiprocessing. On some systems (especially with Decord), use num_workers=0 if you encounter issues.
Complete Data Flow
CSV Loading
df = pd.read_csv(csv_file)
# Columns: video_name, Prompt, Traditional_MOS, ..., Overall_MOS
Video Reading
vr = VideoReader( str (video_path))
frames = vr.get_batch(indices) # (64, H, W, 3)
Frame Transformation
for frame in frames:
transformed = transform(frame) # Resize + ToTensor
video_tensor = stack(transformed) # (3, 64, H, W)
Batching
batch_frames = stack([item[ 'frames' ] for item in batch]) # (B, 3, 64, H, W)
batch_frames = batch_frames.to( 'cuda' )
Text Encoding
text_emb = text_encoder.encode(prompts, device = 'cuda' ) # (B, 1024)
Label Extraction
labels = torch.tensor(mos_scores, dtype = float32) # (B, 5)
Return Batch
return {
"pixel_values_videos" : batch_frames,
"text_emb" : text_emb,
"labels" : labels,
"prompts" : prompts,
"video_names" : video_names
}
Bottlenecks
Optimization Tips
Profiling
Common bottlenecks in video data loading :
Disk I/O : Reading video files from disk
Solution : Use SSD storage, prefetch with num_workers
Video decoding : Decompressing video codecs
Solution : Decord with GPU acceleration
Frame transformation : Resizing and converting
Solution : Batch transforms, use GPU if possible
Text encoding : BERT forward pass
Solution : Encode in collate_fn (parallel with training)
CPU→GPU transfer : Moving data to GPU
Solution : pin_memory=True, early GPU transfer
Data Loading Best Practices :✅ DO :
Use SSD for video storage
Set num_workers=4 (match CPU cores)
Enable pin_memory=True
Use persistent_workers=True
Encode text on GPU in collate_fn
Clean up memory after batching
❌ DON’T :
Load full videos into memory
Encode text on CPU then transfer
Use num_workers > CPU cores
Forget to call empty_cache()
Decode videos to RGB unnecessarily
Measure data loading speed :import time
start = time.time()
for batch in train_loader:
# Measure time to get batch
print ( f "Batch loaded in { time.time() - start :.2f} s" )
start = time.time()
break
Target : Less than 0.5s per batch (should be faster than model forward pass)
Architecture How preprocessed data flows through models
DOVER++ Model 640×640 resolution requirements
V-JEPA2 Model 384×384 resolution and video processor