Split large directory into subdirectories

Question:

I have a directory with about 2.5 million files that is over 70 GB in total.

I want to split this into subdirectories, each with 1000 files in them.

Here’s the command I’ve tried using:

i=0; for f in *; do d=dir_$(printf %03d $((i/1000+1))); mkdir -p $d; mv "$f" $d; let i++; done

That command works for me on a small scale, but I can leave it running for hours on this directory and it doesn’t seem to do anything.

I’m open to doing this in any way via the command line: Perl, Python, etc. Just whatever way would be the fastest to get this done…

Asked By: Edward


Answers:

I suspect that if you checked, you’d notice your program was actually moving the files, albeit really slowly. Launching a program is rather expensive (at least compared to making a system call), and you do so three or four times per file! As such, the following should be much faster:

perl -e'
   my $base_dir_qfn = ".";
   my $i = 0;
   my $dir_qfn;
   opendir(my $dh, $base_dir_qfn)
      or die("Can'\''t open dir \"$base_dir_qfn\": $!\n");

   while (defined( my $fn = readdir($dh) )) {
      next if $fn =~ /^(?:\.\.?|dir_\d+)\z/;

      my $qfn = "$base_dir_qfn/$fn";

      if ($i % 1000 == 0) {
         $dir_qfn = sprintf("%s/dir_%03d", $base_dir_qfn, int($i/1000)+1);
         mkdir($dir_qfn)
            or die("Can'\''t make directory \"$dir_qfn\": $!\n");
      }

      rename($qfn, "$dir_qfn/$fn")
         or do {
            warn("Can'\''t move \"$qfn\" into \"$dir_qfn\": $!\n");
            next;
         };

      ++$i;
   }
'
Answered By: ikegami

If the directory is not in use, I suggest the following:

find . -maxdepth 1 -type f | split -l 1000 -d -a 5 

This will create roughly 2,500 files named x00000 through x02499 (5 digits just to be safe, although 4 would work too). You can then move the 1000 files listed in each of them into a corresponding directory.

Perhaps set -o noclobber to eliminate the risk of overwrites in case of a name clash (see the mv -n variant sketched after the loop below).

To move the files, it’s easier to use brace expansion to iterate over the list-file names:

for c in x{00000..02500}; 
do d="d$c"; 
   mkdir $d; 
   cat $c | xargs -I f mv f $d; 
done 
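If overwrites are a worry, one possible variant (a sketch, assuming GNU or BSD mv, whose -n / no-clobber flag refuses to replace an existing file) is:

for c in x{00000..02500};
do d="d$c";
   mkdir "$d";
   cat "$c" | xargs -I f mv -n f "$d";
done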
Answered By: karakfa

I would use the following from the command line:

find . -maxdepth 1 -type f | split -l 1000
for i in `ls x*`
do 
   mkdir dir$i
   mv `cat $i` dir$i 2>/dev/null &
done

The key is the “&”, which runs each mv statement in the background.
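For instance, if this runs as a script, a final wait keeps the shell from exiting before the backgrounded mv jobs finish (a sketch which, like the loop above, assumes filenames without spaces):

find . -maxdepth 1 -type f | split -l 1000
for i in `ls x*`
do
   mkdir dir$i
   mv `cat $i` dir$i 2>/dev/null &
done
wait    # block until every backgrounded mv has completed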

Thanks to karakfa for the split idea.

Answered By: Sean Paine

Note: ikegami’s helpful Perl-based answer is the way to go – it performs the entire operation in a single process and is therefore much faster than the Bash + standard utilities solution below.


A bash-based solution needs to avoid loops in which external utilities are called in order to perform reasonably.
Your own solution calls two external utilities (mkdir and mv) and creates a subshell in each loop iteration, which means that, with 2.5 million files, you’ll end up creating about 7.5 million processes(!) in total.

The following solution avoids loops, but, given the sheer number of input files, will still take quite a while to complete (you’ll end up creating 4 processes for every 1000 input files, i.e., ca. 10,000 processes in total):

printf '%s\0' * | xargs -0 -n 1000 bash -O nullglob -c '
  dirs=( dir_*/ )
  dir=dir_$(printf %04d $(( 1 + ${#dirs[@]} )))
  mkdir "$dir"; mv "$@" "$dir"' -
  • printf '%s\0' * prints a NUL-separated list of all files in the dir.
    • Note that since printf is a Bash builtin rather than an external utility, the max. command-line length as reported by getconf ARG_MAX does not apply (see the short demonstration after this list).
  • xargs -0 -n 1000 invokes the specified command with chunks of 1000 input filenames.

    • Note that xargs -0 is nonstandard, but supported on both Linux and BSD/OSX.
    • Using NUL-separated input robustly passes filenames without fear of inadvertently splitting them into multiple parts, and even works with filenames with embedded newlines (though such filenames are very rare).
  • bash -O nullglob -c executes the specified command string with option nullglob turned on, which means that a globbing pattern that matches nothing will expand to the empty string.

    • The command string counts the output directories created so far to determine the name of the next output dir (with the next higher index), creates that dir, and moves the current batch of (up to) 1000 files there.
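To see why the builtin matters with 2.5 million names, here is a rough sketch (assuming printf is also installed as /usr/bin/printf; the external-utility line is illustrative and whether it fails depends on your system’s limit):

getconf ARG_MAX                         # the kernel's per-exec argument-size limit
printf '%s\0' * > /dev/null             # builtin: no exec happens, so the limit does not apply
/usr/bin/printf '%s\0' * > /dev/null    # external utility: with millions of names this will
                                        # typically fail with "Argument list too long"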
Answered By: mklement0

This is probably slower than a Perl program (about 1 minute for 10,000 files), but it should work with any POSIX-compliant shell.

#! /bin/sh
nd=0
nf=0
/bin/ls | 
while read file;
do
  case $(expr $nf % 10) in
  0)
    nd=$(/usr/bin/expr $nd + 1)
    dir=$(printf "dir_%04d" $nd)
    mkdir $dir
    ;;
  esac
  mv "$file" "$dir/$file"
  nf=$(/usr/bin/expr $nf + 1)

done

With bash, you can use arithmetic expansion $((…)).
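For instance, the expr calls might be replaced like this (a sketch that batches 1000 files per directory, as the question asks, whereas the script above uses 10):

#! /bin/bash
nd=0
nf=0
ls |
while read file
do
  if [ $((nf % 1000)) -eq 0 ]; then
    nd=$((nd + 1))                    # arithmetic expansion instead of expr
    dir=$(printf "dir_%04d" "$nd")
    mkdir "$dir"
  fi
  mv "$file" "$dir/$file"
  nf=$((nf + 1))
done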

And of course this idea can be improved by using xargs – should not take longer than ~ 45 sec for 2.5 million files.

nd=0
ls | xargs -L 1000 echo | 
while read cmd;
do
  nd=$((nd+1))
  dir=$(printf "dir_%04d" $nd)
  mkdir $dir
  mv $cmd $dir
done
Answered By: laune

Moving files around is always a challenge. IMHO all the solutions presented so far do not work and have some risk of destroying your files. This may be because the challenge sounds simple, but there is a lot to consider and to test when implementing it.

We must also not underestimate the efficiency of the solution as we are potentially handling a (very) large number of files.

Here is a script that I have carefully and intensively tested with my own files. But of course, use it at your own risk!

This solution:

  • is safe with filenames that contain spaces.
  • does not use xargs -L because this will easily result in "Argument list too long" errors
  • is based on Bash 4 and does not depend on awk, sed, tr etc.
  • is very fast when there are no files to move.

Here is the code:

if [[ "${BASH_VERSINFO[0]}" -lt 4 ]]; then
  echo "$(basename "$0") requires Bash 4+"
  exit -1
fi >&2

opt_dir=${1:-.}
opt_max=1000

readarray files <<< "$(find "$opt_dir" -maxdepth 1 -mindepth 1 -type f)"
moved=0 dirnum=0 dirname=''

for ((i=0; i < ${#files[@]}; ++i))
do
  if [[ $((i % opt_max)) == 0 ]]; then
    ((dirnum++))
    dirname="$opt_dir/$(printf "%02d" $dirnum)"
  fi
  # chops the LF printed by "find"
  file=${files[$i]::-1}
  if [[ -n $file ]]; then
    [[ -d $dirname ]] || mkdir -v "$dirname" || exit
    mv "$file" "$dirname" || exit
    ((moved++))
  fi
done

echo "moved $moved file(s)"

For example, save this as split_directory.sh. Now let’s assume you have 2001 files in some/dir:

 $ split_directory.sh some/dir
mkdir: created directory some/dir/01
mkdir: created directory some/dir/02
mkdir: created directory some/dir/03
moved 2001 file(s)

Now some/dir contains 3 directories and 0 files:

  • some/dir/01 and some/dir/02 each contain 1000 files
  • some/dir/03 contains 1 file

Calling the script again on the same directory is safe and fast:

 $ split_directory.sh some/dir
moved 0 file(s)

Finally, let’s take a look at the special case where we call the script on one of the generated directories:

 $ time split_directory.sh some/dir/01
mkdir: created directory 'some/dir/01/01'
moved 1000 file(s)

real    0m19.265s
user    0m4.462s
sys     0m11.184s
 $ split_directory.sh some/dir/01
moved 0 file(s)

real    0m0.140s
user    0m0.015s
sys     0m0.123s

Note that this test ran on a fairly slow, veteran computer.

Good luck 🙂

Answered By: Andreas Spindler