km.azerttyu.net


Mass conversion from one character set to another

Thursday 12 February 2015, by km

When you take over a project, you often inherit a pile of files and a list of problems as long as your arm. One of the most irritating: what on earth is this charset?!

Preamble

Historically, computer files are stored according to a character set. The character set maps each letter displayed or typed to its numeric counterpart. Since the world is not uniform, the reference character set differs from one place to another. In France, for example, files are often in latin1 (ISO-8859-1) or in its close relative ISO-8859-15 (latin9, which adds the euro sign among others). You may also run into plain old ASCII, which has no accented characters (the label "ANSI" usually refers to Windows code pages such as Windows-1252, which do have them).
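To make the letter-to-number mapping concrete, here is a small sketch that prints the bytes a character becomes in a given charset. The helper name hex_bytes is made up for this illustration, and the snippet assumes iconv, od and tr are available.

```shell
# hex_bytes TEXT CHARSET
# Hypothetical helper: TEXT is given as UTF-8, and the function prints the
# bytes it becomes once re-encoded in CHARSET, as lowercase hexadecimal.
hex_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | od -An -tx1 | tr -d ' \n'
}

# The characters are written as octal escapes so that the example does not
# depend on the encoding of the script file itself.
e_acute=$(printf '\303\251')   # "é" encoded in UTF-8
euro=$(printf '\342\202\254')  # "€" encoded in UTF-8

hex_bytes "$e_acute" ISO-8859-1;  echo   # e9   (one byte)
hex_bytes "$e_acute" UTF-8;       echo   # c3a9 (two bytes)
hex_bytes "$euro"    ISO-8859-15; echo   # a4
```

The same letter is a single byte in latin1 and two bytes in UTF-8, which is why a file read with the wrong charset shows garbled accents.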

To make it easier for people to work together, a much broader character set was proposed: Unicode, most often stored in the UTF-8 encoding. It is now so widespread that it has become the standard that is followed, and that should be followed.

Back to the matter at hand

So now we want to put all the files back in order and have them use UTF-8. Among the various scripts found here and there, the one proposed by lexo.ch seems the most complete to me.
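Before converting anything, it can help to see which charsets are actually present in the project. The following is only a sketch: the function name charset_inventory is hypothetical, and it assumes a file(1) implementation that supports the --mime-encoding option.

```shell
# charset_inventory DIR
# Hypothetical helper: counts the files under DIR by detected charset,
# skipping anything inside a .git directory, most frequent charset first.
charset_inventory() {
    find "$1" -type f ! -path '*/.git/*' \
        -exec file -b --mime-encoding {} + | sort | uniq -c | sort -rn
}

# Usage (the path is only an example):
#   charset_inventory /path/to/directory
```

Each output line is a count followed by a charset name such as utf-8, iso-8859-1, us-ascii or binary. Detection is a heuristic, so treat the result as a hint rather than a guarantee.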

Here is the script I use. From memory, it has been adapted at the margins for my needs.

    #!/bin/bash

    # Created by LEXO, http://www.lexo.ch
    # Version 1.0
    #
    # This bash script converts all files from within a given directory from any charset to UTF-8 recursively.
    # It keeps track of those files that cannot be converted automatically. Usually this happens when the original charset
    # cannot be recognized. In that case you should load the corresponding file into a development editor like Netbeans
    # or Komodo and apply the UTF-8 charset manually.
    #
    # This is free software. Use and distribute but do it at your own risk.
    # We will not take any responsibilities for failures and do not provide any support.

    # checking parameters
    if [ -z "$1" ] ; then
        echo "You did not supply any directory at the command line."
        echo "You need to provide the path to the directory that contains the files which you want to be converted"
        echo ""
        echo "Example: $0 /path/to/directory"
        echo ""
        echo "Important hint: You should not run this script from within the same directory where the files are stored"
        echo "that you want to convert right now."
        exit 1
    fi

    # This array contains file extensions that need to be checked no matter if the filetype is binary or not.
    # Reason: Sometimes it happens that .htm(l), .php, .tpl files etc. have a binary charset type. This script
    # does not convert binary file types into utf-8 because it might destroy your data. So we need to include these file types
    # into the conversion system manually to tell the conversion that binary files with these special extensions may be converted anyway.
    filestoconvert=(htm html php txt tpl asp css js xml sh)

    # define colors
    # default color
    reset="\033[0;00m"
    # Successful conversion (green)
    success="\033[1;32m"
    # No conversion needed (blue)
    noconversion="\033[1;34m"
    # file skipped because it's not mentioned in the filestoconvert array (white)
    fileskipped="\033[1;37m"
    # files that could not be converted aka error (red)
    fileconverterror="\033[1;31m"

    ## function to convert all files in a directory recursively
    function convert {
        # clear screen first
        clear

        dir=$1
        fileerrors=""

        # loop counter
        i=0

        # The file list is fed to the loop through process substitution rather than a pipe:
        # the loop then runs in the current shell, so $fileerrors is still set after "done".
        while IFS= read -r inputfile
        do
            if [ -f "$inputfile" ] ; then
                charset="$(file -bi "$inputfile" | awk -F "=" '{print $2}')"
                if [ "$charset" != "utf-8" ]; then
                    filename=$(basename "$inputfile")
                    extension="${filename##*.}"
                    # Only files whose extension is listed in the array $filestoconvert are converted; all others are skipped
                    if in_array "$extension" "${filestoconvert[@]}" ; then
                        # create a tempfile and remember the current owner, group and permissions
                        # to be able to reapply them to the converted file
                        tmp=$(mktemp)
                        owner=$( stat --format=%U "$inputfile" )
                        group=$( stat --format=%G "$inputfile" )
                        octalpermission=$( stat --format=%a "$inputfile" )
                        if iconv -f "$charset" -t UTF-8 "$inputfile" -o "$tmp" 2>/dev/null ; then
                            echo -e "$success $inputfile\t$charset\t->\tUTF-8 $reset"
                            mv "$tmp" "$inputfile"
                            # re-apply previous file permissions as well as user and group settings
                            chown "$owner:$group" "$inputfile"
                            chmod "$octalpermission" "$inputfile"
                        else
                            # There was an error converting the file. Leave the original untouched, remember it
                            # and inform the user at the end of the conversion process.
                            rm -f "$tmp"
                            fileerrors="$fileerrors\n$inputfile"
                        fi
                    else
                        echo -e "$fileskipped $inputfile\t$charset\t->\tSkipped because its extension (.$extension) is not listed in the 'filestoconvert' array. $reset"
                    fi
                else
                    echo -e "$noconversion $inputfile\t$charset\t->\tNo conversion needed (file is already UTF-8) $reset"
                fi
            fi
            (( ++i ))
        done < <(find "$dir" -type f ! -path "*/.git/*")
        echo -e "$success Done! $reset"
        echo -e ""
        echo -e ""
        if [ -n "$fileerrors" ]; then
            echo -e "The following files had errors (origin charset not recognized) and need to be converted manually (e.g. by opening the file in an editor (IDE) like Komodo or Netbeans):"
            echo -e "$fileconverterror$fileerrors$reset"
        fi
        exit 0
    } #end function convert()

    # Check if a value exists in an array
    # @param $1 mixed Needle
    # @param $2 array Haystack
    # @return Success (0) if value exists, Failure (1) otherwise
    # Usage: in_array "$needle" "${haystack[@]}"
    in_array() {
        local needle=$1
        local hay
        shift
        for hay; do
            # echo "Hay: $hay , Needle: $needle"
            [[ $hay == "$needle" ]] && return 0
        done
        return 1
    } #end function in_array

    # start conversion
    convert "$1"
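
For a single file, the core of the conversion can be sketched much more briefly. This is not the lexo.ch method but a variant under stated assumptions: the function name to_utf8_in_place is hypothetical, the source charset is passed explicitly instead of being detected, and the converted content is written back into the existing file with cat, so the owner and permissions do not need to be saved and restored.

```shell
# to_utf8_in_place FILE FROM_CHARSET
# Hypothetical helper: converts FILE from FROM_CHARSET to UTF-8.
# The original is only overwritten if iconv succeeded.
to_utf8_in_place() {
    local file=$1 from=$2 tmp status=0
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        # Overwrite the content in place: the file keeps its owner and mode.
        cat "$tmp" > "$file" || status=1
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage (the path is only an example):
#   to_utf8_in_place /path/to/page.html ISO-8859-1
```

The trade-off is that writing in place is not atomic: an interruption during the final cat can leave a truncated file, whereas the mv used in the script replaces the file in one step.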


A normalized project

    ./convert-to-utf8.sh /var/www/websitepourri/

And there you have it: a clean directory that uses the UTF-8 charset throughout.
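
To check the result, one option is to list the files that still do not decode as UTF-8. This is only a sketch: the function name list_invalid_utf8 is hypothetical, and it relies on iconv exiting with a non-zero status when it meets an invalid byte sequence. Genuinely binary files (images, archives) will usually be listed too.

```shell
# list_invalid_utf8 DIR
# Hypothetical helper: prints every file under DIR (outside .git) whose
# content is not valid UTF-8, one path per line.
list_invalid_utf8() {
    find "$1" -type f ! -path '*/.git/*' -exec sh -c '
        for f do
            # Converting UTF-8 to UTF-8 changes nothing but validates the input.
            iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Usage (the path is only an example):
#   list_invalid_utf8 /var/www/websitepourri/
```

If the only paths printed are binary files, every text file in the tree is valid UTF-8.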

