File Magic Number

Aug 26th, 2014 12:59 pm

起因

最近在做一个图片相关的项目,涉及到图片的处理,在切换到新的图片服务器之后,原始的图片裁剪接口失败,报错的原因是:

[2014-08-25 22:05:45 ERROR XXXX.util.ImageUtil:175] cutJPG is error,input:/home/q/www/crm.fs.qunar.com/mfs-fs/image/2f/fd/new_2ffd34d0edf5dcc7807107dd9e745d2b.jpg
javax.imageio.IIOException: Not a JPEG file: starts with 0x89 0x50
at com.sun.imageio.plugins.jpeg.JPEGImageReader.readImageHeader(Native Method)

很奇怪,错误原因是说new_2ffd34d0edf5dcc7807107dd9e745d2b.jpg这个图片不是jpg格式的图片.

问题

老的裁剪接口使用的是java自带的javax.imageio与java.awt.image包下工具,如ImageReader, ImageWritter,BufferedImage等. 错误发生的地方的代码逻辑:

//根据文件后缀获取Reader
Iterator<ImageReader> it = ImageIO.getImageReadersByFormatName(StringUtil.getFileSuffix(input));
ImageReader reader = it.next();
...
//获取图片读取需要的参数
ImageReadParam param = reader.getDefaultReadParam();
...
//读取图片
BufferedImage bi = reader.read(0, param);

错误显示在最后上面代码的最后一行读取文件的时候失败,更下一步的原因是JPEGImageReader的read方法:

at com.sun.imageio.plugins.jpeg.JPEGImageReader.readImageHeader(Native Method) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.readNativeHeader(JPEGImageReader.java:604) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.checkTablesOnly(JPEGImageReader.java:342) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.gotoImage(JPEGImageReader.java:476) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.readHeader(JPEGImageReader.java:597) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1054) ~[na:1.7.0_45]
at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1034) ~[na:1.7.0_45]

ImageReader有很多的具体实现,包括PNGImageReader, BMPImageReader, GIFImageReader等,这里的原因就是一个png的图片后缀名改为jpg,然后Reader的时候会根据后缀(.jpg)来选择实际的Reader.

那么JPEGImageReader怎么知道这个图片不是一个jpg的图片呢?这就涉及到我们这里讲到的File magic number.

什么File magic number

Magic numbers are the first bits of a file which uniquely identify the type of file. This makes programming easier because complicated file structures need not be searched in order to identify the file type.

简单的说File magic number就是用文件的一些bit位来唯一标志这个文件的类型.

下面是三个图片文件(jpg, png, gif)的头部:

xiaobaoqiu@xiaobaoqiu:~/Picture$ hexdump -C 1.jpg -n 10
00000000  ff d8 ff e0 00 10 4a 46  49 46                    |......JFIF|
0000000a
xiaobaoqiu@xiaobaoqiu:~/Picture$ hexdump -C ticket.png -n 10
00000000  89 50 4e 47 0d 0a 1a 0a  00 00                    |.PNG......|
0000000a
xiaobaoqiu@xiaobaoqiu:~/Picture$ hexdump -C rtree.gif -n 10
00000000  47 49 46 38 39 61 37 02  8b 02                    |GIF89a7...|
0000000a

图片1.jpg中的开始的ff d8表明只是一个JPG文件,随后的ff e0表明这是一种JFIF结构类型(参考:http://blog.csdn.net/kickxxx/article/details/8173332).

GIF文件的头部包括 : (47 49 46 38 39 61)和(47 49 46 38 37 61)这两种;

PNG文件头部 : 89 50 4E 47 0D 0A 1A 0A

典型文件的File magic number

总结一些典型文件的File magic number:

图片

文件类型	典型扩展名	Hex字符	Ascii digits
Bitmap format	.bmp	42 4d	BM
FITS format	.fits	53 49 4d 50 4c 45	SIMPLE
GIF format	.gif	47 49 46 38	GIF8
Graphics Kernel System	.gks	47 4b 53 4d	GKSM
IRIS rgb format	.rgb	01 da	..
ITC (CMU WM) format	.itc	f1 00 40 bb	….
JPEG File Interchange Format	.jpg	ff d8 ff e0	….
NIFF (Navy TIFF)	.nif	49 49 4e 31	IIN1
PM format	.pm	56 49 45 57	VIEW
PNG format	.png	89 50 4e 47	.PNG
Postscript format	.[e]ps	25 21	%!
Sun Rasterfile	.ras	59 a6 6a 95	Y.j.
Targa format	.tga	xx xx xx	…
TIFF format (Motorola - big endian)	.tif	4d 4d 00 2a	MM.*
TIFF format (Intel - little endian)	.tif	49 49 2a 00	II*.
X11 Bitmap format	.xbm	xx xx
XCF Gimp file structure	.xcf	67 69 6d 70 20 78 63 66 20 76	gimp xcf
Xfig format	.fig	23 46 49 47	#FIG
XPM format	.xpm	2f 2a 20 58 50 4d 20 2a 2f	/ XPM /

压缩文件

文件类型	典型扩展名	Hex字符	Ascii digits
Bzip	.bz	42 5a	BZ
Compress	.Z	1f 9d	..
gzip format	.gz	1f 8b	..
pkzip format	.zip	50 4b 03 04	PK..

可执行文件

文件类型	典型扩展名	Hex字符	Ascii digits
MS-DOS, OS/2 or MS Windows		4d 5a	MZ
Unix elf		7f 45 4c 46	.ELF

并查集

Aug 13th, 2014 11:49 pm

简介

并查集是一种非常简单的数据结构，它主要涉及两个基本操作，分别为：

1. 合并两个不相交集合;
2. 判断两个元素是否属于同一个集合;

实现

实现上,通常用一个数组实现,如100个元素可以用大小为100的数组, 数组的内容存储节点的父亲节点的下标.

1.朴素实现

public class UnionFind extends AbstractUnionFind {

    /**
     * ctor
     * @param size
     */
    public UnionFind(int size) {
        assert(size > 0);

        value = new int[size];
        init();
    }

    /**
     * 初始化数据
     */
    private void init() {
        assert (value != null && value.length > 0);
        for (int i = 0; i < value.length; i++) {
            value[i] = DEFAULT_PARENT;  //DEFAULT_PARENT为-1
        }
    }

    /**
     * 合并根节点为rx 和 根节点为ry的树
     * 这里rx 和 ry为下标
     *
     * @param rx
     * @param ry
     */
    @Override
    public void Union(int rx, int ry) {
        assert (rx >= 0 && rx < value.length);
        assert (ry >= 0 && ry < value.length);

        //找到两个节点的根节点
        int px = Find(rx);
        int py = Find(ry);

        //将px设置为py的子树
        value[px] = py;
    }

    /**
     * 查找节点x的根节点
     *
     * @param x
     * @return
     */
    @Override
    public int Find(int x) {
        while(isRoot(x) == false) {
            x = value[x];
        }

        return x;
    }

    /**
     * 判断节点是否为根节点
     *
     * @param rx
     * @return
     */
    private boolean isRoot(int rx) {
        assert (rx >= 0 && rx < value.length);

        return value[rx] == DEFAULT_PARENT; //DEFAULT_PARENT为-1
    }
}

Find的时间复杂度和树的高度成正比,最坏的情况下为所有节点形成一个链表, 这时候的复杂度为O(N);

Unoin时间需要调用Find,所以Union的最坏时间复杂度也为O(N);

2.优化1:根据树高度Union

朴素实现的容易出现所有节点(或者很多节点)形成一个链表,导致树的高度很大,进而导致Find的效率低.

我们可以简单的改变Union的策略,将小数(高度小)合并作为大树(高度大)的子树,高度也称之为树的秩.

实现这个策略,通常实现都是使用额外的一个数组来保存树的秩,在<数据结构与算法分析:C语言描述>这本书给出了一个巧妙的实现,我们value[rx]之前是用-1表示其为跟节点,这个-1完全是无意义的,我们可以用这个空间来保存树的高度,同时为了避免和下表混淆,我们保存高度的负值.

实现和朴实并查集的不同的地方:

/**
 * 合并根节点为rx 和 根节点为ry的树
 * 这里rx 和 ry为下标
 *
 * @param rx
 * @param ry
 */
@Override
public void Union(int rx, int ry) {
    assert (rx >= 0 && rx < value.length);
    assert (ry >= 0 && ry < value.length);

    //找到两个节点的根节点
    int px = Find(rx);
    int py = Find(ry);

    // 将px设置为py的子树
    if (value[px] < value[py]) { // px为根节点的树高度更大, 应该将py树作为px的子树
        value[py] = px;
    } else if (value[py] < value[px]) {
        value[px] = py;
    } else { // 高度相同, 最终树的高度+1
        value[py] = px;
        value[px]--; // 数的高度+1
    }
}

/**
 * 判断节点是否为根节点
 *
 * @param rx
 * @return
 */
private boolean isRoot(int rx) {
    assert (rx >= 0 && rx < value.length);

    return value[rx] < 0;   //小于0的都表示是跟节点
}

根据树高度Union基本能够使得树的高度维持在log(n)的级别,因此Find和Union的时间复杂度均为log(n);

3.优化2:路径压缩

路径压缩是指,我们在Find的过程中,从最底层节点依次遍历到根节点,期间我们可以将路径上的所以节点的父亲节点设置为根节点.

和2的不同之处:

```java /* * 查找节点x的根节点 * 从最底端往上便利的时候, 可以顺带更改每个节点的父亲节点为跟节点, 使得树的高度降低 * * @param x * @return / @Override public int Find(int x) { while(value[x] >= 0) { //非根节点 value[x] = Find(value[x]); //设置当前节点的父亲节点为根节点 }

return x;

} 不用担心递归的深度,因为从最开始就执行路径压缩的话,我们可以认为数的高度一直是1.

这是的树的高度基本为1,因此Unoin和Find的时间复杂度都接近O(1);

典型应用

并查集是高效简洁数据结构的典型代表.其典型应用包括:

无向图的连通分量个数

Kruskar算法求最小生成树

二叉堆

Jul 18th, 2014 6:03 pm

二叉堆(小顶堆)保证孩子节点大于等于父亲节点,同时它又具有完全二叉树(除了底层,其他层是满的)的性质.

一个小顶堆如下:

    1
  /   \
 2     3
/ \   / \
   4   5  6  7
  / \ / \
 8  9 10 11    

二叉堆是专门为取出最大或最小节点而设计点数据结构，这种数据结构在查找一般元素方面性能和一般数组是没有多大区别的。

二叉堆在取出最大或最最小值的性能表现是O(1),取出操作完成之后，二叉堆需要一次整形操作，以便得到下一个最值，这个操作复杂度O(logn)。

但是二叉堆也有一个缺点，就是二叉堆对存储在内存中的数据操作太过分散，这导致了二叉堆在cpu高速缓存的利用与内存击中率上面表现不是很好，这也是一个二叉堆理想操作时间所需要付出的代价。

二叉堆用用数组实现很方便,比较经典的讲解见<算法导论>,这里以小顶堆为例子. 数组实现时候的一些属性,便于代码实现: (1).父亲节点的下标为i,则左孩子节点下标2*i+1, 右孩子节点下标2*i+2;同理,下标为i的节点的父亲节点下标为(i-1)/2. (2).n个节点的二叉堆,非叶子节点的数目为floor(n/2),如7个节点,则有3个非叶子节点. (3).高度为h的完全二叉树,包含2^h到2^h+1-1个节点,即n个节点的二叉堆,高度为logn;

实现

二叉堆的实现主要涉及到两个操作,siftUp和siftDown,siftUp表示将第i个位置的元素向上调整,保证到达根节点的路径上都满足二叉堆的性质;siftDown将第i个位置的元素向下调整,保证i为根节点的子树为一个二叉堆.

下面是我的一个简单实现:

import java.util.Arrays;

/**
 * 二叉堆 这里数据类型为Integer, 且堆为小顶堆
 * 几个重要的性质(具有完全二叉树的性质):
 * (1).父亲节点的下标为i,则左孩子节点下标2*i+1, 右孩子节点下标2*i+2;同理,下标为i的节点的父亲节点下标为(i-1)/2;
 * (2).n个节点的二叉堆,非叶子节点的数目为floor(n/2),如7个节点,则有3个非叶子节点;
 * (3).高度为h的完全二叉树,包含2^h到2^(h+1)-1个节点,即n个节点的二叉堆,高度为logn
 * 
 * @Author: baoqiu.xiao@qunar.com Date: 14-7-17 Time: 下午3:45
 * @Version: \$Id$
 */
public class PQueue {

    /**
     * 默认大小
     */
    private static final int DEFAULT_CAPACITY = 16;

    /**
     * 数据
     */
    private transient Integer[] queue;

    /**
     * 大小
     */
    private int size = 0;

    /**
     * ctor
     */
    public PQueue() {
        this(DEFAULT_CAPACITY);
    }

    public PQueue(int initialCapacity) {
        if (initialCapacity < 1)
            throw new IllegalArgumentException();
        this.queue = new Integer[initialCapacity];
    }

    public PQueue(Integer[] array) {
        this(array.length);
        queue = Arrays.copyOf(array, array.length);
        size = array.length;

        heapify();
    }

    /**
     * 堆内元素个数, O(1)
     * 
     * @return
     */
    public int size() {
        return size;
    }

    public boolean isEmpty() {
        return size == 0;
    }

    /**
     * 取堆顶元素(这里为最小元素),但不移除堆顶元素, O(1)
     * 
     * @return
     */
    public Integer peek() {
        if (size == 0)
            return null;

        return queue[0];
    }

    /**
     * 弹出堆顶元素并返回, O(logn)
     * 
     * @return
     */
    public Integer pop() {
        if (size == 0)
            return null;

        Integer result = queue[0];
        queue[0] = queue[--size]; // 数组尾部元素放到堆顶
        queue[size] = null;
        if (size > 0) {
            shiftDown(0);
        }

        return result;
    }

    /**
     * 往堆里面插入一个元素, O(logn)
     * 
     * @param newValue
     */
    public boolean add(Integer newValue) {
        if (newValue == null)
            throw new NullPointerException();

        // 需要增长内存
        int oldSize = size;
        if (oldSize >= queue.length) {
            grow(oldSize + 1);
        }

        size = oldSize + 1;
        queue[oldSize] = newValue;
        if (oldSize > 0) {
            shiftUp(oldSize);
        }

        return true;
    }

    /**
     * 内存增长,保证能容纳 minCapacity 个元素
     * 
     * @param minCapacity
     */
    private void grow(int minCapacity) {
        if (minCapacity < 0)
            throw new OutOfMemoryError();

        // 这里成倍增长
        int oldCapacity = queue.length;
        int newCapacity = oldCapacity << 1;

        // 校验
        if (newCapacity < 0)
            newCapacity = Integer.MAX_VALUE;
        if (newCapacity < minCapacity)
            newCapacity = minCapacity;

        // 数据拷贝
        queue = Arrays.copyOf(this.queue, newCapacity);
    }

    /**
     * 将数组最小堆化 从最后一个非叶子节点,依次往前调整,使得每个子树都满足堆的性质
     * 根据下标为i的节点的父亲节点下标为(i-1)/2, 最大下标为size-1, 因此非根节点的最大下标为(size-2)/2, 即size/2 -1
     *
     * 直观是O(nlogn), 记得算法导论分析的均摊时间应该是O(n), 忘了怎么分析的
     */
    private void heapify() {
        for (int i = (size >>> 1) - 1; i >= 0; i--)
            shiftDown(i);
    }

    /**
     * 向下调整下标为index的节点为根节点的子树, 使其满足堆的性质
     * 父亲节点的下标为i,则左孩子节点下标2*i+1, 右孩子节点下标2*i+2
     * O(logn)
     * 
     * @param index
     */
    private void shiftDown(int index) {
        //到达叶子节点了
        if(index > ((size >>> 1) - 1) )
            return;

        //左右孩子中较小的节点candidate
        int left = (index << 1) + 1, right = left + 1;
        int candidate = left;
        if(right < size && queue[right] < queue[left]){
            candidate = right;
        }

        //父节点大于子节点, 则交换父子节点, 并递归调整candide为根节点的子树
        if (queue[index] > queue[candidate]) {
            swap(index, candidate);
            shiftDown(candidate);
        }
    }

    /**
     * 向上调整下标为index的节点的数据, 保证其到根节点的路径上都满足堆的性质
     * 节点下标为i, 则父节点下标为(i-1)/2
     * O(logn)
     *
     * @param index
     */
    private void shiftUp(int index) {
        //达到根节点了
        if(index <= 0)
            return;

        int parent = (index-1)>>>1;

        //父亲节点 > 当前节点,需要交换
        if(parent >= 0 && queue[parent] > queue[index]) {
            swap(parent, index);
            shiftUp(parent);
        }
    }

    /**
     * 交换数组中 i 和 j 的元素
     * 
     * @param i
     * @param j
     */
    private void swap(int i, int j) {
        Integer tmp = queue[i];
        queue[i] = queue[j];
        queue[j] = tmp;
    }

    public static void main(String[] args) {
        PQueue queue = new PQueue();
        queue.add(6);
        queue.add(8);
        queue.add(2);
        queue.add(4);
        queue.add(5);
        queue.add(7);
        queue.add(1);

        while (!queue.isEmpty()) {
            System.out.println(queue.pop());    //1 2 4 5 6 7 8
        }
    }
}

Java Map

Jul 16th, 2014 10:45 pm

AbstractMap

Map的内部实现,key用一个set实现, value用一个Collection

transient volatile Set<K>        keySet = null;
transient volatile Collection<V> values = null;

HashMap

内部就一个Entry数组,有一个装载因子的概念

//默认容量
static final int DEFAULT_INITIAL_CAPACITY = 16;

//默认装载因子
static final float DEFAULT_LOAD_FACTOR = 0.75f;

//长度必须为2的幂次方
transient Entry[] table;

//阈值, = capacity * load factor, 当size达到这个值的时候需要搬家
int threshold;

//装载因子
final float loadFactor;

而Entry对象包含key,value,当前节点的hash值以及一下个Entry对象的指针

static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;
        V value;
        Entry<K,V> next;
        final int hash;
}

hash函数及下标:

//hash函数, h为key.hashCode()值
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

//通过hash()的返回值,获取在桶中的下标
static int indexFor(int h, int length) {
    return h & (length-1);
}

HashMap在多线程环境下会出现的死循环问题,详细. 有人把这个问题报给了Sun，不过Sun不认为这个是一个问题。因为HashMap本来就不支持并发。要并发就用ConcurrentHashmap ,详见.由此也可见,HashMap不是线程安全的.

HashTable

HashTable不允许value为null, 而HashMap允许.

默认Capacity为11, 负载因子0.75

public Hashtable() {
    this(11, 0.75f);
}

通过key找到其在桶中下标

int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % tab.length;

HashTable线程安全的,是通过每个方法都加上synchronized关键字.

public synchronized V get(Object key) {
    Entry tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<K,V> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
        return e.value;
        }
    }
    return null;
}

public synchronized V put(K key, V value) {
    ...
}

IdentityHashMap

在IdentityHashMap中，判断两个键值k1和 k2相等的条件是 k1 == k2 。在正常的Map 实现（如 HashMap）中，当且仅当满足下列条件时才认为两个键 k1 和 k2 相等：(k1==null ? k2==null : e1.equals(e2))。如下:

public boolean containsKey(Object key) {
    Object k = maskNull(key);
    Object[] tab = table;
    int len = tab.length;
    int i = hash(k, len);
    while (true) {
        Object item = tab[i];
        if (item == k)  //判断键值相等
            return true;
        if (item == null)
            return false;
        i = nextKeyIndex(i, len);
    }
}

简单示例:

IdentityHashMap<String,Object> map =newIdentityHashMap<String,Object>();

//这两个entry会同时存在, 因为两次new String("xx")地址不一样
map.put(new String("xx"),"first");
map.put(new String("xx"),"second");

Java Queue

Jul 16th, 2014 9:08 pm

ArrayDeque

双端队列Deque的一个实现

数组实现

数组实现,用下标head和tail表示头尾.

public class ArrayDeque<E> extends AbstractCollection<E>
                           implements Deque<E>, Cloneable, Serializable
{
    private transient E[] elements;

    private transient int head;

    private transient int tail;
}

默认大小16

public ArrayDeque() {
        elements = (E[]) new Object[16];
}

保证容量的策略,始终保证容量大小为2的幂次方.

private void allocateElements(int numElements) {
        int initialCapacity = MIN_INITIAL_CAPACITY; //MIN_INITIAL_CAPACITY = 8
        // 能容纳numElements个元素的最小的2的幂次方
        if (numElements >= initialCapacity) {
            initialCapacity = numElements;
            initialCapacity |= (initialCapacity >>>  1);
            initialCapacity |= (initialCapacity >>>  2);
            initialCapacity |= (initialCapacity >>>  4);
            initialCapacity |= (initialCapacity >>>  8);
            initialCapacity |= (initialCapacity >>> 16);
            initialCapacity++;

            if (initialCapacity < 0)   // Too many elements, must back off
                initialCapacity >>>= 1;// Good luck allocating 2 ^ 30 elements
        }
        elements = (E[]) new Object[initialCapacity];
    }

扩容策略

double扩容

private void doubleCapacity() {
        assert head == tail;
        int p = head;
        int n = elements.length;    // 当前容量大小
        int r = n - p;              // p右侧元素个数
        int newCapacity = n << 1;   // 新容量大小
        if (newCapacity < 0)
            throw new IllegalStateException("Sorry, deque too big");
        Object[] a = new Object[newCapacity];
        System.arraycopy(elements, p, a, 0, r); // 数据拷贝
        System.arraycopy(elements, 0, a, r, p);
        elements = (E[])a;
        head = 0;
        tail = n;
    }

PriorityQueue

优先级队列的一个实现数组实现,默认大小11,带比较器.优先级队列不允许 null 元素.

public class PriorityQueue<E> extends AbstractQueue<E>
    implements java.io.Serializable {
    private transient Object[] queue;

    private final Comparator<? super E> comparator; //比较器

    public PriorityQueue() {
        this(DEFAULT_INITIAL_CAPACITY, null);   //DEFAULT_INITIAL_CAPACITY = 11
    }
}

实现原理

PriorityQueue使用的是二叉堆实现的,用数组实现很方便.这里的实现和<算法导论>里面讲的基本一致.这里以小顶堆为例子.这个实现是非线程安全的.

二叉堆保证孩子节点大于等于父亲节点.同时它又具有完全二叉树(除了底层,其他层是满的)的性质.如下:

    1
  /   \
 2     3
/ \   / \
   4   5  6  7
  / \ / \
 8  9 10 11